



soundex

Calculate the soundex key of a string (PHP 4, PHP 5)
string soundex ( string str )

Example 2431. Soundex Examples

<?php
soundex("Euler")       == soundex("Ellery");    // E460
soundex("Gauss")       == soundex("Ghosh");     // G200
soundex("Hilbert")     == soundex("Heilbronn"); // H416
soundex("Knuth")       == soundex("Kant");      // K530
soundex("Lloyd")       == soundex("Ladd");      // L300
soundex("Lukasiewicz") == soundex("Lissajous"); // L222
?>

Code Examples / Notes » soundex

bb79

To search for words like Clansy and Klansy, just reverse the strings:
 $s1 = "Clansy";
 $s2 = "Klansy";
 if((soundex($s1) == soundex($s2)) ||
    (soundex(strrev($s1)) == soundex(strrev($s2))))
   echo("Match");


mail

The soundex 'different letter in front' problem can be solved by using levenshtein() on the soundex codes. In my application, which searches a database of album names for entries that match a user-provided string, I do the following:
1. Search the database for the exact name
2. Search the database for entries where the name occurs anywhere as a substring
3. Search the database for entries where any of the words in the name (if the user has typed in more than one word) is present, except for little words (and, the, of etc.)
4. Then, if all this fails, I go to plan B:
- calculate the Levenshtein distance (levenshtein()) between the user's search term and each of the entries in the database, as a percentage of the length of the search term
- calculate the Levenshtein distance between the metaphone codes of the search term and of each field in the database, as a percentage of the length of the metaphone code of the search term
- calculate the Levenshtein distance between the soundex codes of the search term and of each field in the database, as a percentage of the length of the soundex code of the search term
If any of these percentages is less than 50 (which means that two soundex codes with different first letters will be accepted!) then the entry is accepted as a possible match.
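
The "plan B" step above could be sketched roughly as follows. This is only an illustration: the function names distance_percent() and fuzzy_match() are made up for this example, and the 50 percent threshold is simply the value quoted in the note.

```php
<?php
// Levenshtein distance between two strings, expressed as a percentage of the
// length of the first one (the user's search term). Hypothetical helper name.
function distance_percent(string $term, string $candidate): float {
    $len = strlen($term);
    if ($len === 0) {
        return 100.0; // nothing to compare against, treat as no match
    }
    return levenshtein($term, $candidate) / $len * 100;
}

// Accept $candidate when the raw strings, their metaphone codes or their
// soundex codes differ by less than $threshold percent. Hypothetical helper.
function fuzzy_match(string $term, string $candidate, float $threshold = 50.0): bool {
    $term = strtolower($term);
    $candidate = strtolower($candidate);
    return distance_percent($term, $candidate) < $threshold
        || distance_percent(metaphone($term), metaphone($candidate)) < $threshold
        || distance_percent(soundex($term), soundex($candidate)) < $threshold;
}
```

Because a soundex code is always four characters long, the soundex percentage can only be 0, 25, 50, 75 or 100, so a threshold of 50 accepts codes that differ in at most one position, such as C452 and K452.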


shortcut

The answer to whether soundex works except for the first letter in klancy vs clancy is to always prefix words with the same letter.
aklancy will match aclancy
bklancy will match bclancy
soundex only keeps the first letter plus three digits, so only the beginning of a long word is compared.
ie:  spectacular matches spectacle (both give S123)
just a thought if you rely on soundex.
k-
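
A minimal sketch of the prefix idea, assuming a helper called soundex_prefixed_equal() (the name and the default prefix 'A' are invented for this example):

```php
<?php
// Compare two words by soundex after prepending the same letter to both, so
// that a differing first letter is encoded as a digit like any other consonant.
// The prefix is an arbitrary choice made for this sketch.
function soundex_prefixed_equal(string $a, string $b, string $prefix = 'A'): bool {
    return soundex($prefix . $a) === soundex($prefix . $b);
}
```

With this, soundex_prefixed_equal("Klancy", "Clancy") is true, because both prefixed words encode to A245. The trade-off is that the prefix uses up the letter slot, so the original first letter now takes one of the three digits and one consonant fewer from the rest of the word is encoded.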


cap

soundex() unfortunately is very sensitive about the first character. It is not possible to use it and have Clansy and Klansy return the same value. If you want to do a phonetic search on such names you will still need to write a routine to evaluate C452 as being similar to K452.

04-oct-2005 08:25

Since the first letter is included in the output, it is worth pointing out that if you want a soundex key that avoids the problem of klansy and clansy sounding different, you can take the substring after the first letter: the first letter is kept as a literal constant of the word, while the numerical part represents the phonetic structure of the word.
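
As a rough illustration of that suggestion (the helper name soundex_digits() is an assumption for this example):

```php
<?php
// Return only the three-digit part of the soundex key, dropping the literal
// first letter. Hypothetical helper for illustration.
function soundex_digits(string $word): string {
    return substr(soundex($word), 1);
}
```

Here soundex_digits("Clansy") and soundex_digits("Klansy") both give "452". Note that this ignores the first letter entirely, so it is likely to produce more false matches than a full soundex comparison.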

crchafer-php

Rewritten, maybe -- but the algorithm has some obvious
optimisations which can be done, for example...
       function text__soundex( $text ) {
               $k = ' 123 12  22455 12623 1 2 2';
               $nl = strlen( $tN = strtoupper( $text ) );
               $p = trim( $k[ ord( $tS = $tN[0] ) - 65 ] );
               for( $n = 1; $n < $nl; ++$n )
                       if( ( $l = trim( $k[ ord( $tN[ $n ] ) - 65 ] ) ) != $p )
                               $tS .= ( $p = $l );
               return substr( $tS . '000', 0, 4 );
       }
// Notes:
// $k is the $key, essentially $SoundKey inverted
// $tN is the uppercase of the text to be optimised
// $tS is the partially generated output
// $l is the current letter, $p the previous
// $n and $nl are iteration indices
// 65 is ord('A'), precalculated for speed
// non-ASCII letters are not supported
// string offsets use [] here, because the {} offset syntax was removed in PHP 8
// watch the brackets, quite a mixture here
(Code has suffered only basic tests, though it appears to
match the output of PHP's soundex(), speed untested --
though this should be /much/ faster than a4_perfect's
rewrite due to the removal of most loops and compares.)
C
2005-09-13


info

MySQL soundex (3.23.49) doesn't examine the first character at all to see whether it should be skipped. Therefore the Dutch name of The Hague, the country's government seat, 's-Gravenhage will give a soundex value of '261 in MySQL and S615 in PHP.

bestworldweb

MySQL seems to use 0 for the letter V instead of 1, which is what PHP (and most soundex functions) use.

php.net

I made the Soundex in JavaScript.
http://www.vanderharg.nl/soundex.php
An explanation of the algorithm is on the above page.
It returns two values if a name has "van der" or something similar in it: one with the prefix included in the Soundex test and one without.
The use of regular expressions makes the actual soundex algorithm short. I removed two conditions of the algorithm because they are redundant in this implementation.
<script language="javascript">
var koppelteken = ""; // hyphen separator; can also be "-".
var vv=new Array(
"de la ",
"in het ",
"in 't ",
"op den ",
"op het ",
"op de ",
"op te ",
"op 't ",
"up te",
"uit de ",
"van den ",
"van der ",
"van het ",
"van de ",
"van 't ",
"opte ",
"upte ",
"con ",
"den ",
"der ",
"ten ",
"ter ",
"van ",
"de ",
"di ",
"du ",
"la ",
"le ",
"te ",
"vd ",
"l' ",
"l'",
"'t ");
function removePrefix(name)
{
i=0;
var strippedresult = "";
while ((strippedresult=="")&&(i<vv.length))
{
if (name.substr(0,vv[i].length)==vv[i].toUpperCase())
strippedresult = name.substr(vv[i].length);
i++;
}
return strippedresult;
}
function Soundex(name)
{
if (name.length>1)
{
// convert to upper case.
name=name.toUpperCase();

// convert accented characters
re = new RegExp ('[ÄÀÁÃ]+', 'gi');
name = name.replace(re, 'A');
re = new RegExp ('[ËÈÉ]+', 'gi');
name = name.replace(re, 'E');
re = new RegExp ('[ÏÌÍ]+', 'gi');
name = name.replace(re, 'I');
re = new RegExp ('[ÖÒÓÕ]+', 'gi');
name = name.replace(re, 'O');
re = new RegExp ('[ÜÙÚ]+', 'gi');
name = name.replace(re, 'U');
re = new RegExp ('[Ý]+', 'gi');
name = name.replace(re, 'Y');
// handle prefixes
var result=Soundex(removePrefix(name));
if (result!="") result+="\n";

// first letter
result += name.substr(0,1);

// encode the rest as digits
name=name.substr(1);
// remove A, E, I, O, U, H, W and Y.
re = new RegExp ('[AEIOUHWY ]+', 'gi');
name = name.replace(re, '');

// Convert to digits using the table.
// Collapse repeated digits.
re = new RegExp ('[BFPV]+', 'gi');
name = name.replace(re, '1');
re = new RegExp ('[ÇCGJKQSXZ]+', 'gi');
name = name.replace(re, '2');
re = new RegExp ('[DT]+', 'gi');
name = name.replace(re, '3');
re = new RegExp ('[L]+', 'gi');
name = name.replace(re, '4');
re = new RegExp ('[MN]+', 'gi');
name = name.replace(re, '5');
re = new RegExp ('[R]+', 'gi');
name = name.replace(re, '6');
// pad to 3 digits.
while (name.length<3)
name += "0";
// truncate to 3 digits
if (name.length>3)
name=name.substr(0,3);

return result+koppelteken+name;
}
else return "";
}
</script>


administrator

I wrote this function a long time ago in CGI-perl and then translated (if you can call it that) into PHP.  A little clunky to say the least, but should handle true soundex specs 100%:
// ---begin code---
function MakeSoundEx($stringtomakesoundexof)
{
$temp_Name = $stringtomakesoundexof;
$SoundKey1 = "BPFV";
$SoundKey2 = "CSKGJQXZ";
$SoundKey3 = "DT";
$SoundKey4 = "L";
$SoundKey5 = "MN";
$SoundKey6 = "R";
$SoundKey7 = "AEHIOUWY";
       $temp_Name = strtoupper($temp_Name);
$temp_Last = "";
$temp_Soundex = substr($temp_Name, 0, 1);
$n = 1;
for ($i = 0; $i < strlen($SoundKey1); $i++)
{
        if ($temp_Soundex == substr($SoundKey1, $i - 1, 1))
{
$temp_Last = "1";
       }
}
for ($i = 0; $i < strlen($SoundKey2); $i++)
{
       if ($temp_Soundex == substr($SoundKey2, $i - 1, 1))
{
$temp_Last = "2";
       }
}
for ($i = 0; $i < strlen($SoundKey3); $i++)
{
        if ($temp_Soundex == substr($SoundKey3, $i - 1, 1))
{
$temp_Last = "3";
        }
}
for ($i = 0; $i < strlen($SoundKey4); $i++)
{
       if ($temp_Soundex == substr($SoundKey4, $i - 1, 1))
{
$temp_Last = "4";
       }
}
for ($i = 0; $i < strlen($SoundKey5); $i++)
{
        if ($temp_Soundex == substr($SoundKey5, $i - 1, 1))
{
$temp_Last = "5";
       }
}
for ($i = 0; $i < strlen($SoundKey6); $i++)
{
       if ($temp_Soundex == substr($SoundKey6, $i - 1, 1))
{
$temp_Last = "6";
       }
}
// Note: this loop re-checks $SoundKey6; $SoundKey7 (the vowel group) may have been intended.
for ($i = 0; $i < strlen($SoundKey6); $i++)
{
        if ($temp_Soundex == substr($SoundKey6, $i - 1, 1))
{
$temp_Last = "";
       }
}
for ($n = 1; $n < strlen($temp_Name); $n++)
{
if (strlen($temp_Soundex) < 4)
{
for ($i = 0; $i < strlen($SoundKey1); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey1, $i - 1, 1) && $temp_Last != "1")
{
$temp_Soundex = $temp_Soundex."1";
$temp_Last = "1";
}
}
for ($i = 0; $i < strlen($SoundKey2); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey2, $i - 1, 1) && $temp_Last != "2")
{
$temp_Soundex = $temp_Soundex."2";
$temp_Last = "2";
}
}
for ($i = 0; $i < strlen($SoundKey3); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey3, $i - 1, 1) && $temp_Last != "3")
{
$temp_Soundex = $temp_Soundex."3";
$temp_Last = "3";
}
}
for ($i = 0; $i < strlen($SoundKey4); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey4, $i - 1, 1) && $temp_Last != "4")
{
$temp_Soundex = $temp_Soundex."4";
$temp_Last = "4";
}
}
for ($i = 0; $i < strlen($SoundKey5); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey5, $i - 1, 1) && $temp_Last != "5")
{
$temp_Soundex = $temp_Soundex."5";
$temp_Last = "5";
}
}
for ($i = 0; $i < strlen($SoundKey6); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey6, $i - 1, 1) && $temp_Last != "6")
{
$temp_Soundex = $temp_Soundex."6";
$temp_Last = "6";
}
}
for ($i = 0; $i < strlen($SoundKey7); $i++)
{
if (substr($temp_Name, $n - 1, 1) == substr($SoundKey7, $i - 1, 1))
{
$temp_Last = "";
}
}
}
}
while (strlen($temp_Soundex) < 4)
{
$temp_Soundex = $temp_Soundex."0";
}
return $temp_Soundex;
}
// --- end code---


justin

I originally looked at soundex() because I wanted to compare how individual letters sounded, so that when pronouncing a string of generated characters it would be easy to distinguish them from each other (ie, TGDE is hard to distinguish, whereas RFQA is easier to understand). The goal was to generate IDs that could be easily understood with a high degree of accuracy over a radio of varying quality. I quickly figured out that soundex and metaphone wouldn't do this (they work for words), so I wrote the following to help out. The ID generation function iteratively calls chrSoundAlike() to compare each new character with the preceding characters. I'd be interested in receiving any feedback on this. Thanks.
<?php
function chrSoundAlike($char1, $char2, $opts = FALSE) {
$char1 = strtoupper($char1);
$char2 = strtoupper($char2);
$opts  = strtoupper($opts);
// Setup the sets of characters that sound alike.
// (Options: include numbers, include W, include both, or default is none of those.)
switch ($opts) {
case 'NUMBERS':
$sets = array(0 => array('A', 'J', 'K'),
         1 => array('B', 'C', 'D', 'E', 'G', 'P', 'T', 'V', 'Z', '3'),
         2 => array('F', 'S', 'X'),
         3 => array('I', 'Y'),
         4 => array('M', 'N'),
         5 => array('Q', 'U', 'W'));
break;
case 'STRICT':
$sets = array(0 => array('A', 'J', 'K'),
         1 => array('B', 'C', 'D', 'E', 'G', 'P', 'T', 'V', 'Z'),
         2 => array('F', 'S', 'X'),
         3 => array('I', 'Y'),
         4 => array('M', 'N'),
         5 => array('Q', 'U', 'W'));
break;

case 'BOTH':
$sets = array(0 => array('A', 'J', 'K'),
         1 => array('B', 'C', 'D', 'E', 'G', 'P', 'T', 'V', 'Z', '3'),
         2 => array('F', 'S', 'X'),
         3 => array('I', 'Y'),
         4 => array('M', 'N'),
         5 => array('Q', 'U', 'W'));
break;
default:
$sets = array(0 => array('A', 'J', 'K'),
         1 => array('B', 'C', 'D', 'E', 'G', 'P', 'T', 'V', 'Z'),
         2 => array('F', 'S', 'X'),
         3 => array('I', 'Y'),
         4 => array('M', 'N'),
         5 => array('Q', 'U'));
break;
}

// See if $char1 is in a set.
$matchset = array();
for ($i = 0; $i < count($sets); $i++) {
if (in_array($char1, $sets[$i])) {
$matchset = $sets[$i];
}
}
// If char2 is in the same set as char1, or if char1 and char2 are the same, then return true.
if (in_array($char2, $matchset) OR $char1 == $char2) {
return TRUE;
} else {
return FALSE;
}
}
?>
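
The ID generation function itself is not shown in the note. One possible shape for it is sketched below; generate_radio_id() and sounds_alike() are hypothetical names, and sounds_alike() is a simplified stand-in for chrSoundAlike() that uses the same sets as its default option.

```php
<?php
// Simplified stand-in for chrSoundAlike(): true when two uppercase letters are
// identical or fall in the same sound-alike set (default sets from the note).
function sounds_alike(string $a, string $b): bool {
    if ($a === $b) {
        return true;
    }
    foreach (['AJK', 'BCDEGPTVZ', 'FSX', 'IY', 'MN', 'QU'] as $set) {
        if (strpos($set, $a) !== false && strpos($set, $b) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical generator: draw random letters and reject any candidate that
// sounds like a letter already placed in the ID.
function generate_radio_id(int $length = 4): string {
    $id = '';
    for ($attempts = 0; strlen($id) < $length; $attempts++) {
        if ($attempts > 10000) {
            // Guard against lengths that cannot be satisfied.
            throw new RuntimeException('Could not build an ID of that length');
        }
        $candidate = chr(random_int(ord('A'), ord('Z')));
        $clash = false;
        for ($i = 0; $i < strlen($id); $i++) {
            if (sounds_alike($candidate, $id[$i])) {
                $clash = true;
                break;
            }
        }
        if (!$clash) {
            $id .= $candidate;
        }
    }
    return $id;
}
```

With these sets there are only 11 mutually distinct sound classes (the six sets plus H, L, O, R and W), so this sketch cannot produce IDs longer than 11 characters; the attempt limit turns that case into an exception instead of an endless loop.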


fred

I am finding a peculiarity in using the Soundex function in PHP (4.2.2) against the one in MySQL (3.23.39).  For instance, a soundex on the word "purchased" in PHP results in 'P622'.  The same word using the MySQL Soundex function results in 'P623'.  The word 'five' using PHP results in 'F100' while MySQL results in 'F000'.  I have only caught a couple that are different so far - don't know if it's a bug in either PHP or MySQL or something that can be adjusted/accounted for in code.
Fred Dirkse
OIC Group, Inc.
www.oicgroup.net


pee whitt

fie at myrealbox dot com-
regarding your soundex syllable request- i think counting vowel clusters in the word will result in an accurate count of syllables.  so no soundex feature is necessary, just count through the chars in the word, and every time you run from vowel to consonant, increment the syllable count.
using this logic, this sentence is categorized as follows.
2 1 2 1 1 (3) (0) (4) (0) 2
where (#) marks a word that is incorrectly categorized.  i'm sure with a little thinking one could figure out the logic in those cases that would result in an accurate count.  counting changes from vowel to consonant would yield-
(1) 1 2 1 2 1 (4) 1 2
taking the average and then the ceiling of the two types would fix most of the errors.
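
A small sketch of the two counts and their combination, under the assumption that only a, e, i, o and u count as vowels (y is treated as a consonant here). The function names are invented for this example.

```php
<?php
// Count runs of consecutive vowels (a, e, i, o, u) in a word.
function vowel_clusters(string $word): int {
    return (int) preg_match_all('/[aeiou]+/i', $word);
}

// Count places where a vowel is directly followed by a consonant letter.
function vowel_to_consonant(string $word): int {
    return (int) preg_match_all('/[aeiou][b-df-hj-np-tv-z]/i', $word);
}

// Average the two counts and round up, as suggested in the note.
function estimate_syllables(string $word): int {
    return (int) ceil((vowel_clusters($word) + vowel_to_consonant($word)) / 2);
}
```

For "reset" both counts are 2, and for "rest" both are 1. For "sentence" the cluster count is 3 and the transition count is 2, so the combined estimate rounds up to 3; this is a heuristic and will still miscount some words.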


a4_perfect

Even when rewritten, the function by [administrator at zinious dot com] is approximately 30 times slower than soundex():
<?php
function MakeSoundEx($stringtomakesoundexof)
{
 $temp_Name = strtoupper($stringtomakesoundexof);
 $SoundKey = array(1=>"BPFV", "CSKGJQXZ", "DT", "L", "MN", "R", "AEHIOUWY");
 $temp_Last = "";
 $temp_Soundex = substr($temp_Name, 0, 1);
 for ($x = 1; $x <= sizeof($SoundKey); $x++)
   for ($i = 0; $i < strlen($SoundKey[$x]); $i++)
     if ($temp_Soundex == substr($SoundKey[$x], $i - 1, 1))
       $temp_Last = (string)($x==7?"":$x);
 for ($n = 1; $n < strlen($temp_Name); $n++)
   if (strlen($temp_Soundex) < 4)
   {
     for ($x = 1; $x <= sizeof($SoundKey); $x++)
       for ($i = 0; $i < strlen($SoundKey[$x]); $i++)
         if (substr($temp_Name, $n-1, 1)==substr($SoundKey[$x], $i-1, 1))
         {
           if($x<7 && $temp_Last!=(string)$x)
             $temp_Soundex = $temp_Soundex.$x;
           $temp_Last = (string)($x);
         }
     }
 return $temp_Soundex . str_repeat("0", 4-strlen($temp_Soundex));
}
?>


fie

eek... hosting got taken down on that server.. here's the code for the previous
function cg_sylc($nos){
 $nos = strtoupper($nos);
 $syllables = 0;
 $before = strlen($nos);
 $nos = str_replace(array('AA','AE','AI','AO','AU',
 'EA','EE','EI','EO','EU','IA','IE','II','IO',
 'IU','OA','OE','OI','OO','OU','UA','UE',
 'UI','UO','UU'), "", $nos);
 $after = strlen($nos);
 $diference = $before - $after;
 if($before != $after) $syllables += $diference / 2;
 if($nos[strlen($nos)-1] == "E") $syllables --;
 if($nos[strlen($nos)-1] == "Y") $syllables ++;
 $before = $after;
 $nos = str_replace(array('A','E','I','O','U'),"",$nos);
 $after = strlen($nos);
 $syllables += ($before - $after);
 return $syllables;
}
function cg_SoundEx($SExStr){
 $syl = cg_sylc($SExStr);
 $SExStr = strtoupper($SExStr);
 // initialise the working strings and the counter before use
 $tsstr = "";
 $ttsstr = "";
 $iii = 0;
   for($i = 1, $ii = 2,print $SExStr[0]; ;$ii++){
     // substr() avoids reading an undefined string offset past the end
     if((substr($SExStr, $i, 1) != substr($SExStr, $ii, 1))){
         $tsstr .= substr($SExStr, $ii, 1);
         $i ++;
     }
     if(substr($SExStr, $ii, 1) == false){
       break;
     }
   }
 $tsstr = str_replace(array('A', 'E', 'H', 'I', 'O', 'U', 'W', 'Y'), "", $tsstr);
 $tsstr = str_replace(array('B', 'F', 'P', 'V'), "1", $tsstr);
 $tsstr = str_replace(array('C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z', 'Ç'), "2", $tsstr);
 $tsstr = str_replace(array('D', 'T'), "3", $tsstr);
 $tsstr = str_replace(array('L'), "4", $tsstr);
 $tsstr = str_replace(array('M', 'N', 'Ñ'), "5", $tsstr);
 $tsstr = str_replace(array('R'), "6", $tsstr);
 // pad or cut the digits to exactly three characters
 while($iii < 3){
   if(substr($tsstr, $iii, 1) != false){
     $ttsstr .= substr($tsstr, $iii, 1);
   } else {
     $ttsstr .= "0";
   }
   $iii ++;
 }
 // append the syllable count as a fifth character
 $ttsstr .= $syl;
 print $ttsstr;
}


dalibor dot toth

Apart from Clansy and Klansy, using the strings "rest" and "reset" with soundex() gives you: 1. R230, 2. R230. Pity...

dcallaghan

Although the standard soundex string is 4 characters long, and this is what's returned by the PHP function, some database programs return a string of arbitrary length. MySQL, for instance.
The MySQL documentation covers this, recommending that you may wish to use substring to output the standard 4 characters. Let's take 'Dostoyevski' as an example.
select soundex("Dostoyevski")
returns D2312
select substring(soundex("Dostoyevski"), 1, 4);
returns D231
PHP will return the value as 'D231'
So, to use the soundex function to generate a WHERE parameter in a MySQL SELECT statement, you might try this:
$s = soundex('Dostoyevski');
$sql = 'SELECT * FROM authors WHERE substring(soundex(lastname), 1, 4) = "' . $s . '"';
Or, if you want to bypass the php function
$result = mysql_query("select soundex('Dostoyevski')");
$s = mysql_result($result, 0, 0);
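
If a longer code has already been fetched from the database, it could also be cut down on the PHP side. A minimal sketch, assuming a helper named standard_soundex_length() (not part of PHP or MySQL):

```php
<?php
// Bring a soundex code of any length to the standard four characters:
// pad short codes with zeros, then cut everything after the fourth character.
// Hypothetical helper for illustration.
function standard_soundex_length(string $code): string {
    return substr(str_pad($code, 4, '0'), 0, 4);
}
```

For the example above, standard_soundex_length('D2312') gives 'D231', which is the same value PHP's soundex('Dostoyevski') returns. This only equalises the length; it does not remove other differences between the two implementations.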


fie

administrator at zinious dot com:
Sorry but your code wasn't soundex compliant
here were my results with your code, my code, and the default..
string: rest
R620 perform administrator's function 0.009452
R230 perform cg's function 0.001779
R230 perform default soundex function 9.4999999999956E-005
string: reset
R620 perform administrator's function 0.0055900000000001
R230 perform cg's function 0.00091799999999997
R230 perform default soundex function 0.00010600000000005
i dunno why the default, every once in a while, will for some reason be 9.xxx. very odd i think..
my code is at the bottom.. these tests were before the soundex modification as i describe below..
btw for all the original specs on the soundex algorithm go to
http://www.star-shine.net/~functionifelse/GFD/?word=soundex
dalibor dot toth at podravka dot hr:
yes it is perhaps sad that it gives you the same code,
even metaphone has that problem..
but one might not want to be so accurate.. if someone
is on a search engine.. lets call it shmoogle.. looking
for "php array reset" and searches for "php array rest"
then shmoogle might return stuff about beds and such..
(if they were all stupid and didnt use the first words
as more important) so anyways shmoogle might need it to
be less accurate in such cases.. but nonetheless..
my fix for this is to add the number of syllables at the end of the string, making it 5 characters long..
this would work as follows..
code at: http://star-shine.net/~functionifelse/cg_soundex.php
or if you wanted to just use the default soundex function
$str = soundex($str).cg_sylc($str);
revolutionary more or less.. probably less...
This function is only meant for one word though.. i'd like to see someone
modify it to use split and run it through a loop to get each word's cg_soundex
that'll be fun ;)
i would also like to suggest to the php zend apache kinda people who make php
to add an optional additional variable the user can specify as follows
soundex("string",SYL);
which would return the number of syllables at the end of the string
highly accurate sound testing woo! also you could add VOW for vowels
and CONS for consonant or whatever else someone would want..
but i really think the number of syllables will be plenty efficient.
umm.. if this helps anyone you're welcome.. ummm.. good luck in all
your php adventures.. oh... and the final results
syllables
1 rest
2 reset
metaphone
RST rest
RST reset
soundex
R230 rest
R230 reset
string: rest
R2301 perform cg's function 0.00211
R230 perform default soundex function 0.00011299999999997
string: reset
R2302 perform cg's function 0.001691
R230 perform default soundex function 0.00010399999999999
the default function is a tad bit faster..
so maybe they will add this option and we'll have speed and accuracy.
SILENT WIND OF DOOM WOOSH!
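
The per-word loop asked for above might look something like this. It is only a sketch: phrase_sound_keys() and syllable_guess() are invented names, and syllable_guess() is a simple vowel-run count standing in for cg_sylc(), so its counts can differ from cg_sylc() on other words.

```php
<?php
// Stand-in syllable estimate: number of runs of a, e, i, o, u in the word.
// This is an assumption for the sketch, not the cg_sylc() logic.
function syllable_guess(string $word): int {
    return (int) preg_match_all('/[aeiou]+/i', $word);
}

// Split a phrase on whitespace and build a five-character key per word:
// the soundex code followed by the syllable estimate.
function phrase_sound_keys(string $phrase): array {
    $keys = [];
    foreach (preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $keys[] = soundex($word) . syllable_guess($word);
    }
    return $keys;
}
```

For the phrase "rest reset" this yields R2301 and R2302, matching the results listed above for the two words.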


jr

A workaround for the MySQL/PHP differences in the implementation of soundex is to do the soundex comparison entirely within MySQL.
For example:
$sql = "SELECT * FROM table WHERE substring(soundex(field), 1, 4) =  substring(soundex('".$wordsearch."'), 1, 4)";


17-may-2002 06:07

A MUCH easier way to do the above search would be to simply add any letter in front of the string and then compare them.
ie. Klancy => LKlancy
   Clancy => LClancy


witold4249

A MUCH easier way to check for similarity between words and avoid the problems that come up with Klancy/Clancy would be to simply add any letter in front of the string
ie:  OKlancy/OClancy


marc quinton.

A French soundex version; it could be used for other foreign languages where soundex falls short. Perhaps a class with the specifics of each language could be written.
http://www.php-help.net/sources-php/a.french.adapted.soundex.289.html

