[M.-A. Lemburg]
> BTW, are there less English centric "sounds alike" matchers
> around ?

Yes, but if anything there are far too many of them:  like Soundex,
they're just heuristics, and *everybody* who cares adds their own
unique twists, while proper studies are almost non-existent.

Few variants appear to be in use much beyond their inventor's friends;
one notable exception in the Jewish community is the Daitch-Mokotoff
variation, originally tailored to their unique needs but later
generalized; a brief description here:

    http://www.avotaynu.com/soundex.html

The similarly involved NYSIIS algorithm (New York State Identification
Intelligence System -- look for NYSIIS on Parnassus) was the winner
from a field of about two dozen competing algorithms, after measuring
their effectiveness on assorted databases maintained by the state of
New York.  Since New York has a large immigrant population, NYSIIS
isn't as Anglocentric as Soundex either.

But state-of-the-art has given up on purely computational algorithms
for these purposes:  proper names are simply too much a mess.  For
example, if I search for "Richard", it *ought* to match on "Dick"; if
my Arab buddy searches on "Mohammed", it *ought* to match on "Mhd";
"the rules" people actually use just aren't reducible to pure
computation -- it takes a large knowledge base to capture what people
"just know".  You may enjoy visiting this commercial site (AFAIK,
nobody is giving away state-of-the-art for free):

    http://www.las-inc.com/

> ...
>     http://physics.nist.gov/cuu/Reference/soundex.html
>
> works fine for English texts,

If that were true, the English-speaking researchers would have
declared victory 120 years ago <wink>.  But English pronunciation is
*notoriously* difficult to predict from spelling, partly because
English is the Perl of human languages.

or-maybe-the-borg-assuming-there's-a-difference<wink>-ly y'rs  - tim
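
To make the "Richard"/"Dick" point above concrete, here is a minimal sketch of the common American Soundex rules in Python, plus a toy lookup table standing in for the "large knowledge base". It is an illustration only, not the NYSIIS or Daitch-Mokotoff algorithm and not what any commercial product does. The names `soundex`, `NICKNAMES`, `canonical` and `same_name` are hypothetical, and the two-entry table is invented purely for the example.

```python
import string

# Consonant classes used by American Soundex; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character American Soundex code for an ASCII name."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]


# Hypothetical, tiny stand-in for the knowledge base described above.
NICKNAMES = {"dick": "richard", "mhd": "mohammed"}


def canonical(name):
    """Map a known short form to its full name; leave other names unchanged."""
    key = name.lower()
    return NICKNAMES.get(key, key)


def same_name(a, b):
    """Compare two names by Soundex after nickname lookup."""
    return soundex(canonical(a)) == soundex(canonical(b))


print(soundex("Richard"), soundex("Dick"))    # codes differ
print(soundex("Mohammed"), soundex("Mhd"))    # codes differ
print(same_name("Richard", "Dick"))           # matches only via the table
```

The digit coding on its own cannot connect "Richard" with "Dick" or "Mohammed" with "Mhd"; only the table does, and a real table would have to be far larger than any rule set.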