Tim Peters wrote:
>
> [M.-A. Lemburg]
> > BTW, are there less English centric "sounds alike" matchers
> > around ?
>
> Yes, but if anything there are far too many of them: like Soundex, they're
> just heuristics, and *everybody* who cares adds their own unique twists,
> while proper studies are almost non-existent. Few variants appear to be in
> use much beyond their inventor's friends; one notable exception in the
> Jewish community is the Daitch-Mokotoff variation, originally tailored to
> their unique needs but later generalized; a brief description here:
>
> http://www.avotaynu.com/soundex.html
>
> The similarly involved NYSIIS algorithm (New York State Identification
> Intelligence System -- look for NYSIIS on Parnassus) was the winner from a
> field of about two dozen competing algorithms, after measuring their
> effectiveness on assorted databases maintained by the state of New York.
> Since New York has a large immigrant population, NYSIIS isn't as
> Anglocentric as Soundex either.

Thanks for the pointer. I'll add that module to my lib :)

http://metagram.webreply.com/downloads/nysiis.py

Perhaps Eric ought to add this one to his package as well ?!

BTW, where can I find your package on the web, Eric ? I'd like
to give it a ride under German language conditions ;)

> But state-of-the-art has given up on purely computational algorithms for
> these purposes: proper names are simply too much a mess. For example, if I
> search for "Richard", it *ought* to match on "Dick"; if my Arab buddy
> searches on "Mohammed", it *ought* to match on "Mhd"; "the rules" people
> actually use just aren't reducible to pure computation -- it takes a large
> knowledge base to capture what people "just know". You may enjoy visiting
> this commercial site (AFAIK, nobody is giving away state-of-the-art for
> free):
>
> http://www.las-inc.com/

Sad -- "patent pending" algorithms don't help anyone on this planet :(

> > ...
> > http://physics.nist.gov/cuu/Reference/soundex.html
> >
> > works fine for English texts,
>
> If that were true, the English-speaking researchers would have declared
> victory 120 years ago <wink>. But English pronunciation is *notoriously*
> difficult to predict from spelling, partly because English is the Perl of
> human languages.

Then Dutch must be the Python of human languages... ;)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/