I have a new goodie for the 2.1 standard library, a module called "simil" that supports computation of similarity indices between strings, such as one might use for recovery-matching of misspellings against a dictionary. The three methods supported are stemming, normalized Hamming similarity, and (the star of the show) Ratcliff-Obershelp gestalt subpattern matching. The last of these is spookily effective for detecting not just substitution typos but also insertions and deletions.

The module is a C extension (my first!), both for speed and because the Ratcliff-Obershelp implementation uses pointer arithmetic heavily. It's documented, tested, and ready to go.

But having written it, I now have a question: why is soundex marked obsolete? Is there something wrong with the algorithm or implementation? If not, then it would be natural for simil to absorb the existing soundex implementation as a fourth entry point.

-- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Whether the authorities be invaders or merely local tyrants, the effect of such [gun control] laws is to place the individual at the mercy of the state, unable to resist.
-- Robert Anson Heinlein, 1949

Americans have the right and advantage of being armed - unlike the citizens of other countries whose governments are afraid to trust the people with arms.
-- James Madison, The Federalist Papers
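For readers unfamiliar with two of the measures named in the post, here is a minimal pure-Python sketch of normalized Hamming similarity and Ratcliff-Obershelp gestalt matching. It is not the simil module's code or API: the function names are hypothetical, and dividing the Hamming match count by the longer string's length is an assumption about what "normalized" means here. The gestalt score follows the usual description of the algorithm: find the longest common substring, recurse on the pieces to its left and right, and return twice the matched character count divided by the combined length.

```python
def hamming_similarity(a, b):
    """Fraction of aligned positions that agree (hypothetical helper).

    Assumption: normalize by the longer string, so any length
    difference counts as mismatches.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def _longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the first longest common run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b):
            if ch_a == ch_b:
                run = prev[j] + 1  # extend the run ending at a[i-1], b[j-1]
                cur[j + 1] = run
                if run > best[2]:
                    best = (i - run + 1, j - run + 1, run)
        prev = cur
    return best


def _matched_chars(a, b):
    """Count characters matched by recursive gestalt decomposition."""
    i, j, n = _longest_common_substring(a, b)
    if n == 0:
        return 0
    left = _matched_chars(a[:i], b[:j])
    right = _matched_chars(a[i + n:], b[j + n:])
    return n + left + right


def gestalt_similarity(a, b):
    """Ratcliff-Obershelp score: 2 * matched / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2.0 * _matched_chars(a, b) / total


if __name__ == "__main__":
    # A single dropped letter shifts every later position, which hurts
    # the positional Hamming score far more than the gestalt score.
    for word in ("simlarity", "similerity"):
        print(word,
              hamming_similarity("similarity", word),
              gestalt_similarity("similarity", word))
```

The deletion case ("simlarity") shows why the post singles out gestalt matching: a positional comparison falls out of alignment after the missing letter, while the substring-based score still credits the intact runs on either side. The standard library's difflib.SequenceMatcher computes a closely related ratio.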