OK, now I understand why soundex isn't in the core -- there's no
canonical version.

Tim Peters <tim.one@home.com>:
> + There are any number of algorithms people may want to see (I don't know
>   what "normalized Hamming similarity" means, but if it's not the same as
>   Levenshtein edit distance then add the latter to the pot too).

Normalized Hamming similarity: it's an inversion of Hamming distance --
the number of positions at which two strings of the same length match,
divided by the common string length.  Gives a measure in [0.0, 1.0].

I've looked up "Levenshtein edit distance" and you're right.  I'll add it
as a fourth entry point as soon as I can find C source to crib.  (Would
you happen to have a pointer?)

> + Each algorithm on its own is likely controversial.

Not these.  There *are* canonical versions of all these, and exact
equivalents are all heavily used in commercial OCR software.

> + Computing string similarity is something few apps need anyway.

Tim, this isn't true.  Any time you need to validate user input against
a controlled vocabulary and give feedback on probable right choices,
R/O similarity is *very* useful.  I've had it in my personal toolkit for
a decade and used it heavily for this -- you take your unknown input,
check it against a dictionary and kick "maybe you meant foo?" to the user
for every foo with an R/O similarity above 0.6 or so.

The effects look like black magic.  Users love it.
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

"I hold it, that a little rebellion, now and then, is a good thing, and as 
necessary in the political world as storms in the physical."
	-- Thomas Jefferson, Letter to James Madison, January 30, 1787
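
For concreteness, here is a rough Python sketch of the two ideas in this message. It is not the C code under discussion. The names hamming_similarity and suggest are made up for illustration, and it assumes that "R/O" means Ratcliff/Obershelp and that difflib's matcher (which its documentation describes as a close relative of that algorithm) is an acceptable stand-in. Treating two empty strings as identical is also an assumption.

```python
import difflib


def hamming_similarity(a, b):
    """Normalized Hamming similarity: the fraction of positions at which
    two equal-length strings hold the same character, in [0.0, 1.0]."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if not a:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def suggest(word, vocabulary, cutoff=0.6):
    """Return up to three 'maybe you meant ...?' candidates from vocabulary.

    Uses difflib as a stand-in for R/O similarity; the 0.6 cutoff mirrors
    the threshold mentioned in the message above.
    """
    return difflib.get_close_matches(word, vocabulary, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(hamming_similarity("karolin", "kathrin"))
    print(suggest("appel", ["ape", "apple", "peach", "puppy"]))
```

Checking unknown input against a dictionary is then a single call to suggest() per input word, with the cutoff tuned to taste.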