[Eric]
> OK, now I understand why soundex isn't in the core -- there's no
> canonical version.

Actually, I think Knuth Vol 3 Ed 3 is canonical *now* -- nobody would
dare to oppose him <0.5 wink>.

> Normalized Hamming similarity: it's an inversion of Hamming distance
> -- number of pairwise matches in two strings of the same length,
> divided by the common string length.  Gives a measure in [0.0, 1.0].
>
> I've looked up "Levenshtein edit distance" and you're right.  I'll add
> it as a fourth entry point as soon as I can find C source to crib.
> (Would you happen to have a pointer?)

If you throw almost everything out of Unix diff, that's what you'll be
left with.  Offhand I don't know of unencumbered, industrial-strength
C source; a problem is that writing a program to compute this is a
standard homework exercise (it's a common first "dynamic programming"
example), so you can find tons of bad C source.

Caution: many people want small variations of "edit distance", usually
via assigning different weights to insertions, replacements and
deletions.  A less common but still popular variant is to say that a
transposition ("xy" vs "yx") is less costly than a delete plus an
insert.  Etc.  "edit distance" is really a family of algorithms.

>> + Each algorithm on its own is likely controversial.

> Not these.  There *are* canonical versions of all these,

See the "edit distance" gloss above.

> and exact equivalents are all heavily used in commercial OCR
> software.

God forbid that core Python may lose the commercial OCR developer
market <wink>.  It's not accepted that for every field F, core Python
needs to supply the algorithms F uses heavily.  Heck, core Python
doesn't even ship with an FFT!  Doesn't bother the folks working in
signal processing.

>> + Computing string similarity is something few apps need anyway.

> Tim, this isn't true.
> Any time you need to validate user input
> against a controlled vocabulary and give feedback on probable right
> choices,

Which is something few apps need anyway -- in my experience, but more
so in my *primary* role here of trying to channel for you (& Guido)
what Guido will say.  It should be clear that I've got some
familiarity with these schemes, so it should also be clear that Guido
is likely to ask me about them whenever they pop up.  But Guido has
hardly ever asked me about them over the past decade, with the
exception of the short-lived Soundex brouhaha.  From that I guess
hardly anyone ever asks *him* about them, and that's how channeling
works: if this were an area where Guido felt core Python needed
beefier libraries, I'm pretty sure I would have heard about it by now.

But now Guido can speak for himself.  There's no conceivable argument
that could change what I *predict* he'll say.

> R/O similarity is *very* useful.  I've had it in my personal
> toolkit for a decade and used it heavily for this -- you take your
> unknown input, check it against a dictionary and kick "maybe you meant
> foo?" to the user for every foo with an R/O similarity above 0.6 or so.
>
> The effects look like black magic.  Users love it.

I believe that.  And I'd guess we all have things in our personal
toolkits our users love.  That isn't enough to get into the core, as I
expect Guido will belabor on the next iteration of this <wink>.

doesn't-mean-the-code-isn't-mondo-cool-ly y'rs  - tim
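[Editor's sketch, not part of the thread: the two measures discussed
above -- Eric's normalized Hamming similarity and the classic
unit-cost Levenshtein distance -- can be written in a few lines of
Python.  The function names `hamming_similarity` and `levenshtein` are
mine, not from the thread; the Levenshtein version is the standard
dynamic-programming formulation Tim calls a "common first example",
with the weighted variants he cautions about noted in comments.]

```python
def hamming_similarity(a, b):
    """Normalized Hamming similarity: fraction of matching positions.

    Defined only for strings of the same length; result is in [0.0, 1.0].
    """
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def levenshtein(a, b):
    """Classic edit distance via dynamic programming, unit costs.

    The weighted variants Tim mentions change only the three cost
    constants below; the transposition-friendly (Damerau) variant
    adds a fourth case to the min().
    """
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[len(b)]
```

For example, `levenshtein("kitten", "sitting")` is 3, the textbook
answer (two replacements plus one insertion).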
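[Editor's sketch, not part of the thread: the "maybe you meant foo?"
scheme Eric describes -- compare unknown input against a controlled
vocabulary and surface every entry whose similarity clears roughly
0.6 -- can be approximated with the standard library's difflib, whose
SequenceMatcher.ratio() is a Ratcliff/Obershelp-style measure in
[0.0, 1.0].  The `suggest` helper and the sample vocabulary are mine;
this is similar in spirit to, not necessarily identical to, Eric's
personal R/O implementation.]

```python
import difflib

def suggest(word, vocabulary, threshold=0.6):
    """Return vocabulary entries similar to `word`, best match first."""
    hits = []
    for candidate in vocabulary:
        ratio = difflib.SequenceMatcher(None, word, candidate).ratio()
        if ratio >= threshold:
            hits.append((ratio, candidate))
    # Sort by descending similarity so the likeliest fix comes first.
    return [c for _, c in sorted(hits, reverse=True)]

vocab = ["apple", "apricot", "banana", "cherry"]
print(suggest("aple", vocab))
```

The standard library also ships essentially this loop ready-made as
difflib.get_close_matches(word, vocab, cutoff=0.6), with the same
default 0.6 cutoff Eric arrived at empirically.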