[Eric, on "edit distance"]
> I found some formal descriptions of the algorithm and some
> unencumbered Oberon source. I'm coding up C now. It's not
> complicated if you're willing to hold the cost matrix in memory,
> which is reasonable for a string comparator in a way it wouldn't
> be for a file diff.

All agreed, and it should be a straightforward task then. I'm assuming
it will work with Unicode strings too <wink>.

[on differing weights]
> Which about collapse into one if your function has three weight
> arguments for insert/replace/delete weights, as mine does. It don't
> get more general than that -- I can see that by looking at the formal
> description.
>
> OK, so I'll give you that I don't weight transpositions separately,
> but neither does any other variant I found on the web nor the formal
> descriptions. A fourth optional weight argument someday, maybe :-).
> ...
> and that's why I want it in the Python library, so doing the right
> thing in Python will be a minimum-effort proposition.

Guido will depart from you at a different point. I depart here: it's
not "the right thing". It's a bunch of hacks that appeal not because
they solve a problem, but because they're cute algorithms that are
pretty easy to implement and kinda solve part of a problem. "The
right thing" -- which you can buy -- at least involves capturing a
large base of knowledge about phonetics and spelling.

In high school, one of my buddies was Dan Pryzbylski. If anyone who
knew him (other than me <wink>) were to type his name into the class
reunion guy's web page, they'd probably spell it the way they remember
him pronouncing it: sha-bill-skey (and that's how he pronounced "Dan"
<wink>). If that hit on the text string "Pryzbylski", *then* it would
be "the right thing" in a way that makes sense to real people, not
just to implementers.
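For concreteness, the scheme Eric describes -- three weight arguments
for insert/replace/delete, the full cost matrix held in memory, and a
possible fourth weight for transpositions -- can be sketched in Python
roughly like so. The function name and defaults are illustrative, not
Eric's actual C interface:

```python
def edit_distance(a, b, w_ins=1, w_rep=1, w_del=1, w_trans=None):
    """Weighted Levenshtein distance between strings a and b.

    If w_trans is given, an adjacent transposition is also allowed
    at that cost (the Damerau variant).  Works on Unicode strings
    for free, since Python indexes them by character.
    """
    m, n = len(a), len(b)
    # cost[i][j] = cheapest way to turn a[:i] into b[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * w_del
    for j in range(1, n + 1):
        cost[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if a[i - 1] == b[j - 1] else w_rep
            best = min(cost[i - 1][j] + w_del,    # delete a[i-1]
                       cost[i][j - 1] + w_ins,    # insert b[j-1]
                       cost[i - 1][j - 1] + rep)  # replace (or match)
            if (w_trans is not None and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, cost[i - 2][j - 2] + w_trans)
            cost[i][j] = best
    return cost[m][n]

print(edit_distance("kitten", "sitting"))    # → 3
print(edit_distance("ab", "ba"))             # → 2
print(edit_distance("ab", "ba", w_trans=1))  # → 1
```

Note how the transposition weight changes the "ab" vs "ba" answer: a
swap that costs two unit replacements without it costs one
transposition with it.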
Working six years in commercial speech recog really hammered that home
to me: 95% solutions are on the margin of unsellable, because an error
one try in 20 is intolerable for real people. Developers writing for
developers get "whoa! cool!" where my sisters walk away going "what
good is that?". Edit distance doesn't get within screaming range of
95% in real life.

Even for most developers, it would be better to package up the single
best approach you've got (f(list, word) -> list of possible matches
sorted in confidence order), instead of a module with 6 (or so)
functions they don't understand and a pile of equally mysterious
knobs. Then it may actually get used! Developers of the breed who
would actually take the time to understand what you've done are, I
suggest, similar to us: they'd skim the docs, ignore the code, and
write their own variations. Or, IOW:

> so doing the right thing in Python will be a minimum-effort
> proposition.

Make someone think first, and 95% of developers will just skip over
it too.

BTW, the theoretical literature ignored transposition at first,
because it didn't fit well in the machinery. IIRC, I first read about
it in an issue of SP&E (Software Practice & Experience), where the
authors were forced into it because the "traditional" edit sequence
measure sucked in their practice. They were much happier after taking
transposition into account. The theoreticians have more than caught
up since, and research is still active; e.g., 1997's

    PATTERN RECOGNITION OF STRINGS WITH SUBSTITUTIONS, INSERTIONS,
    DELETIONS AND GENERALIZED TRANSPOSITIONS
    B. J. Oommen and R. K. S. Loke
    http://www.scs.carleton.ca/~oommen/papers/GnTrnsJ2.PDF

is a good read.
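The "single best approach" interface described above -- one function,
candidates in, ranked matches out, all the knobs hidden -- could look
something like this. The function name is invented, and the stdlib's
difflib.SequenceMatcher is used here purely as a stand-in similarity
measure:

```python
import difflib

def best_matches(candidates, word, cutoff=0.6):
    """Return the candidates similar to word, best match first."""
    scored = []
    for cand in candidates:
        # ratio() is a similarity score in [0, 1]; 1.0 means identical
        ratio = difflib.SequenceMatcher(None, word, cand).ratio()
        if ratio >= cutoff:
            scored.append((ratio, cand))
    scored.sort(reverse=True)  # highest confidence first
    return [cand for ratio, cand in scored]

print(best_matches(["apple", "ape", "peach"], "appel"))
# → ['apple', 'ape']
```

The point is the shape of the API, not the particular similarity
measure: the caller never sees weights, matrices, or cutoff tuning
unless they go looking for them.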
As they say there,

    If one views the elements of the confusion matrices as
    probabilities, this [treating each character independent of all
    others, as "edit distance" does] is equivalent to assuming that
    the transformation probabilities at each position in the string
    are statistically independent and possess first-order Markovian
    characteristics. This model is usually assumed for simplicity
    rather it [sic] having any statistical significance.

IOW, because it's easy to analyze, not because it solves a real
problem -- and they're complaining about an earlier generalization of
edit distance that makes the weights depend on the individual symbols
involved as well as on the edit/delete/insert distinction (another
variation trying to make this approach genuinely useful in real
life).

The Oommen-Loke algorithm appears much more realistic, taking into
account the observed probabilities of mistyping specific letter pairs
(although it still ignores phonetics), and they report accuracies
approaching 98% in correctly identifying mangled words. 98% (more
than twice as good as 95% -- the error rate is actually more useful
to think about, 2% vs 5%) is truly useful for non-geek end users, and
the state of the art here is far beyond what's easy to find and dead
easy to implement.

> ...
> It probably won't surprise you that I considered writing an FFT
> extension module at one point :-).

Nope! More power to you, Eric. At least FFTs *are* state of the art,
although *coding* them optimally is likely beyond human ability on
modern machines:

    http://www.fftw.org/

(short course: they've generally got the fastest FFTs available, and
their code is generated by program, systematically *trying* every
trick in the book, timing it on a given box, and synthesizing a
complete strategy out of the quickest pieces).

sooner-or-later-the-only-code-real-people-will-use-won't-be-written-
by-people-at-all-ly y'rs  - tim
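[P.S. The symbol-dependent weighting the quoted passage refers to --
substitution costs that depend on *which* pair of letters got
confused -- can be sketched like this. The tiny confusion table is
invented for illustration; this is the earlier generalization, not
the Oommen-Loke algorithm, which estimates such costs from observed
typing data and handles generalized transpositions as well:]

```python
# Invented costs: confusing 'e' with 'a' is cheap (a common typo);
# any other mismatched pair costs the full unit.
CONFUSION = {('e', 'a'): 0.3, ('a', 'e'): 0.3}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return CONFUSION.get((x, y), 1.0)

def confusion_distance(a, b, w_ins=1.0, w_del=1.0):
    """Edit distance whose substitution cost depends on the letter pair."""
    m, n = len(a), len(b)
    # cost[i][j] = cheapest way to turn a[:i] into b[:j]
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * w_del
    for j in range(1, n + 1):
        cost[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i][j] = min(
                cost[i - 1][j] + w_del,
                cost[i][j - 1] + w_ins,
                cost[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return cost[m][n]

# "grey" vs "gray" differ only by the cheap e/a confusion:
print(confusion_distance("grey", "gray"))  # → 0.3
print(confusion_distance("grey", "grew"))  # → 1.0
```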