[Tim]
>> ...It's a bunch of hacks that appeal not because they solve
>> a problem, but because they're cute algorithms that are pretty
>> easy to implement and kinda solve part of a problem.

[Eric]
> Again, my experience says differently. I have actually *used*
> Ratcliff-Obershelp to implement Do What I Mean (actually, Tell Me What
> I Mean) -- and had it work very well for non-geek users. That's why I
> want other Python programmers to have easy access to the capability.
> ...
> Where techniques like Ratcliff-Obershelp really shine (and what I
> expect the module to be used for) is with controlled vocabularies
> such as command interfaces.

Yet the narrower the domain, the less call for a library with multiple
approaches. If R-O really shone for you, why bother with anything else?
Seriously. You haven't used some (most?) of these. The core isn't a
place for research modules either (note that I have no objection
whatsoever to writing any module you like -- the only question here is
what belongs in the core, and any algorithm *nobody* here has
experience with in your target domain is plainly a poor *core*
candidate for that reason alone -- we have to maintain, justify and
explain it for years to come).

> I suspect your speech recognition experience has given you an
> unhelpful bias.

Try to think of it as a helpfully different perspective <0.5 wink>.
It's in favor of measuring error rate by controlled experiments,
skeptical of intuition, and dismissive of anecdotal evidence. I might
well agree you don't need all that heavy machinery if I had a clear
definition of what problem it is you're trying to solve (I've learned
it's not the kinds of problems *I* had in mind when I first read your
description!).

BTW, telephone speech recog requires controlled vocabularies because
phone acoustics are too poor for the customary close-talking microphone
approaches to work well enough. A std technique there is to build a
"confusability matrix" of the words *in* the vocabulary, to spot
trouble before it happens: if two words are acoustically confusable, it
flags them and bounces that info back to the vocabulary designer.

A similar approach should work well in your domain: if you get to
define the cmd interface, run all the words in it pairwise through your
similarity measure of choice, and dream up new words whenever a pair is
"too close". That all but ensures that even a naive similarity
algorithm will perform well (in telephone speech recog, the
unconstrained error rate is up to 70% on cell phones; by constraining
the vocabulary with the aid of confusability measures, we cut that to
under 1%).
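To make the pairwise check concrete, here's a minimal sketch. It leans
on the same SequenceMatcher the attached script (at the end of this
msg) imports from ndiff; the 0.6 cutoff and the toy vocabulary are
invented for illustration, not recommendations:

    import sys
    sys.path.append("/Python20/Tools/Scripts")  # wherever ndiff lives
    from ndiff import SequenceMatcher

    def confusable_pairs(vocab, cutoff=0.6):
        # Flag every pair of vocabulary words whose similarity ratio
        # exceeds the cutoff, so the designer can rename one of them.
        s = SequenceMatcher()
        flagged = []
        for i in range(len(vocab)):
            s.set_seq2(vocab[i])  # seq2 gets the expensive preprocessing
            for j in range(i + 1, len(vocab)):
                s.set_seq1(vocab[j])  # cheap to swap per candidate
                if s.ratio() > cutoff:
                    flagged.append((vocab[i], vocab[j], s.ratio()))
        return flagged

    for w1, w2, r in confusable_pairs(["delete", "deplete",
                                       "insert", "invert"]):
        print "%s / %s are too close (%.2f)" % (w1, w2, r)

Run over a real command vocabulary, anything it flags is a candidate
for renaming before users ever get the chance to confuse the pair.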
> ...
> (Actually, my gut after thinking about both algorithms hard is that
> R/O is still a better technique than Levenshtein for the kind of
> application I have in mind. But I also suspect the difference is
> marginal.)

So drop Levenshtein -- go with your best shot. Do note that they both
(usually) consider a single transposition to be as much a mutation as
two replacements (or an insert plus a delete -- "pure" Levenshtein
treats those the same): typing "teh" for "the" counts the same as
getting two letters flat-out wrong.

What happens when the user doesn't enter an exact match? Does the kind
of app you have in mind then just present them with a list of choices?
If that's all (as opposed to, e.g., substituting its best guess for
what the user actually typed and proceeding as if the user had given
that from the start), then the evidence from studies says users are
almost as pleased when the correct choice appears somewhere in the
first three choices as when it appears as *the* top choice. A
well-designed vocabulary can almost guarantee that happy result (note
that most of the current research is aimed at the much harder job of
getting the intended word into the #1 slot on the choice list).

> (Other good uses for algorithms in this class include cladistics and
> genomic analysis.)

I believe you'll find current work in those fields has moved far beyond
these simplest algorithms too, although they remain inspirational (for
example, see "Protein Sequence Alignment and Database Scanning" at

    http://barton.ebi.ac.uk/papers/rev93_1/rev93_1.html

Much as in typing, some mutations are more likely than others for
*physical* reasons, so treating all pairs of symbols in the alphabet
alike is too gross a simplification.)
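If anyone wants to see what "not treating all symbols alike" looks like
in miniature, here's a sketch of the classic edit-distance dynamic
program with a pluggable substitution-cost function. The cost table is
invented for illustration; real sequence-alignment work uses
empirically derived matrices (PAM, BLOSUM and friends):

    def weighted_distance(a, b, subcost, indel=1.0):
        # d[i][j] is the cheapest way to turn a[:i] into b[:j].
        d = [[0.0] * (len(b) + 1) for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * indel
        for j in range(1, len(b) + 1):
            d[0][j] = j * indel
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i-1][j] + indel,   # delete a[i-1]
                              d[i][j-1] + indel,   # insert b[j-1]
                              d[i-1][j-1] + subcost(a[i-1], b[j-1]))
        return d[len(a)][len(b)]

    def typo_cost(x, y):
        # Pretend a few adjacent-key substitutions are half price.
        if x == y:
            return 0.0
        cheap = [("a", "s"), ("e", "r"), ("i", "o"), ("n", "m")]
        if (x, y) in cheap or (y, x) in cheap:
            return 0.5
        return 1.0

    print weighted_distance("random", "randon", typo_cost)  # 0.5
    print weighted_distance("random", "randox", typo_cost)  # 1.0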
>> Even for most developers, it would be better to package up the
>> single best approach you've got (f(list, word) -> list of possible
>> matches sorted in confidence order), instead of a module with 6
>> (or so) functions they don't understand and a pile of equally
>> mysterious knobs.

> That's why good documentation, with motivating usage hints, is
> important. I write good documentation, Tim.

You're not going to find offense here even if you look for it, Eric
<wink>: while only a small percentage of developers don't read docs at
all, everyone else spaces out at least in linear proportion to the
length of the docs. Most people will be looking for "a solution", not
for "a toolkit". If the docs read like a toolkit, it doesn't matter how
good they are, the bulk of the people you're trying to reach will pass
on it.

If you really want this to be *used*, supply one class that does *all*
the work, including making the expert-level choices of which algorithm
is used under the covers and how it's tuned. That's good advice. I
still expect Guido won't want it in the core before wide use is a
demonstrated fact, though (and no, that's not a chicken-vs-egg thing:
"wide use" for a thing outside the core is narrower than "wide use" for
a thing in the core). An exception would likely get made if he tried it
and liked it a lot. But to get it under his radar, it's again much
easier if the usage docs are no longer than a couple paragraphs.

I'll attach a tiny program that uses ndiff's SequenceMatcher to guess
which of the 147 std 2.0 top-level library modules a user may be
thinking of (and best I can tell, these are the same results
case-folding R/O would yield):

Module name? random
Hmm. My best guesses are random, whrandom, anydbm
(BTW, the first choice was an exact match)

Module name? disect
Hmm. My best guesses are bisect, dis, UserDict

Module name? password
Hmm. My best guesses are keyword, getpass, asyncore

Module name? chitchat
Hmm. My best guesses are whichdb, stat, asynchat

Module name? xml
Hmm. My best guesses are xmllib, mhlib, xdrlib

[So far so good]

Module name? http
Hmm. My best guesses are httplib, tty, stat

[I was thinking of httplib, but note that it missed SimpleHTTPServer:
a name that long just isn't going to score high when the input is that
short]

Module name? dictionary
Hmm. My best guesses are Bastion, ConfigParser, tabnanny

[darn, I *think* I was thinking of UserDict there]

Module name? uuencode
Hmm. My best guesses are code, codeop, codecs

[Missed uu]

Module name? parse
Hmm. My best guesses are tzparse, urlparse, pre

Module name? browser
Hmm. My best guesses are webbrowser, robotparser, user

Module name? brower
Hmm. My best guesses are webbrowser, repr, reconvert

Module name? Thread
Hmm. My best guesses are threading, whrandom, sched

Module name? pickle
Hmm. My best guesses are pickle, profile, tempfile
(BTW, the first choice was an exact match)

Module name? shelf
Hmm. My best guesses are shelve, shlex, sched

Module name? katmandu
Hmm. My best guesses are commands, random, anydbm

[I really was thinking of "commands"!]

Module name? temporary
Hmm. My best guesses are tzparse, tempfile, fpformat

So it gets what I was thinking of into the top 3 very often, and
despite some wildly poor guesses at the correct spelling -- you'd
*almost* think it was doing a keyword search, except the *unintended*
choices on the list are so often insane <wink>.

Something like that may be a nice addition to Paul/Ping's help facility
someday too.

Hard question: is that "good enough" for what you want? Checking
against 147 things took no perceptible time, because SequenceMatcher is
already optimized for "compare one thing against N", doing
preprocessing work on the "one thing" that greatly speeds the N
similarity computations (I suspect you're not -- yet). It's been tuned
and tested in practice for years; it works for any sequence type with
hashable elements (so Unicode strings are already covered); it works
for long sequences too. And if R-O is the best trick we've got, I
believe it already does it. Do we need more? Of course *I'm* not
convinced we even need *it* in the core, but packaging a
match-1-against-N class is just a few minutes' editing of what follows.

something-to-play-with-anyway-ly y'rs  - tim

NDIFFPATH = "/Python20/Tools/Scripts"
LIBPATH = "/Python20/Lib"

import sys, os
sys.path.append(NDIFFPATH)
from ndiff import SequenceMatcher

# Map lowercase module stem to module name.
modules = {}
for f in os.listdir(LIBPATH):
    if f.endswith(".py"):
        f = f[:-3]
        modules[f.lower()] = f

def match(fname, numchoices=3):
    lower = fname.lower()
    s = SequenceMatcher()
    s.set_seq2(lower)  # seq2 gets the one-time preprocessing
    scores = []
    for lowermod, mod in modules.items():
        s.set_seq1(lowermod)  # cheap to swap per candidate
        scores.append((s.ratio(), mod))
    scores.sort()
    scores.reverse()
    return modules.has_key(lower), [x[1] for x in scores[:numchoices]]

while 1:
    name = raw_input("Module name? ")
    is_exact, choices = match(name)
    print "Hmm. My best guesses are", ", ".join(choices)
    if is_exact:
        print "(BTW, the first choice was an exact match)"
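And in case "a few minutes' editing" sounds like handwaving, here's a
minimal sketch of that match-1-against-N class, rearranged from the
script above. The class and method names are invented:

    import sys
    sys.path.append("/Python20/Tools/Scripts")  # same NDIFFPATH as above
    from ndiff import SequenceMatcher

    class Corrector:
        # One class, no knobs beyond the number of choices returned.
        def __init__(self, vocabulary):
            self.vocab = vocabulary

        def guesses(self, word, numchoices=3):
            s = SequenceMatcher()
            s.set_seq2(word.lower())  # preprocess the query once
            scores = []
            for v in self.vocab:
                s.set_seq1(v.lower())  # cheap per-candidate swap
                scores.append((s.ratio(), v))
            scores.sort()
            scores.reverse()
            return [x[1] for x in scores[:numchoices]]

    c = Corrector(["delete", "insert", "replace", "quit"])
    print c.guesses("delte")  # "delete" should head the list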