[posted & mailed] [Andrew Dalke] > I'm also surprised. I haven't heard of that alignment method > before, as compared to sequence alignment methods for bioinformatics. > But then I haven't really delved into that topic with sequences. That's because you're trying to outguess nature, but SequenceMatcher is trying to outguess people: the latter doesn't care about the most efficient way to change X into Y, it wants to guess what a *person* probably did when changing X into Y by hand. This can make text-file diffs much easier to follow. before: private Thread currentThread; after: private volatile Thread currentThread; Run this thru your usual sequence comparator, and it's likely to say "Ah, I know! They inserted 'e volatil' between the 't' and 'e' at the end of 'private'". They can reach this bizarre conclusion because 'e Thread currentThread' is a long matching subsequence. Or: before: ab after: acab "Ah, I know! They stuffed 'ca' into the middle!" (something Unix diff and Windows Windiff both believe, albeit via entirely different algorithms). But "it's obvious" how a *person* changed the first example, and more likely than not that a *person* actually tacked 'ac' on to the front in the second example. SequenceMatcher is trying to get those results, even if there are "more efficient" ways to get from before to after. AFAIK, it's only people using vi who spend minutes plotting out the most efficient key sequence to get a job done <wink>. > ... > Perhaps some of the biopython work along these lines might be > relevant to the general population? I don't know, but I named it "difflib.py" instead of "sequencematcher.py" just in case one of you intrepid researchers did have something of general value. difflib.py came into being when Eric Raymond was writing a more general library, and we both came to the conclusion that the SequenceMatcher algorithm is the only one we actually had good results with in practice. But he was trying to outguess people too. > ... > Hmm, and it seems difflib has support for masking what is sometimes > viewed as low information content regions. No, Wootton has support for what difflib views as clumps of junk. I'm not a fan of overblown terminology unless *so* overblown it's funny <wink>. comparatively y'rs - tim
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4