Mikkel Rasmussen <footech at get2net.dk> wrote in message news:5mSE6.44$677.2732 at news.get2net.dk... > > Tim Peters <tim.one at home.com> wrote in message > news:mailman.987979892.8018.python-list at python.org... > > [Mikkel Rasmussen] > > > Has anybody got any references for the algorithm used in difflib. > > > > Nope. It's an algorithm I dreamed up in the early 80's, while at Cray > > Research, writing a file-differencing tool. I've carried it with me ever > > since, reimplementing it in at least a dozen languages along the way. The > > basic idea popped up when staring at some incomprehensible "diff" output, > and > > wondering "hmmmm -- what if diff restricted itself to *contiguous* > > subsequences? and then what if it didn't synch up on junk either?". > That's > > about it. You can find more discussion in the comments in module ndiff.py > > (difflib's SequenceMatcher class started its life in ndiff.py years ago; > it > > simply got split out into its own module for 2.1). > > > > When Eric Raymond mentioned the Ratcliff-Obershelp algorithm on > Python-Dev, > > it was news to me, and a Google search quickly turned up a C > implementation > > of that. It was obviously the same basic algorithm, except without the > "skip > > junk" gimmick, and at least the implementations I found were much less > > efficient than SequenceMatcher (this is one of those happy instances where > my > > Python code runs *faster* than previous C, Fortran and Pascal > > implementations: I had the *idea* of chaining together hash lists to > speed > > the search a long time ago, but never had the patience to *implement* that > > before recoding in Python -- it's easy in Python). > > > > I have now tested the difflib algorithm and my own algorithm. The difflib is > about twice as fast as my own (which is not optimised in any way). They > return near identical results. It will be interesting to look closer at the > difflib algorithm. > > Mikkel Rasmussen > > I have now combined some of the difflib rudimentary functions with my own sequence matcher (fuzzy match) algorithm. My own algorithm now runs ca. as fast as the get_close_matches function, but it returns better results for a large number of possibilities. Thanks for the info. Mikkel Rasmussen
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4