[Tim] > Actually, I'm certain they're the same algorithm now, except the C is > showing through in ratcliff to the floating-point eye <wink>. [Eric] > Take a look: Yup, same thing, except: > static float ratcliff(char *s1, char *s2) accounts for the numeric differences (change "float"->"double" and they'd be the same; Python has to convert it to a double anyway, lacking any internal support for C's floats; and the C code is *computing* in double regardless, cutting it back to a float upon return just because of the "float" decl). The code in SequenceMatcher doesn't *look* anything like it, though, due to years of dreaming up faster ways to do this (in its original role as a diff generator, it routinely had to deal with sequences containing 10s of thousands of elements, and code very much like the code you posted was just too slow for that). One simple trick that can enormously speed the worst cases: the "find the longest match starting here" innermost loop is guarded by > if (*a1 == *a2) However, it can't possibly find a *bigger* max unless it's also the case that a1[max) == a2[max) That's usually false in real life, so by adding that test to the guard you usually get to skip the innermost loop entirely. Probably more important in a diff-generator role, though. SequenceMatcher's prime trick is to preprocess one of the strings, in linear time building up a hash table mapping each character in the string to a list of the indices at which it appears. Then the second-innermost loop is saved from needing to do any search: when we get to, e.g., 'x' in the other string, the precomputed hash table tells us directly where to find all the x's in the original string. And in the match-1-against-N case, this hash table can be computed once & reused N times. That's a monster win. However, I never had the patience to code that in C, so I never *did* that before I reimplemented my stuff in Python. Now the Python ndiff runs circles around the old Pascal and C versions. I'm sure that has nothing to do with machines having gotten 100x faster in the meantime <wink>> for-short-1-against-1-matches-yours-will-certainly-be-quicker-ly y'rs - tim
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4