On 7/11/2010 11:02 PM, Tim Peters wrote: >> The heuristic lowered the reported match ratio from .96 to .88, which >> would be bad when one wanted the unaltered value. > BTW, it's not clear whether ratio() computes a _useful_ value in the > presence of junk, however that may be defined. I agree, which is one reason why one should be to disable auto-junking. There are a number of statistical methods for analyzing similarity matrices, analogous to correlation matrices, except that entries are in [0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of species. Similarity measures should ususally be symmetric and increase with greater matching. The heuristic can cause both to fail. > I suspect nobody cares ;-) There are multiple possible definitions of similarity for sets (and arguments thereabout). I am sure the same is true for sequences. But I consider the definition for .ratio, without the heuristic, to be sensible. I would consider using it should the occasion arise. > It certainly was the intent that nothing would be > called junk unless it appeared at least twice, so the "n>= 200" > clause ensures that 1% of n is at least 2. Since 2 cannot be greater than something that is at least 2, you ensured that nothing would be called junk unless it appears as least thrice. > However, I'm wary of introducing a generalization in the absence of > experience saying people would use it. Is this the right kind of > parametrization? Is this even the right kind of way to go about > auto-detecting junk? I know it worked great for the original use case > that drove it, but I haven't seen anyone say they want a notion of > auto-junk detection for other uses - just that they _don't_ want the > wholly inappropriate current auto-junk detection in (some or all) of > their uses. > > IOW, it's hard to generalize confidently from a sample of one :-( > >> Implementation: Add a new parameter named 'common' or 'threshold' or >> whatever that defaults to 1. > > I'd call it "autojunk", cuz that's what it would do. Also a useful > syntactic overlap with the name of the current "isjunk" argument. I like that. I am now leaning toward the following? G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk = True' to the constructor signature. This is the minimal change that fixes the bug of no choice while keeping the default as is. So it is a minimal violation of the usual stricture against API changes in bugfix releases. I would doc this as "Use an internal heuristic that identifies 'common' items as junk." and separately describe the 'current heuristic', leaving open the possibility of changing it. Possible suboption: enforce 'autojunk in (True,False)' so the user cannot forget that it is a switch and not a tuning parameter. In 3.2, expose as an attribute a tuple 'hueristic' or '_heuristic' with the tuning parameters for the current heuristic. Adding the _ would indicate that is it a private, expert-only, use at your own risk, subject to change attribute. If we agree on this much, we can then discuss what the tuple should be for 3.2. >> Other changes that apply regardless of the heuristic/api change: >> >> Update the code to use sets (newer than difflib) instead of dicts with >> values set to 1. >> Directly expose the set of 'common' items as an additional attribute of >> SequenceMatcher instances. Such instance attributes are currently >> undocumented, so adding one can hardly be a problem. Add documention >> thereof. Being able to see the effect of the heuristic when it is not turned >> off might help people decide whether or not to use it, or how to tune the >> threshold for smallish alphabets where 1 is too small. > > Wholly agreed. junkdict (after turning it into a set) should also be > exposed - when someone passes in a fancy regexp matcher for the isjunk > argument, they can be surprised at what their regexp matches. Being > able to see the results can be helpful there too, for debugging. I meant to include junkdict also. -- Terry Jan Reedy
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4