RetroSearch Browse

Tue Jul 13 04:28:00 CEST 2010 · https://mail.python.org/pipermail/python-dev/2010-July/101862.html

On 7/11/2010 11:02 PM, Tim Peters wrote:

>> The heuristic lowered the reported match ratio from .96 to .88, which
>> would be bad when one wanted the unaltered value.

> BTW, it's not clear whether ratio() computes a _useful_ value in the
> presence of junk, however that may be defined.

I agree, which is one reason why one should be to disable auto-junking.

There are a number of statistical methods for analyzing similarity 
matrices, analogous to correlation matrices, except that entries are in 
[0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such 
analyses for sets of species. Similarity measures should ususally be 
symmetric and increase with greater matching. The heuristic can cause 
both to fail.

> I suspect nobody cares ;-)

There are multiple possible definitions of similarity for sets (and 
arguments thereabout). I am sure the same is true for sequences. But I 
consider the definition for .ratio, without the heuristic, to be 
sensible. I would consider using it should the occasion arise.

> It certainly was the intent that nothing would be
> called junk unless it appeared at least twice, so the "n>= 200"
> clause ensures that 1% of n is at least 2.

Since 2 cannot be greater than something that is at least 2, you ensured 
that nothing would be called junk unless it appears as least thrice.

> However, I'm wary of introducing a generalization in the absence of
> experience saying people would use it.  Is this the right kind of
> parametrization?  Is this even the right kind of way to go about
> auto-detecting junk?  I know it worked great for the original use case
> that drove it, but I haven't seen anyone say they want a notion of
> auto-junk detection for other uses - just that they _don't_ want the
> wholly inappropriate current auto-junk detection in (some or all) of
> their uses.
>
> IOW, it's hard to generalize confidently from a sample of one :-(
>
>> Implementation: Add a new parameter named 'common' or 'threshold' or
>> whatever that defaults to 1.
>
> I'd call it "autojunk", cuz that's what it would do.  Also a useful
> syntactic overlap with the name of the current "isjunk" argument.

I like that. I am now leaning toward the following?

G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 
'autojunk = True' to the constructor signature. This is the minimal 
change that fixes the bug of no choice while keeping the default as is. 
So it is a minimal violation of the usual stricture against API changes 
in bugfix releases. I would doc this as "Use an internal heuristic that 
identifies 'common' items as junk." and separately describe the 'current 
heuristic', leaving open the possibility of changing it.

Possible suboption: enforce 'autojunk in (True,False)' so the user 
cannot forget that it is a switch and not a tuning parameter.

In 3.2, expose as an attribute a tuple 'hueristic' or '_heuristic' with 
the tuning parameters for the current heuristic. Adding the _ would 
indicate that is it a private, expert-only, use at your own risk, 
subject to change attribute.

If we agree on this much, we can then discuss what the tuple should be 
for 3.2.

>> Other changes that apply regardless of the heuristic/api change:
>>
>> Update the code to use sets (newer than difflib) instead of dicts with
>> values set to 1.

>> Directly expose the set of 'common' items as an additional attribute of
>> SequenceMatcher instances. Such instance attributes are currently
>> undocumented, so adding one can hardly be a problem. Add documention
>> thereof. Being able to see the effect of the heuristic when it is not turned
>> off might help people decide whether or not to use it, or how to tune the
>> threshold for smallish alphabets where 1 is too small.
>
> Wholly agreed.  junkdict (after turning it into a set) should also be
> exposed - when someone passes in a fancy regexp matcher for the isjunk
> argument, they can be surprised at what their regexp matches.  Being
> able to see the results can be helpful there too, for debugging.

I meant to include junkdict also.

-- 
Terry Jan Reedy

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://mail.python.org/pipermail/python-dev/2010-July/101862.html below:

[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken