[Greg Ward]
> ...
> Just how many messages fall in that grey area anyways?

Heh. Here's the probability distribution for the 4000 ham messages in my first test pair:

Ham distribution for this pair:
* = 67 items
  0.00  4000 ************************************************************
  2.50     0
  5.00     0
  7.50     0
 10.00     0
 12.50     0
 15.00     0
 17.50     0
 20.00     0
 22.50     0
 25.00     0
 27.50     0
 30.00     0
 32.50     0
 35.00     0
 37.50     0
 40.00     0
 42.50     0
 45.00     0
 47.50     0
 50.00     0
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     0
 65.00     0
 67.50     0
 70.00     0
 72.50     0
 75.00     0
 77.50     0
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     0
 92.50     0
 95.00     0
 97.50     0

That is, they *all* got a "probability score" less than 2.5% (0.025). Here's the spam probability distribution across the same run:

Spam distribution for this pair:
* = 46 items
  0.00     5 *
  2.50     2 *
  5.00     1 *
  7.50     0
 10.00     0
 12.50     0
 15.00     1 *
 17.50     0
 20.00     1 *
 22.50     0
 25.00     2 *
 27.50     1 *
 30.00     0
 32.50     1 *
 35.00     0
 37.50     0
 40.00     0
 42.50     0
 45.00     1 *
 47.50     1 *
 50.00     1 *
 52.50     0
 55.00     0
 57.50     1 *
 60.00     3 *
 62.50     0
 65.00     2 *
 67.50     0
 70.00     0
 72.50     0
 75.00     1 *
 77.50     1 *
 80.00     0
 82.50     0
 85.00     0
 87.50     0
 90.00     3 *
 92.50     1 *
 95.00     6 *
 97.50  2715 ************************************************************

IOW, a spam usually scored at least 0.975 on this run, but some spams scored under 0.025. There's very little "in the middle". I've got 19 more sets like this if you care a lot <wink>.
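For readers who want to reproduce this kind of display, here is a minimal sketch of how such a histogram could be printed. It is not the test driver's actual code: the name `format_histogram`, the 40 buckets of width 2.5%, and the "at most 60 stars" scaling rule are all assumptions inferred from the printed output above.

```python
import math

def format_histogram(probs, nbuckets=40, max_stars=60):
    """Return text lines for a histogram of scores in [0.0, 1.0].

    Hypothetical helper: bucket width and star scaling are guesses
    based on the printed histograms, not the original tester code.
    """
    counts = [0] * nbuckets
    for p in probs:
        # A score of exactly 1.0 is placed in the last bucket.
        i = min(int(p * nbuckets), nbuckets - 1)
        counts[i] += 1
    # One star covers enough items that the tallest bar fits in max_stars.
    per_star = max(1, math.ceil(max(counts) / max_stars))
    lines = ["* = %d items" % per_star]
    for i, n in enumerate(counts):
        # Any non-empty bucket gets at least one star.
        stars = "*" * math.ceil(n / per_star)
        label = 100.0 * i / nbuckets
        lines.append(("%6.2f %5d %s" % (label, n, stars)).rstrip())
    return lines

if __name__ == "__main__":
    # 4000 messages all scoring 0.0, like the ham run above.
    print("\n".join(format_histogram([0.0] * 4000)))
```

Under these assumptions, 4000 items in one bucket gives a scale of 67 items per star, and a tallest bar of 2715 gives 46, which agrees with the two headers shown above.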
Here's the aggregate across all 20 runs (each msg is counted 4 times here, once for each of the runs in which it served in the prediction set against training on one of the 4 spam+ham collection pairs it doesn't belong to):

Ham distribution for all runs:
* = 1333 items
  0.00 79938 ************************************************************
  2.50     8 *
  5.00     3 *
  7.50     0
 10.00     3 *
 12.50     1 *
 15.00     3 *
 17.50     1 *
 20.00     1 *
 22.50     0
 25.00     0
 27.50     0
 30.00     1 *
 32.50     4 *
 35.00     2 *
 37.50     0
 40.00     2 *
 42.50     0
 45.00     1 *
 47.50     1 *
 50.00     1 *
 52.50     0
 55.00     0
 57.50     0
 60.00     0
 62.50     1 *
 65.00     0
 67.50     0
 70.00     2 *
 72.50     0
 75.00     1 *
 77.50     1 *
 80.00     0
 82.50     0
 85.00     1 *
 87.50     1 *
 90.00     0
 92.50     1 *
 95.00     1 *
 97.50    21 *

Spam distribution for all runs:
* = 905 items
  0.00   215 *
  2.50    18 *
  5.00     8 *
  7.50    12 *
 10.00     6 *
 12.50     6 *
 15.00    14 *
 17.50     6 *
 20.00    10 *
 22.50     8 *
 25.00     9 *
 27.50     9 *
 30.00     3 *
 32.50     3 *
 35.00     5 *
 37.50     3 *
 40.00     7 *
 42.50    24 *
 45.00     3 *
 47.50    29 *
 50.00    34 *
 52.50     8 *
 55.00     6 *
 57.50    18 *
 60.00    64 *
 62.50    12 *
 65.00     7 *
 67.50     5 *
 70.00     3 *
 72.50     7 *
 75.00     4 *
 77.50    18 *
 80.00    10 *
 82.50    23 *
 85.00    13 *
 87.50    20 *
 90.00    27 *
 92.50    18 *
 95.00    57 *
 97.50 54256 ************************************************************

In percentage terms, very little lives outside the tips of the tail ends. Note that calling the spam cutoff 0.975 instead of 0.90 would save 2 false positives, at the expense of letting an additional 27+18+57 = 102 spams go thru.

Here's the first example of a low-prob spam:

"""
Low prob spam!
0.0133104753792 Data/Spam/Set2/8007.txt
prob('from:email name:<janet691') = 0.5
prob('the') = 0.5
prob('subject:Fred') = 0.5
prob('you') = 0.5
prob('was') = 0.305052
prob('bool:noorg') = 0.614515
prob('proposal') = 0.100629
prob('will') = 0.557569
prob('talk') = 0.507463
prob('send') = 0.858078
prob('nice') = 0.227838
prob('from:email addr:ac') = 0.0754717
prob('from:email addr:uk>') = 0.0488301
prob('thanks,') = 0.0300188
prob('subject:Hey') = 0.99
prob('today') = 0.852792

Return-Path: <janet691@cranfield.ac.uk>
Delivered-To: bruce-spam@localhost
Received: (qmail 14409 invoked by alias); 6 Mar 2002 20:07:42 -0000
Delivered-To: spam@bruce-guenter.dyndns.org
Received: (qmail 14405 invoked from network); 6 Mar 2002 20:07:42 -0000
Received: from agamemnon.bfsmedia.com (204.83.201.2)
  by lorien.untroubled.org (192.168.1.3) with SMTP; 06 Mar 2002 20:07:42 -0000
Received: (qmail 13063 invoked by uid 500); 6 Mar 2002 20:02:05 -0000
Delivered-To: em-ca-spam@em.ca
Received: (qmail 13057 invoked by uid 502); 6 Mar 2002 20:02:05 -0000
Delivered-To: bfsmedia-goose.kennels@bfsmedia.com
Received: (qmail 13051 invoked from network); 6 Mar 2002 20:02:05 -0000
Received: from unknown (HELO smtp2.forserve.com) (63.170.11.221)
  by agamemnon.bfsmedia.com with SMTP; 6 Mar 2002 20:02:05 -0000
Date: Wed, 6 Mar 2002 15:12:41 -0500
Message-Id: <200203062012.g26KCfn08192@smtp2.forserve.com>
X-Mailer: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1) Gecko/20010607
Reply-To: <janet691@cranfield.ac.uk>
From: <janet691@cranfield.ac.uk>
To: <goose01977@bellsouth.net>
Subject: Hey Fred
Content-Length: 95
Lines: 9

Fred,

It was nice to talk to you today
I will send the proposal tonight.

Thanks,

Heidi
"""

You figure it out <wink>. I suspect bfsmedia would have added a high spam score if I looked at Received lines, but even several additional strong spam indicators wouldn't be enough to nail this one.
BTW, this msg shows up many times in the spam corpora, varying the "Fred" and "Heidi" with other male and female names; I assume this is a harvester that's trying to provoke the recipient into replying.

Several others are damaged in ways such that the email pkg can't create a msg out of them. I could easily enough add code to force such a msg to be considered spam.

Some are wildly embarrassing failures:

"""
Low prob spam!
0.000102019995919 Data/Spam/Set3/681.txt
prob('common,') = 0.01
prob('definately') = 0.01
prob('logic') = 0.01
prob('hell,') = 0.01
prob('it".') = 0.01
prob('obvious.') = 0.01
prob('theory') = 0.01
prob('whilst') = 0.01
prob('earning') = 0.99
prob('same,') = 0.01
prob('$500,000') = 0.99
prob('"bull",') = 0.99
prob('year!!!') = 0.99
prob('internet!') = 0.99
prob('tv:') = 0.99
prob('*this') = 0.99

Return-Path: <ihrockrat3213@hotmail.com>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 25721 invoked from network); 17 Aug 2002 01:05:07 -0000
Received: from unknown (HELO 65.102.48.161) (65.102.48.161)
  by churchill.factcomp.com with SMTP; 17 Aug 2002 01:05:07 -0000
Received: from unknown (149.89.93.47)
  by rly-xr02.mx.aol.com with NNFMP; Aug, 17 2002 1:50:22 AM -0800
Received: from anther.webhostingtalk.com ([88.58.121.118])
  by da001d2020.lax-ca.osd.concentric.net with QMQP; Aug, 17 2002 12:40:13 AM -0700
Received: from 34.57.158.148 ([34.57.158.148])
  by rly-xr02.mx.aol.com with local; Aug, 17 2002 12:02:05 AM +0300
From: rnpyjohn <ihrockrat3213@hotmail.com>
To: Undisclosed Recipients
Cc:
Subject: Please read this letter carefully, it works 100%
Sender: rnpyjohn <ihrockrat3213@hotmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Date: Sat, 17 Aug 2002 02:03:28 +0100
X-Mailer: The Bat! (v1.52f) Business
X-Priority: 1
Content-Length: 15985

*This is a one time mailing and this list will never be used again.*

Hi,

SEEN THIS MAIL BEFORE?, SICK OF FINDING IT IN YOUR INBOX?
ME TOO, HONEST

I was exactly the same, till one day whilst i was complaining about how tired i was of seeing ...
"""

The first 16 most extreme indicators are split 9 highly in favor of ham (.01) and 7 highly in favor of spam (.99). If I hadn't folded case away to let stinking conference announcements through <wink>, I expect it would have latched on to the SCREAMING at the start instead of looking deeper. Looking at the To: line probably would nail this one too, as "Undisclosed Recipients" has two 0.99 spam indicators right there.

Whatever, you *don't* want to look at msgs with a mix of just 0.99 and 0.01 thingies: it's not all that unusual to get such an extreme mix, in spam or ham.

this-isn't-your-father's-idea-of-probability<wink>-ly y'rs - tim
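
A worked check on the 9-versus-7 split: the post does not spell out how the 16 clue probabilities are combined, but if one assumes the Graham-style rule prod(p) / (prod(p) + prod(1 - p)), the score reported for Data/Spam/Set3/681.txt falls out. The function name `combine` is hypothetical, and the rule itself is an assumption rather than something stated in the message.

```python
from math import prod  # available in Python 3.8 and later

def combine(probs):
    """Assumed Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam_evidence = prod(probs)
    ham_evidence = prod(1.0 - p for p in probs)
    return spam_evidence / (spam_evidence + ham_evidence)

# The 16 clues listed for Set3/681.txt: nine at 0.01 and seven at 0.99.
clues = [0.01] * 9 + [0.99] * 7
score = combine(clues)

if __name__ == "__main__":
    print(score)
```

Under that assumption, seven of the 0.01 clues cancel the seven 0.99 clues exactly, leaving two net ham clues: 0.01**2 / (0.01**2 + 0.99**2) = 1/9802, which is about 0.000102 and agrees with the printed score. A two-clue imbalance among extreme values is enough to pin the result against one end of the scale, which is why a mix of just 0.99s and 0.01s says so little.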