Hello,

Nice discussion!

Could the 'algorithm' that you describe make use of this: http://www.awaretek.com/python.html ?

If you take a look at it, let me know, and please involve me!

Seb

"Ken Seehof" <kens at sightreader.com> wrote in message news:mailman.987656068.4191.python-list at python.org...

"Dan Maas" <dmaas at nospam.dcine.com> says:
> > I've been saving up all the spam messages I get for the past two months.
> > I have about 1869 spam messages saved.
> > Now I'd like to develop a neural net based filter for my email program
> > and train it to recognize these messages as spam.
>
> Cool... I assume the main thing you are worrying about is accidentally
> rejecting non-spam emails, which might happen too easily with a
> naive keyword-based system.
>
> How about this - apply a whole set of tests to the message. Each test
> gives a "spamminess" score - e.g. 10 points for being all caps, 50 points
> for having the word 'viagara', 100 points for having a suspicious From:
> address like *@yahoo.com. Add the scores from the different tests, and
> if the sum exceeds, say, 200 points, then call it "spam."
>
> So, how do you figure out a good value for each test score? This is where
> you could use a neural network or genetic algorithm. Pick a set of
> scores, feed the program lots of messages (both spam and non-spam), and
> see how accurate it is. Iterate until it rejects every spam email and
> accepts every non-spam...
>
> Dan
> --
> http://mail.python.org/mailman/listinfo/python-list

Excellent idea, Dan. That conveniently sidesteps the most difficult issue: getting the neural network to actually come up with linguistic rules. Once an intelligent human specifies the set of rules, the neural net should have no difficulty coming up with an optimal non-linear function of pre-processed features (i.e. the "rules") to identify spam. Analysis of the weights after training will help remove rules that turn out to be irrelevant.

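Dan's scoring scheme, with the scores learned rather than hand-picked, might be sketched roughly as follows. This is only an illustration under stated assumptions: the message format (a dict with "body" and "from" keys), the three rule functions, and all function names are hypothetical, and a simple perceptron-style update is swapped in as the plainest stand-in for the neural network or genetic algorithm discussed in the thread. It learns only a linear combination of the rules, not the non-linear function Ken mentions.

```python
# Hypothetical rule tests. Each returns 1.0 if the message trips the rule,
# else 0.0. A message is assumed to be a dict with "body" and "from" keys.
def is_all_caps(msg):
    letters = [c for c in msg["body"] if c.isalpha()]
    return 1.0 if letters and all(c.isupper() for c in letters) else 0.0

def mentions_viagara(msg):
    return 1.0 if "viagara" in msg["body"].lower() else 0.0

def suspicious_from(msg):
    return 1.0 if msg["from"].lower().endswith("@yahoo.com") else 0.0

RULES = [is_all_caps, mentions_viagara, suspicious_from]

def features(msg):
    """The input vector: one 0/1 entry per rule."""
    return [rule(msg) for rule in RULES]

def score(weights, msg):
    """Sum of the per-rule scores for the rules this message trips."""
    return sum(w * x for w, x in zip(weights, features(msg)))

def is_spam(weights, msg, threshold=200.0):
    return score(weights, msg) > threshold

def train(messages, labels, threshold=200.0, step=10.0, epochs=100):
    """Perceptron-style search for rule scores (an assumed stand-in for
    the neural net / genetic algorithm). labels[i] is True for spam."""
    weights = [0.0] * len(RULES)
    for _ in range(epochs):
        errors = 0
        for msg, label in zip(messages, labels):
            if is_spam(weights, msg, threshold) != label:
                errors += 1
                # Raise the scores of tripped rules for missed spam,
                # lower them for wrongly rejected non-spam.
                direction = 1.0 if label else -1.0
                weights = [w + direction * step * x
                           for w, x in zip(weights, features(msg))]
        if errors == 0:  # every training message classified correctly
            break
    return weights
```

With Dan's example scores, `score([10.0, 50.0, 100.0], msg)` adds up the points for whichever tests fire, and `train` replaces those hand-picked numbers with learned ones. A rule whose learned weight stays near zero is a candidate for removal, in the spirit of Ken's remark about analysing the weights.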
In other words, the input vector is simply the results from your arbitrary rule set. Since irrelevant rules are fairly harmless (other than decreasing performance), one could initialize the rule set to include a rule for every word that occurs in spam messages more often than in non-spam messages. Then supplement it with rules like the ones you mention.

Here's another idea for acquiring sample data. Send 'please send me more info' messages to everyone who has sent you spam, giving a newly created spam-recipient email address. Your address will probably be sold to everyone. BTW, make sure your spam recipient is on an ISP that does -not- defend against spam! (Technically, it's not actually spam you'd be receiving since you are explicitly requesting it, but close enough :-)

I want to be involved in this project. Let's take this offline.

- Ken

----------------------------------------------------
Copyright (c) 2001 by Ken Seehof
This document may not be distributed, copied, duplicated, or replicated, or duplicated in any form without express permission by Ken Seehof. Permission is hereby granted.
kseehof at neuralintegrator.com
----------------------------------------------------
The opinions expressed herein are not necessarily those of George W. Bush.
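
Ken's suggestion of seeding the rule set with one rule per word that leans towards spam could be sketched along these lines. The function names are hypothetical, and "occurs more often" is interpreted here, as an assumption, as "appears in a larger fraction of spam messages than of non-spam messages"; other readings (raw counts, for instance) are equally plausible.

```python
import re
from collections import Counter

def tokenize(text):
    """Set of lowercase words in a message body (a deliberately crude split)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def spam_leaning_words(spam_bodies, ham_bodies):
    """Words found in a larger fraction of spam than of non-spam messages."""
    spam_counts = Counter(w for body in spam_bodies for w in tokenize(body))
    ham_counts = Counter(w for body in ham_bodies for w in tokenize(body))
    n_spam = max(len(spam_bodies), 1)  # avoid dividing by zero
    n_ham = max(len(ham_bodies), 1)
    return sorted(w for w, c in spam_counts.items()
                  if c / n_spam > ham_counts[w] / n_ham)

def make_word_rule(word):
    """Build a rule that returns 1.0 if the body contains the word, else 0.0."""
    def rule(body):
        return 1.0 if word in tokenize(body) else 0.0
    rule.__name__ = "has_" + word
    return rule
```

Each word returned by `spam_leaning_words` would become one entry of the input vector via `make_word_rule`, alongside hand-written rules. Words that turn out to be irrelevant should simply end up with small weights after training.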