"Ken Seehof" <kens at sightreader.com> wrote in message news:mailman.987656068.4191.python-list at python.org... [snip -- some quoting-level problems -- Ken's quoting Dan] """ > How about this - apply a whole set of tests to the message. Each test > gives a "spammness" score - e.g. 10 points for being all caps, 50 points > for having the word 'viagara', 100 points for having a suspicious From: > address like *@yahoo.com. Add the scores from the different tests, and > if the sum exceeds, say, 200 points, then call it "spam." > > So, how do you figure out a good value for each test score? This is where > you could use a neural network or genetic algorithm. Pick a set of > scores, feed the program lots of messages (both spam and non-spam), and > see how accurate it is. Iterate until it rejects every spam email and > accepts every non-spam... """ There may not exist a vector of feature weights that performs perfectly, of course. What one generally wants is a vector of feature weights that _optimizes_ some performance score. """ Excellent idea, Dan. That's conveniently sidesteps the most difficult issue: getting the neural network to actually come up with linguistic rules. Once an intelligent human specifies the set of rules, the neural """ Right. Extracting the features for classification is an order of magnitude harder that weighing them optimally. My old-fashioned approach to such feature-weighting problems is to apply a general-purpose optimization algorithm (simulated annealing, for choice). That's easy to code/test/tune and lets me experiment with all sort of "weird" nonlinearities in the classification engine, as long as I can get a classifier that takes a vector of N real parameters and can be run on the training set to produce a classification whose 'cost' is then measurable. False-positives and false-negatives can of course easily be given different costs in this approach, and in some cases being able to get a three-way classifier (yes/no/dunno, with some cost for each dunno answer of course) can be important. A faithful Python transcription of Goffe's Fortran tutorial program for simulated annealing (the Fortran original is at http://emlab.berkeley.edu/Software/abstracts/goffe895.html) turns out to be less than 600 lines, over half of which are docstrings, comments and printing-functions that only exist to help gain understanding about the algorithm, the function one is studying, etc. Unfortunately, I'm not sure I can redistribute that transcription, given Goffe's copyright -- it IS a derived work of his copyrighted one. It could of course be redone in a more Pythonical mold, and to use some underlying extension module if available (I am not aware of other Simulated Annealing implementations in Python, or as Python extension modules, at this time, although of course it's likely that many exist -- but I can't find them on the net!). I have written Dr Goffe asking for permission, and I think I can in the meantime email sa.py privately (though not "publish" it) if requested. Alex