On 2008-04-22 18:33, Bill Janssen wrote:
> The 2002 paper "A language and character set determination method
> based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
> Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to
> go about this.

Thanks for the reference. Looks like the existing research on this
just hasn't made it into the mainstream yet.

Here's their current project:

    http://www.language-observatory.org/

Looks like they are now focusing more on language detection.

Another interesting paper using n-grams:

    "Language Identification in Web Pages"
    by Bruno Martins and Mário J. Silva
    http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf

And one using compression:

    "Text Categorization Using Compression Models"
    by Eibe Frank, Chang Chui, Ian H. Witten
    http://portal.acm.org/citation.cfm?id=789742

> They're looking at "LSE"s, language-script-encoding triples; a
> "script" is a way of using a particular character set to write in a
> particular language.
>
> Their system has these requirements:
>
> R1. the response must be either "correct answer" or "unable to
>     detect", where "unable to detect" includes "other than
>     registered" [the registered set of LSEs];
>
> R2. applicable to multi-LSE texts;
>
> R3. never accept a wrong answer, even when the program does not have
>     enough data on an LSE; and
>
> R4. applicable to any LSE text.
>
> So, no wrong answers.
>
> The biggest disadvantage would seem to be that the registration data
> for a particular LSE is kind of bulky; on the order of 10,000
> shift-codons, each of three bytes, about 30K uncompressed.
>
> http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

For a server-based application, that doesn't sound too large. Unless
you're covering a very broad scope, I don't think you'd need more than
a few hundred LSEs for a typical application - nothing you'd want to
put into the Python stdlib, though.

> Bill
>
>>> IMHO, more research has to be done in this area before a
>>> "standard" module can be added to Python's stdlib... and who
>>> knows, perhaps we're lucky and by that time everyone is using
>>> UTF-8 anyway :-)
>>
>> I walked over to our computational linguistics group and asked.
>> This is often combined with language guessing (which uses a similar
>> approach, but using characters instead of bytes), and apparently it
>> can usually be done with high confidence. Of course, they're
>> usually looking at clean texts, not random "stuff". I'll see if I
>> can get some references and report back -- most of the research on
>> this was done in the 90's.
>>
>> Bill

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 22 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
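
[Editor's note: to make the general idea concrete, here is a rough
sketch of the byte n-gram profile approach the thread is discussing.
This is not the Suzuki/Mikami algorithm nor the method of any of the
cited papers; the trigram size, the profile size, the 0.4 threshold,
and the toy training samples are all invented for illustration.]

    # Rough sketch: build a set of frequent byte trigrams per registered
    # (language, encoding) pair, score unknown input against each profile,
    # and answer "unable to detect" (None) when no profile clears a
    # threshold -- so the detector prefers "no answer" over a wrong one.

    from collections import Counter

    N = 3  # byte trigrams

    def make_profile(data, size=300):
        """Return the `size` most frequent byte n-grams of `data` as a set."""
        grams = Counter(data[i:i + N] for i in range(len(data) - N + 1))
        return {g for g, _ in grams.most_common(size)}

    def detect(data, profiles, threshold=0.4):
        """Return the best-matching label, or None for "unable to detect"."""
        sample = make_profile(data)
        if not sample:
            return None
        best_label, best_score = None, 0.0
        for label, prof in profiles.items():
            score = len(sample & prof) / len(sample)
            if score > best_score:
                best_label, best_score = label, score
        return best_label if best_score >= threshold else None

    if __name__ == "__main__":
        # Toy training data; a real registry would use much larger samples.
        training = {
            ("de", "utf-8"): "Über größere Textmengen lässt sich mehr sagen. " * 50,
            ("de", "latin-1"): "Über größere Textmengen lässt sich mehr sagen. " * 50,
            ("en", "ascii"): "The quick brown fox jumps over the lazy dog. " * 50,
        }
        profiles = {(lang, enc): make_profile(text.encode(enc))
                    for (lang, enc), text in training.items()}

        print(detect("größere Textmengen über Wörter".encode("latin-1"), profiles))
        # -> ('de', 'latin-1')

Character-based language guessing, as mentioned in the thread, works
the same way, only over decoded characters instead of raw bytes.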