
[Python-Dev] PEP 263 considered faulty (for some Japanese)

Tom Emerson <tree@basistech.com>
Wed, 13 Mar 2002 09:41:01 -0500
Stephen J. Turnbull writes:
> >>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:
>
>     Martin> Reliable detection of encodings is a good thing, though,
>
> I would think that UTF-8 can be quite reliably detected without the
> "BOM".

Detecting UTF-8 is relatively straightforward: Martin Dürst has
presented on this at the last few Unicode conferences. Implementing
it is trivial for anyone who thinks about it.
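
For the curious, a minimal sketch of such a check in present-day
Python; the function name and the "must contain a high byte" rule are
my own framing, not anything from Dürst's presentations:

def looks_like_utf8(data: bytes) -> bool:
    """True if 'data' is well-formed UTF-8 and actually uses
    multi-byte sequences (pure ASCII is valid UTF-8 but proves
    nothing)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Any byte >= 0x80 that survived strict decoding came from a
    # well-formed multi-byte sequence; random ISO-8859-x or EUC
    # bytes almost never line up that way by accident.
    return any(b >= 0x80 for b in data)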

> I suppose you could construct short ambiguous sequences easily for
> ISO-8859-[678] (which are meaningful in the corresponding natural
> language), but it seems that even a couple dozen characters would make
> the odds astronomical that "in the wild" syntactic UTF-8 is intended
> to be UTF-8 Unicode (assuming you're expecting a text file, such as
> Python source).  Is that wrong?  Have you any examples?  I'd be
> interested to see them; we (XEmacs) have some ideas about
> "statistical" autodetection of encodings, and they'd be useful test
> cases.

The problem with the ISO-8859-x encodings is that the encoding space is
identical for all of them, making it difficult to determine which one
you are looking at without statistical or lexical methods. EUC-CN and
EUC-KR have a similar problem: just looking at the bytes themselves,
you cannot immediately tell whether a document is Chinese or
Korean. Compare this to Big 5, Shift JIS, or any of the ISO-2022
encodings, which are pretty easy to detect just by looking at the byte
sequences.
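
To make that concrete, here is the kind of byte-level test that works
for the ISO-2022 family, sketched in Python; the designator list below
is just the common ISO-2022-JP subset, and Big 5 or Shift JIS detection
would instead look at lead/trail byte ranges:

# ISO-2022-JP announces itself with escape sequences that switch
# character sets, so their presence in otherwise 7-bit data is a
# strong signal.
ISO2022_JP_ESCAPES = (
    b"\x1b$@",   # JIS C 6226-1978
    b"\x1b$B",   # JIS X 0208-1983
    b"\x1b(J",   # JIS X 0201 Roman
    b"\x1b(B",   # back to ASCII
)

def looks_like_iso2022_jp(data: bytes) -> bool:
    return any(esc in data for esc in ISO2022_JP_ESCAPES)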

There are a bunch of statistical language/encoding identifiers out in
the wild, and frankly most of them suck for real text. Anyone working
in the space usually starts with Cavnar and Trenkle's "N-Gram-Based
Text Categorization", then trains it with whatever random data they
have (http://odur.let.rug.nl/~vannoord/TextCat/competitors.html has a
list of tools). Finding enough text in the languages you are
interested in can be hard. For example, Lextek claims to support 260
languages. Examining the list shows that they used the UNHCR text
as their training corpus: languages whose UNHCR translation is
provided as GIFs or PDFs are not included in Lextek's tool. So, while
it can detect text written in a relatively obscure South American
language, it does not detect Chinese, Japanese, Korean, or
Arabic. Further, because of the small corpus size, it is very easy to
confuse it. My standard test for a language/encoding identifier is to
type the string in UPPER CASE. For example, go to

http://odur.let.rug.nl/~vannoord/TextCat/Demo/

and enter

This is a test of the emergency broadcasting system.

and it will decide that the text is in English. If you enter

THIS IS A TEST OF THE EMERGENCY BROADCASTING SYSTEM.

then it cannot identify the text. At least it says that much. Lextek's
tool identifies that text as Catalan or something.
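
For reference, the core of the Cavnar-Trenkle scheme is tiny; the
sketch below uses my own names and a profile size of 300 (roughly the
figure the paper uses), and leaves out the smoothing and thresholding a
real identifier needs:

from collections import Counter

def ngram_profile(text, max_n=5, size=300):
    """Rank the most frequent character n-grams (1..max_n) in 'text',
    Cavnar & Trenkle style.  Returns {ngram: rank}."""
    counts = Counter()
    for token in text.split():
        padded = "_" + token + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return {gram: rank
            for rank, (gram, _) in enumerate(counts.most_common(size))}

def out_of_place(doc_profile, lang_profile):
    """The 'out-of-place' distance: sum of rank differences, with a
    maximum penalty for n-grams missing from the language profile."""
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

Classification is just picking the language profile with the smallest
distance, which is presumably why ALL CAPS input falls over: the
trained profiles are built from mostly lowercase text, so almost none
of the uppercase n-grams match.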

The other issue to deal with when finding training data is its
cleanliness. Spidering the web can be hard because English is
everywhere. If you don't strip markup, then the markup can overwhelm
the text and result in a bogus profile.
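
One way to do that stripping with nothing but the (present-day) Python
standard library; a sketch only, and a real crawler would also have to
cope with entities, broken markup, and non-HTML content:

from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only text nodes, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def strip_markup(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())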

Anyway, statistical detection is good and doable, but it has to be
thought out, and you need enough data (we use at least 5 megabytes,
and often 10-15 megabytes, of clean text for each language and
encoding pair we support in our detector) to build a good model. The
more languages/encodings you support, the more data you need.

> But the Web in general provides (mandatory) protocols for identifying
> content-type, yet I regularly see HTML files with incorrect http-equiv
> meta elements, and XHTML with no encoding declaration containing Shift
> JIS.  Microsoft software for Japanese apparently ignores Content-Type
> headers and the like in favor of autodetection (probably because the
> same MS software regularly relies on users to set things like charset
> parameters in MIME Content-Type).

Mandatory protocols are useless if no one actually pays attention to
them. That is why browser manufacturers generally ignore the
Content-Type header. At the very least, if something claims to be
ISO-8859-1, it probably isn't. And worse than an XHTML document with
no encoding declaration containing Shift JIS, I've seen XHTML documents
that explicitly declare UTF-8 but contain Shift JIS.
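
A crude sanity check for that last case can be sketched with nothing
but the standard library; the charset extraction here is a toy regular
expression over the first kilobyte, not a real XML/HTML parser:

import re

def declared_encoding(data: bytes):
    """Toy extraction of the charset a document claims for itself,
    from an XML declaration or a meta charset attribute."""
    head = data[:1024]
    m = (re.search(rb"encoding=[\"']([\w-]+)", head)
         or re.search(rb"charset=[\"']?([\w-]+)", head))
    return m.group(1).decode("ascii", "replace") if m else None

def declaration_is_plausible(data: bytes) -> bool:
    """False when the bytes do not even decode under the encoding the
    document declares for itself."""
    enc = declared_encoding(data)
    if enc is None:
        return True        # nothing declared, nothing to contradict
    try:
        data.decode(enc)
    except (UnicodeDecodeError, LookupError):
        return False
    return True

Shift JIS bytes labelled as UTF-8 will generally fail the strict decode
within a few characters, so this catches exactly the documents described
above; the reverse mislabelling is much harder to catch this way.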

How does Java deal with this? Are all files required to be in UTF-8?

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


