Terry Reedy wrote:
> "Fredrik Lundh" <fredrik at pythonware.com> wrote in message
> news:ci3g2d$m3g$1 at sea.gmane.org...
>
>> usually shorter in languages with many ideographs (my non-scientific
>> tests indicate that chinese text uses about 4 times less symbols than
>> english; I'm sure someone can dig up better figures).
>
> This is why I am not especially enamored of Unicode and the prospect of
> Python becoming married to it. It is heavily weighted in favor of
> efficiently representing Chinese and inefficiently representing English.

Hmm, the Asian world has a very different view on these things.
Representing English ASCII text in UTF-8 is very efficient (one byte
per character), while typical Asian texts use between 1.5 and 2 times
as much space as their equivalents in one of the respective Asian
encodings. Take, for example, the Japanese translation of the bible
(only parts of the New Testament) from http://www.cozoh.org/denmo/:

>>> bible = unicode(open('denmo.txt', 'rb').read(), 'shift-jis')
>>> len(bible)
386980
>>> len(bible.encode('utf-8'))
1008272
>>> len(bible.encode('shift-jis'))
697626

Some stats:
-----------
Number of unique code points: 1512

Code point frequency (truncated):
u'\u305f'  : =================================
u' '       : =============================
u'\u306e'  : ===========================
u'\uff0c'  : ==========================
u'\r'      : ========================
u'\n'      : ========================
u'\u306b'  : =====================
u'\u3044'  : =================
u'\u3066'  : =================
u'\u3057'  : ================
u'\u3002'  : ================
u'\u306f'  : ================
u'\u306a'  : ===============
u'\u3092'  : ==============
u'\u3068'  : ============
u'\u308b'  : ============
u'\u3089'  : ===========
u'\u3063'  : ===========
u':'       : ===========
u'}'       : ===========
u'{'       : ===========
u'\u304c'  : ==========
u'\u308c'  : ==========
u'\u304b'  : =========
u'\u3067'  : =========
u'1'       : =========
u'\u5f7c'  : ========
u'\u3053'  : ========
u'\u3042'  : =======
u'\u3061'  : =======
u'\u3046'  : =======
u'2'       : =======
...

As you can see, most code points live in the 0x3000 area. These code
points require 3 bytes in UTF-8, but only 2 bytes in UTF-16.

> To give English equivalent treatment, the 20,000 or so most common
> words, roots, prefixes, and suffixes would each get its own codepoint.

I suggest you take this one up with the Unicode Consortium :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 14 2004)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
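
The post shows the output of the frequency count but not the script
that produced it. A minimal sketch of how it could be reproduced,
written in modern Python 3 rather than the 2004-era Python above; the
file name, the number of rows shown, and the bar scaling are guesses:

from collections import Counter

# Count every code point in the text (assumes denmo.txt from
# http://www.cozoh.org/denmo/ is in the current directory).
with open('denmo.txt', 'rb') as f:
    bible = f.read().decode('shift-jis')

counts = Counter(bible)
print('Number of unique code points:', len(counts))

print('Code point frequency (truncated):')
most_common = counts.most_common(32)
scale = 33.0 / most_common[0][1]   # widest bar gets 33 '=' signs
for char, n in most_common:
    print('%-10r : %s' % (char, '=' * int(n * scale)))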
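
The per-code-point byte costs quoted in the post are easy to verify
interactively (Python 3 session; 'utf-16-be' is used so the 2-byte BOM
is not counted):

>>> len(u'\u305f'.encode('utf-8'))      # hiragana "ta", top of the chart
3
>>> len(u'\u305f'.encode('utf-16-be'))
2
>>> len(u'\u305f'.encode('shift-jis'))
2

This 3:2 ratio is roughly what shows up in the totals above:
1008272 / 697626 is about 1.45, close to the 1.5-2x range mentioned at
the start of the post (the ASCII punctuation and newlines in the file,
which cost one byte in both encodings, pull the ratio below 1.5).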