"Stephen J. Turnbull" wrote: > > [What Is Unicode?] > > 1. Characters are "atomic units of text" that have properties. Since > they're atoms, we represent them by integers in computer programs. > Among the properties are their glyphs (graphical representation), > classes (alpha, num, whitespace, etc), and so on. It is a bad > idea to identify characters with their glyphs. > > 2. Alphabets are abstract sets of characters. Coded character sets > map characters to integer representations. "Encoding" is a > reasonable synonym for "coded character set". Avoid "charset" > except when talking about the charset parameter of Content-Type. > > 3. Typo in last sentence "I will suggest that YOU should use UTF-8." You might also want to grab some ideas from my "Python and Unicode" presentation I gave at Bordeaux last year: http://www.egenix.com/files/python/Unicode-Talk.pdf This also explains the various terms used in Unicode space and how they relate to Python. > [Email] > > 1. If you don't get a Content-Type charset parameter, you _must_ assume > US-ASCII. [WWW] 1. If you don't get a Content-Type charset parameter in an HTTP request, you _must_ assume Latin-1. [Console Input] I'd suggest to change the order of encodings (e.g. putting Latin-1 near the end isn't a good idea). Also, the fact that decoding works doesn't necessarily mean that the input did in fact use that encoding. A more appropriate way would be to try to reencode the decoded data in the given encoding since that is likely to fail for e.g. CP-1252 vs. Latin-1 if people use accented characters. If you're more into guessing an encoding, you should probably use an entropy approach: http://www.familie-holtwick.de/python/ > [Mildly Corrupt Data] Same comment here: you have to test round-trips, not just whether decoding fails. 
(Please note that not all codecs are round-trip safe -- see test_unicode.py for a list of ones that are) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/
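
The round-trip test suggested above can be sketched roughly as follows. This is a minimal illustration in present-day Python 3, not code from the message or from test_unicode.py; the function name round_trip_candidates and the sample candidate list are assumptions made up for the example. A candidate encoding survives only if decoding succeeds and re-encoding the result reproduces the original bytes exactly.

```python
def round_trip_candidates(data, encodings):
    """Return the candidate encodings for which decode-then-encode
    reproduces `data` byte for byte.

    Hypothetical helper for illustration: passing this check is a
    necessary condition, not proof that the input used that encoding.
    """
    survivors = []
    for name in encodings:
        try:
            text = data.decode(name)
            # A successful decode alone is not enough; the re-encoded
            # text must match the original bytes.
            if text.encode(name) == data:
                survivors.append(name)
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes (or text) are invalid for this codec.
            # LookupError: the codec name is unknown to Python.
            continue
    return survivors


raw = "café".encode("utf-8")
print(round_trip_candidates(raw, ["ascii", "utf-8", "latin-1"]))
```

Several candidates can survive for the same input, since some single-byte codecs accept and reproduce almost any byte sequence. The check narrows the list of candidates; it does not identify the encoding on its own.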