>>>>> "Skip" == Skip Montanaro <skip@pobox.com> writes:

    Skip> I began working on a Unicode HOWTO a few weeks ago, got a
    Skip> little ways on it, then ignored it until this morning.  I
    Skip> added a little bit more to it then decided I should get some
    Skip> feedback.  You can view it at

    Skip>     http://www.musi-cal.com/~skip/unicode/

Thanks!

A few comments.  Overall, I like this intro.  Technically it's
horrible<wink> but I think it will hit your target audience where they
live.

[What Is Unicode?]

1.  Characters are "atomic units of text" that have properties.  Since
    they're atoms, we represent them by integers in computer programs.
    Among the properties are their glyphs (graphical representation),
    classes (alpha, num, whitespace, etc), and so on.  It is a bad
    idea to identify characters with their glyphs.

2.  Alphabets are abstract sets of characters.  Coded character sets
    map characters to integer representations.  "Encoding" is a
    reasonable synonym for "coded character set".  Avoid "charset"
    except when talking about the charset parameter of Content-Type.

3.  Typo in last sentence "I will suggest that YOU should use UTF-8."

[Why UTF-8?]

1.  Most programming languages are restricted to ASCII, except perhaps
    for user-defined identifiers.  This means that programming tools
    need only be 8-bit clean to handle UTF without corruption.[1]

2.  Space efficiency is _not_ an advantage of UTF-8 vs. UTF-16.  ASCII
    and most Western European languages, yes.  Greek, Hebrew, Arabic
    or Russian will be nearly a wash (whitespace, punctuation, and
    numerals give you what savings you're gonna get), and everybody
    east of Eden takes a 50% hit.

    The real tradeoff is "string == array of fixed-width object"
    semantics[2] vs upward compatibility from ASCII for languages
    where most tokens contain only ASCII.

[Email]

1.  If you don't get a Content-Type charset parameter, you _must_
    assume US-ASCII.

[Mildly Corrupt Data]

1.  You can expect people to develop libraries for this kind of thing,
    but they are unlikely to be distributed.  Suggest that newbies ask
    around.

Footnotes:
[1]  This isn't quite true; consider the Lisp ?A notation for
character literals.  A naive byte-oriented parser will pick up only
the leading byte of a non-ASCII UTF-8 character, and probably choke
fatally on the trailing bytes.  But Python, C, Java, et al don't have
such literals---tokens with delimiters that are ASCII characters are
safe, both strings and identifiers.  You can ignore this issue.

[2]  Which UTF-16 actually doesn't give you!  Grrr.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py
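
P.S.  To put rough numbers on point 2 under [Why UTF-8?] and on
footnote [2], here is a minimal sketch, assuming a current Python 3
where str.encode() is available.  The sample strings and the helper
names encoded_sizes() and utf16_units() are illustrative choices made
here, not anything taken from the HOWTO.

```python
# Compare encoded sizes of short sample strings in UTF-8 and UTF-16.
# The samples are arbitrary illustrations, not measurements of real corpora.
SAMPLES = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf16_units(text):
    """Return the number of 16-bit code units text needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


for name, text in SAMPLES.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")

# Footnote [2]: a character outside the Basic Multilingual Plane (here
# U+1D11E, MUSICAL SYMBOL G CLEF) needs a surrogate pair, so UTF-16 is
# not a fixed-width encoding either.
print("U+1D11E needs", utf16_units("\U0001D11E"), "UTF-16 code units")
```

For these particular samples the English string is half the size in
UTF-8, the Russian one comes out nearly even, and the Japanese one is
half again as large in UTF-8 as in UTF-16.  Longer or differently
mixed text will shift the ratios somewhat.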