Most of the ASCII string functions do indeed work for UTF-8. I have made extensive use of this feature when writing translation logic to harmonize ASCII text (an SQL statement) with substitution parameters that must be converted from IBM EBCDIC code pages (5035, 1027) into UTF-8. Since UTF-8 is a superset of ASCII, this all works fine. Some of the character classification functions, etc., can be flaky when used with UTF-8 characters outside the ASCII range, but simple string operations work fine.

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an internal string representation are:

1. UTF-8 allows all characters to be displayed (in some form or other) on the user's machine, with or without native fonts installed. Naturally, anything outside the ASCII range will be garbage, but it is an immense debugging aid when working with character encodings to be able to touch and feel something recognizable. Trying to decode a block of raw UTF-16 is a pain.

2. UTF-8 works quite happily with most existing string manipulation libraries. It is also portable: a char is always 8 bits, regardless of platform, while wchar_t varies between 16 and 32 bits depending on the underlying operating system (although unsigned short does seem to work across platforms, in my experience).

3. UTF-16 has some advantages in providing fixed-width characters and (ignoring surrogate pairs, etc.) a modeless encoding space. This is an advantage for fast string operations, especially on CPUs that have efficient operations for handling 16-bit data.

4. UTF-16 would directly support a tightly coupled character properties engine, which would enable Unicode-compliant case folding and character decomposition to be performed without an intermediate UTF-8 <----> UTF-16 translation step.

5. UTF-16 requires string operations that do not make assumptions about nulls - this means re-implementing most of the C runtime functions to work with unsigned shorts.
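[Editor's note: the ASCII-transparency property described above, and the byte-versus-character distinction it glosses over, can be sketched in a few lines. Modern Python is used here for brevity; the same holds for C's byte-oriented string functions, since in UTF-8 no continuation byte falls in the ASCII range.]

```python
# UTF-8 is ASCII-transparent: every byte of a multi-byte sequence
# is >= 0x80, so byte-oriented searches and splits on ASCII
# delimiters never cut a character in half.
s = "Grüße, Mike"            # contains non-ASCII characters
data = s.encode("utf-8")     # raw UTF-8 bytes

# Splitting on an ASCII comma works unchanged on the byte string:
assert data.split(b",")[0].decode("utf-8") == "Grüße"

# But "length" now means bytes, not characters -- the flakiness
# mentioned above for anything beyond simple string operations:
assert len(s) == 11          # characters
assert len(data) == 13       # bytes: ü and ß take two bytes each
```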
Regards,
Mike da Silva

-----Original Message-----
From: Greg Stein [SMTP:gstein@lyra.org]
Sent: 12 November 1999 10:30
To: Tim Peters
Cc: python-dev@python.org
Subject: RE: [Python-Dev] Internationalization Toolkit

On Fri, 12 Nov 1999, Tim Peters wrote:
>...
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness

No... my main point was interaction with the underlying OS. I made a SWAG (Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower for various types of operations. As always, your infernal meddling has dashed that hypothesis, so I must retreat...

>...
> I expect either would work well. It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*? I don't. The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp <wink>. It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

Probably for the exact reason that you stated in your messages: many 8-bit (7-bit?) functions continue to work quite well when given a UTF-8-encoded string. i.e. they didn't have to rewrite the entire Perl/Tcl interpreter to deal with a new string type.

I'd guess it is a helluva lot easier for us to add a Python type than for Perl or Tcl to whack around with new string types (since they use strings so heavily).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

_______________________________________________
Python-Dev maillist - Python-Dev@python.org
http://www.python.org/mailman/listinfo/python-dev
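[Editor's note: the two UTF-16 caveats raised in the thread, surrogate pairs breaking the fixed-width promise and embedded NUL bytes defeating C's string routines, can be demonstrated briefly. Modern Python is used for concreteness; the `\N{...}` character choices are illustrative only.]

```python
# Point 3's caveat: UTF-16 code units are fixed width only outside
# the surrogate range. A character above U+FFFF needs two units.
s = "A\N{GREEK SMALL LETTER ALPHA}\N{MUSICAL SYMBOL G CLEF}"
utf16 = s.encode("utf-16-be")

assert len(s) == 3             # 3 characters...
assert len(utf16) // 2 == 4    # ...but 4 UTF-16 code units:
                               # U+1D11E becomes a surrogate pair.

# Point 5: even plain ASCII encodes with zero bytes (0x00 0x41 for
# "A"), so any NUL-terminated strlen/strcpy-style routine would
# truncate the string immediately -- hence the need to re-implement
# the C runtime functions over unsigned shorts.
assert utf16[0] == 0
```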