"Da Silva, Mike" wrote:
>
> Most of the ASCII string functions do indeed work for UTF-8. I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF8. Since UTF-8 is
> a superset of ASCII, this all works fine.
>
> Some of the character classification functions etc can be flaky when used
> with UTF8 characters outside the ASCII range, but simple string operations
> work fine.

That's why there's the <defencbuf> buffer which holds the UTF-8
encoded value...

> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
>
> 1. UTF-8 allows all characters to be displayed (in some form or other)
> on the user's machine, with or without native fonts installed. Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable. Trying to decode a block of raw UTF-16 is a
> pain.

True.

> 2. UTF-8 works with most existing string manipulation libraries quite
> happily. It is also portable (a char is always 8 bits, regardless of
> platform; wchar_t varies between 16 and 32 bits depending on the underlying
> operating system (although unsigned short does seem to work across
> platforms, in my experience)).

You mean with the compiler applying the needed 16->32 bit extension?

> 3. UTF-16 has some advantages in providing fixed width characters and,
> (ignoring surrogate pairs etc) a modeless encoding space. This is an
> advantage for fast string operations, especially on CPUs that have
> efficient operations for handling 16bit data.

Right, and this is a major argument for using 16 bit encodings without
state internally.

> 4. UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.

Could you elaborate on this one? It is one of the open issues
in the proposal.

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

AFAIK, the RE engines in Python are 8-bit clean...

BTW, wouldn't it be possible to take pcre and have it use Py_Unicode
instead of char? [Of course, there would have to be some extensions
for character classes etc.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
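
For readers who want to poke at the points above (ASCII compatibility of UTF-8, embedded nulls in UTF-16, surrogate pairs), here is a minimal sketch using Python's built-in codecs. It is only an illustration and not code from this thread; the sample string and all variable names are made up for the example.

```python
# Illustrative sketch only; the sample string and names are hypothetical.
text = "SELECT name FROM t WHERE city = 'Zürich'"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# ASCII characters keep their single-byte values in UTF-8, so a byte-level
# search for an ASCII substring finds it at the same offset as long as
# everything before it is ASCII too.
ascii_search_ok = utf8.find(b"WHERE") == text.find("WHERE")

# Characters outside the ASCII range take more than one byte in UTF-8.
u_umlaut_len = len("ü".encode("utf-8"))

# UTF-8 only emits a zero byte for U+0000 itself, while UTF-16 emits one
# for every ASCII character, which is what breaks NUL-terminated C strings.
utf8_has_nul = b"\x00" in utf8
utf16_has_nul = b"\x00" in utf16

# A character outside the BMP needs a surrogate pair, i.e. two 16-bit
# units, so UTF-16 is only fixed width if surrogates are ignored.
clef_units = len("\U0001D11E".encode("utf-16-le")) // 2
```

The zero bytes in the UTF-16 form are the reason point 5 talks about re-implementing the C runtime string functions, and the two-unit result for the last character is the caveat behind "ignoring surrogate pairs" in point 3.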