Paul Prescod wrote:

>     PEP: 261
>     Title: Support for "wide" Unicode characters
>     Version: $Revision: 1.3 $
>     Author: paulp@activestate.com (Paul Prescod)
>     Status: Draft
>     Type: Standards Track
>     Created: 27-Jun-2001
>     Python-Version: 2.2
>     Post-History: 27-Jun-2001
>
>     Abstract
>
>         Python 2.1 unicode characters can have ordinals only up to
>         2**16 - 1. This range corresponds to a range in Unicode known
>         as the Basic Multilingual Plane. There are now characters in
>         Unicode that live on other "planes". The largest addressable
>         character in Unicode has the ordinal 17 * 2**16 - 1
>         (0x10ffff). For readability, we will call this TOPCHAR and
>         call characters in this range "wide characters".
>
>     Glossary
>
>         Character
>
>             Used by itself, means the addressable units of a Python
>             Unicode string.

Please add: also known as "code unit".

>         Code point
>
>             A code point is an integer between 0 and TOPCHAR.
>             If you imagine Unicode as a mapping from integers to
>             characters, each integer is a code point. But the
>             integers between 0 and TOPCHAR that do not map to
>             characters are also code points. Some will someday
>             be used for characters. Some are guaranteed never
>             to be used for characters.
>
>         Codec
>
>             A set of functions for translating between physical
>             encodings (e.g. on disk or coming in from a network)
>             and logical Python objects.
>
>         Encoding
>
>             Mechanism for representing abstract characters in terms
>             of physical bits and bytes. Encodings allow us to store
>             Unicode characters on disk and transmit them over
>             networks in a manner that is compatible with other
>             Unicode software.
>
>         Surrogate pair
>
>             Two physical characters that represent a single logical

Eeek... two code units (or have you ever seen a physical character
walking around ;-)

>             character. Part of a convention for representing 32-bit
>             code points in terms of two 16-bit code points.
>
>         Unicode string
>
>             A Python type representing a sequence of code points with
>             "string semantics" (e.g. case conversions, regular
>             expression compatibility, etc.) Constructed with the
>             unicode() function.
>
>     Proposed Solution
>
>         One solution would be to merely increase the maximum ordinal
>         to a larger value. Unfortunately the only straightforward
>         implementation of this idea is to use 4 bytes per character.
>         This has the effect of doubling the size of most Unicode
>         strings. In order to avoid imposing this cost on every user,
>         Python 2.2 will allow the 4-byte implementation as a
>         build-time option. Users can choose whether they care about
>         wide characters or prefer to preserve memory.
>
>         The 4-byte option is called "wide Py_UNICODE". The 2-byte
>         option is called "narrow Py_UNICODE".
>
>         Most things will behave identically in the wide and narrow
>         worlds.
>
>         * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
>           length-one string.
>
>         * unichr(i) for 2**16 <= i <= TOPCHAR will return a
>           length-one string on wide Python builds. On narrow builds
>           it will raise ValueError.
>
>             ISSUE
>
>                 Python currently allows \U literals that cannot be
>                 represented as a single Python character. It
>                 generates two Python characters known as a
>                 "surrogate pair". Should this be disallowed on
>                 future narrow Python builds?
>
>             Pro:
>
>                 Python already allows the construction of a
>                 surrogate pair for a large unicode literal character
>                 escape sequence. This is basically designed as a
>                 simple way to construct "wide characters" even in a
>                 narrow Python build. It is also somewhat logical
>                 considering that the Unicode-literal syntax is
>                 basically a short-form way of invoking the
>                 unicode-escape codec.
>
>             Con:
>
>                 Surrogates could be easily created this way but the
>                 user still needs to be careful about slicing,
>                 indexing, printing etc. Therefore some have
>                 suggested that Unicode literals should not support
>                 surrogates.
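
To make the surrogate-pair convention in this ISSUE concrete, here is
a minimal sketch (surrogate_pair() is only an illustrative helper,
not something the PEP proposes):

    def surrogate_pair(cp):
        # Split a code point above 0xFFFF into the two 16-bit code
        # units of its UTF-16 surrogate pair.
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        high = 0xD800 + (cp >> 10)     # high (leading) surrogate
        low = 0xDC00 + (cp & 0x3FF)    # low (trailing) surrogate
        return unichr(high) + unichr(low)

    s = u"\U00010000"      # smallest "wide" character
    print len(s)           # 2 on a narrow build, 1 on a wide build
    print repr(surrogate_pair(0x10000))   # u'\ud800\udc00'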
>             ISSUE
>
>                 Should Python allow the construction of characters
>                 that do not correspond to Unicode code points?
>                 Unassigned Unicode code points should obviously be
>                 legal (because they could be assigned at any time).
>                 But code points above TOPCHAR are guaranteed never
>                 to be used by Unicode. Should we allow access to
>                 them anyhow?
>
>             Pro:
>
>                 If a Python user thinks they know what they're doing
>                 why should we try to prevent them from violating the
>                 Unicode spec? After all, we don't stop 8-bit strings
>                 from containing non-ASCII characters.
>
>             Con:
>
>                 Codecs and other Unicode-consuming code will have to
>                 be careful of these characters, which are disallowed
>                 by the Unicode specification.
>
>         * ord() is always the inverse of unichr()
>
>         * There is an integer value in the sys module that describes
>           the largest ordinal for a character in a Unicode string on
>           the current interpreter. sys.maxunicode is 2**16-1
>           (0xffff) on narrow builds of Python and TOPCHAR on wide
>           builds.
>
>             ISSUE: Should there be distinct constants for accessing
>                    TOPCHAR and the real upper bound for the domain
>                    of unichr (if they differ)? There has also been a
>                    suggestion of sys.unicodewidth which can take the
>                    values 'wide' and 'narrow'.
>
>         * every Python Unicode character represents exactly one
>           Unicode code point (i.e. Python Unicode Character =
>           Abstract Unicode character).
>
>         * codecs will be upgraded to support "wide characters"
>           (represented directly in UCS-4, and as variable-length
>           sequences in UTF-8 and UTF-16). This is the main part of
>           the implementation left to be done.
>
>         * There is a convention in the Unicode world for encoding a
>           32-bit code point in terms of two 16-bit code points.
>           These are known as "surrogate pairs". Python's codecs will
>           adopt this convention and encode 32-bit code points as
>           surrogate pairs on narrow Python builds.
>
>             ISSUE
>
>                 Should there be a way to tell codecs not to generate
>                 surrogates and instead treat wide characters as
>                 errors?
>
>             Pro:
>
>                 I might want to write code that works only with
>                 fixed-width characters and does not have to worry
>                 about surrogates.
>
>             Con:
>
>                 No clear proposal of how to communicate this to
>                 codecs.

No need to pass this information to the codec: simply write a new one
and give it a clear name, e.g. "ucs-2" will generate errors while
"utf-16-le" converts them to surrogates.

>         * there are no restrictions on constructing strings that use
>           code points "reserved for surrogates" improperly. These
>           are called "isolated surrogates". The codecs should
>           disallow reading these from files, but you could construct
>           them using string literals or unichr().
>
>     Implementation
>
>         There is a new (experimental) define:
>
>             #define PY_UNICODE_SIZE 2
>
>         There is a new configure option:
>
>             --enable-unicode=ucs2  configures a narrow Py_UNICODE,
>                                    and uses wchar_t if it fits
>             --enable-unicode=ucs4  configures a wide Py_UNICODE,
>                                    and uses wchar_t if it fits
>             --enable-unicode       same as "=ucs2"
>
>         The intention is that --disable-unicode, or
>         --enable-unicode=no removes the Unicode type altogether;
>         this is not yet implemented.
>
>         It is also proposed that one day --enable-unicode will just
>         default to the width of your platform's wchar_t.
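
A short sketch of how code might detect which build it is running
under, using the proposed sys.maxunicode (sys.unicodewidth is only a
suggestion above, so it is not used here):

    import sys

    if sys.maxunicode == 0xffff:
        print "narrow Py_UNICODE build"
    else:
        # sys.maxunicode == 0x10ffff, i.e. TOPCHAR
        print "wide Py_UNICODE build"

    print repr(unichr(0xffff))       # works on both builds
    try:
        print repr(unichr(0x10000))  # length-one string on wide builds
    except ValueError:
        # raised on narrow builds, as described above
        print "wide characters need a wide build (or surrogates)"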
>         Windows builds will be narrow for a while based on the fact
>         that there have been few requests for wide characters, those
>         requests are mostly from hard-core programmers with the
>         ability to buy their own Python, and Windows itself is
>         strongly biased towards 16-bit characters.
>
>     Notes
>
>         This PEP does NOT imply that people using Unicode need to
>         use a 4-byte encoding for their files on disk or sent over
>         the network. It only allows them to do so. For example,
>         ASCII is still a legitimate (7-bit) Unicode-encoding.
>
>         It has been proposed that there should be a module for
>         programmers that handles surrogates in narrow Python builds.
>         If someone wants to implement that, it will be another PEP.
>         It might also be combined with features that allow other
>         kinds of character-, word- and line-based indexing.
>
>     Rejected Suggestions
>
>         More or less the status-quo
>
>             We could officially say that Python characters are
>             16-bit and require programmers to implement wide
>             characters in their application logic by combining
>             surrogate pairs. This is a heavy burden because
>             emulating 32-bit characters is likely to be very
>             inefficient if it is coded entirely in Python. Plus
>             these abstracted pseudo-strings would not be legal as
>             input to the regular expression engine.
>
>         "Space-efficient Unicode" type
>
>             Another class of solution is to use some efficient
>             storage internally but present an abstraction of wide
>             characters to the programmer. Any of these would require
>             a much more complex implementation than the accepted
>             solution. For instance consider the impact on the
>             regular expression engine. In theory, we could move to
>             this implementation in the future without breaking
>             Python code. A future Python could "emulate" wide Python
>             semantics on narrow Python. Guido is not willing to
>             undertake the implementation right now.
>
>         Two types
>
>             We could introduce a 32-bit Unicode type alongside the
>             16-bit type. There is a lot of code that expects there
>             to be only a single Unicode type.
>
>         This PEP represents the least-effort solution. Over the next
>         several years, 32-bit Unicode characters will become more
>         common and that may either convince us that we need a more
>         sophisticated solution or (on the other hand) convince us
>         that simply mandating wide Unicode characters is an
>         appropriate solution. Right now the two options on the table
>         are do nothing or do this.
>
>     References
>
>         Unicode Glossary: http://www.unicode.org/glossary/

Plus perhaps the Mark Davis paper at:

    http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/

>     Copyright
>
>         This document has been placed in the public domain.
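
As a rough illustration of what the surrogate-handling module
mentioned under Notes might offer on narrow builds (the name
code_points() is purely hypothetical; no such module is part of this
PEP):

    def code_points(u):
        # Return the code points of u as a list of integers, joining
        # surrogate pairs into single values on narrow builds.
        result = []
        i, n = 0, len(u)
        while i < n:
            o = ord(u[i])
            if (0xD800 <= o <= 0xDBFF and i + 1 < n
                    and 0xDC00 <= ord(u[i + 1]) <= 0xDFFF):
                # high surrogate followed by low surrogate: combine
                result.append(0x10000 + ((o - 0xD800) << 10)
                              + (ord(u[i + 1]) - 0xDC00))
                i = i + 2
            else:
                result.append(o)
                i = i + 1
        return result

    print code_points(u"a\ud800\udc00b")    # [97, 65536, 98]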
Good work, Paul !

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/