Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> >...
> >
> > The term "character" in Python should really only be used for
> > the 8-bit strings.
>
> Are we going to change chr() and unichr() to one_element_string() and
> unicode_one_element_string()

No. I am just suggesting to make use of the crispy clear definitions
which the Unicode Consortium has developed for us.

> u[i] is a character. If u is Unicode, then u[i] is a Python Unicode
> character. No Python user will find that confusing no matter how Unicode
> knuckle-dragging, mouth-breathing, wife-by-hair-dragging they are.

Except that u[i] maps to a code unit which may or may not be a
code point. Whether a code point matches a grapheme (this is what
users tend to regard as character) is yet another story due to
combining code points.

> > In Unicode a "character" can mean any of:
>
> Mark Davis said that "people" can use the word to mean any of those
> things. He did not say that it was imprecisely defined in Unicode.
> Nevertheless I'm not using the Unicode definition anymore than our
> standard library uses an ancient Greek definition of integer. Python has
> a concept of integer and a concept of character.

Ok, I'll stop whining. Just as final remark, let me say that our
little discussion is a perfect example of how people can misunderstand
each other by using the terms in different ways (Kant tried to
solve this for Philosophy and did not succeed; so I guess the
Unicode Consortium doesn't stand a chance either ;-)

> > > It has been proposed that there should be a module for working
> > > with UTF-16 strings in narrow Python builds through some sort of
> > > abstraction that handles surrogates for you. If someone wants
> > > to implement that, it will be another PEP.
> >
> > Uhm, narrow builds don't support UTF-16... it's UCS-2 which
> > is supported (basically: store everything in range(0x10000));
> > the codecs can map code points to surrogates, but it is solely
> > their responsibility and the responsibility of the application
> > using them to take care of dealing with surrogates.
>
> The user can view the data as UCS-2, UTF-16, Base64, ROT-13, XML, ....
> Just as we have a base64 module, we could have a UTF-16 module that
> interprets the data in the string as UTF-16 and does surrogate
> manipulation for you.
>
> Anyhow, if any of those is the "real" encoding of the data, it is
> UTF-16. After all, if the codec reads in four non-BMP characters in,
> let's say, UTF-8, we represent them as 8 narrow-build Python characters.
> That's the definition of UTF-16! But it's easy enough for me to take
> that word out so I will.

u[i] gives you a code unit and whether this maps to a code point
or not is dependent on the implementation which in turn depends
on the narrow/wide choice.

In UCS-2, I believe, surrogates are regarded as two code points;
in UTF-16 they always have to come in pairs. There's a semantic
difference here which is for the codecs and these additional
tools to be aware of -- not the Unicode type implementation.

> >...
> > Also, the module will be useful for both narrow and wide builds,
> > since the notion of an encoded character can involve multiple code
> > points. In that sense Unicode is always a variable length
> > encoding for characters and that's the application field of
> > this module.
>
> I wouldn't advise that you do all different types of normalization in a
> single module but I'll wait for your PEP.

I'll see if I find some time at the Bordeaux Python Meeting
next week.

> > Here's the adjusted text:
> >
> >     It has been proposed that there should be a module for working
> >     with Unicode objects using character-, word- and line- based
> >     indexing. The details of the implementation is left to
> >     another PEP.
>
>     It has been proposed that there should be a module that handles
>     surrogates in narrow Python builds for programmers. If someone
>     wants to implement that, it will be another PEP. It might also be
>     combined with features that allow other kinds of character-,
>     word- and line- based indexing.

Hmm, I liked my version better, but what the heck ;-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/
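
The code unit / code point / grapheme distinction argued over in this
message can be made concrete with a small sketch. It assumes a modern
Python 3 interpreter (3.3 or later), where the narrow/wide build choice
no longer exists and u[i] always yields a code point, so the UTF-16 code
units are obtained by encoding explicitly. The helper names
utf16_code_units and count_code_points are hypothetical and are not part
of any module proposed in the thread.

```python
import unicodedata


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a well-formed UTF-16 code unit sequence.

    Assumption: surrogates always come in pairs, so every code unit
    that is not a low (trailing) surrogate starts a new code point.
    """
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


# One non-BMP code point becomes two UTF-16 code units (a surrogate pair).
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
clef_units = utf16_code_units(clef)
print([hex(u) for u in clef_units])   # ['0xd834', '0xdd1e']
print(count_code_points(clef_units))  # 1

# Two code points (base letter + combining accent) form one grapheme.
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

On a narrow build of the era, indexing the G clef string would have
returned the individual surrogates 0xD834 and 0xDD1E; the sketch only
reproduces that view by encoding, it does not emulate a narrow build.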