On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:
> Terry Reedy writes:
>
> > Excuse me for believing the fine 3.2 manual that says
> > "Strings contain Unicode characters."
>
> The manual is wrong, then, subject to a pronouncement to the contrary,

Please suggest a re-wording then, as it is a bug for doc and behavior to
disagree.

> > For the purpose of my sentence, the same thing in that code points
> > correspond to characters,
>
> Not in Unicode, they do not. By definition, a small number of code
> points (eg, U+FFFF) *never* did and *never* will correspond to
> characters.

On computers, characters are represented by code points. What about the
other way around? http://www.unicode.org/glossary/#C says

code point: 1) i in range(0x110000) <broad definition>
            2) "A value, or position, for a character" <narrow definition>

(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.

> > Any narrow build string with even 1 non-BMP char violates the
> > standard.
>
> Yup. That's by design. [...]

> Sure. Nevertheless, practicality beat purity long ago, and that
> decision has never been rescinded AFAIK.

I think you have it backwards. I see the current situation as the purity
of the C code beating the practicality for the user of getting right
answers.

> The thing is, that 90% of applications are not really going to care
> about full conformance to the Unicode standard.

I remember when Intel argued that 99% of applications were not going to
be affected when the math coprocessor in its then new chips occasionally
gave 'non-standard' answers with certain divisors.

> > Currently, the meaning of Python code differs on narrow versus wide
> > build, and in a way that few users would expect or want.
>
> Let them become developers, then, and show us how to do it better.

I posted a proposal with a link to a prototype implementation in Python.
It pretty well solves the problem of narrow builds acting differently
from wide builds with respect to the basic operations of len(),
iteration, indexing, and slicing.

> No, I do like the PEP. However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model. In particular, it does nothing to resolve the fact
> that len() will give different answers for character count depending
> on normalization, and that slicing and indexing will allow you to cut
> characters in half (even in NFC, since not all composed characters
> have fully composed forms).

I believe my scheme could be extended to solve that also. It would
require more pre-processing and more knowledge than I currently have of
normalization. I have the impression that the grapheme problem goes
further than just normalization.

-- 
Terry Jan Reedy
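
A minimal sketch of the narrow/wide difference discussed above. It assumes
a pre-3.3 CPython, where a narrow build (sys.maxunicode == 0xFFFF) stores a
non-BMP character as a UTF-16 surrogate pair, so len(), indexing, and
iteration see two code units where a wide build sees one character:

    import sys

    s = '\U0001D11E'   # MUSICAL SYMBOL G CLEF, a single non-BMP character

    if sys.maxunicode == 0xFFFF:      # narrow build (pre-3.3)
        assert len(s) == 2            # stored as a surrogate pair
        assert s[0] == '\ud834'       # high surrogate
        assert s[1] == '\udd1e'       # low surrogate
    else:                             # wide build, or Python 3.3+
        assert len(s) == 1
        assert s[0] == s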
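
The prototype itself is not reproduced here; the following is only a rough
sketch of the general idea as described above (the name WideView and the
details are illustrative, not taken from the prototype): record once where
the supplementary characters sit, then translate character indexes to
code-unit indexes so len() and indexing behave as on a wide build.

    from bisect import bisect_right

    class WideView:
        """Character-based len() and indexing over a narrow-build str."""

        def __init__(self, s):
            self._s = s
            # character indexes at which supplementary (non-BMP) chars occur
            self._astral = []
            char_index = unit_index = 0
            while unit_index < len(s):
                if 0xD800 <= ord(s[unit_index]) <= 0xDBFF:  # high surrogate
                    self._astral.append(char_index)
                    unit_index += 2     # skip the whole surrogate pair
                else:
                    unit_index += 1
                char_index += 1
            self._length = char_index

        def __len__(self):
            return self._length

        def __getitem__(self, index):
            if not 0 <= index < self._length:
                raise IndexError(index)
            # code-unit offset = character index + supplementary chars before it
            offset = index + bisect_right(self._astral, index - 1)
            c = self._s[offset]
            if 0xD800 <= ord(c) <= 0xDBFF:                  # astral character
                return self._s[offset:offset + 2]
            return c

    # On a narrow build, len('a\U0001D11Eb') == 4, but through the view:
    assert len(WideView('a\ud834\udd1eb')) == 3
    assert WideView('a\ud834\udd1eb')[1] == '\ud834\udd1e'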
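
And a small illustration of the normalization point: the same
user-perceived character can be one code point in NFC and two in NFD, so
len() and slicing give different answers depending on the normalization
form.

    import unicodedata

    nfc = unicodedata.normalize('NFC', 'e\u0301')   # 'e' with acute as U+00E9
    nfd = unicodedata.normalize('NFD', 'e\u0301')   # 'e' + U+0301 combining acute

    assert nfc == '\xe9' and len(nfc) == 1
    assert len(nfd) == 2
    # Slicing the NFD form separates the base letter from its combining mark:
    assert nfd[:1] == 'e'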