On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <guido at python.org> wrote:
>> Now I am happy to admit that for many Unicode issues the level at
>> which we have currently defined things (code units, I think -- the
>> thingies that encodings are made of) is confusing, and it would be
>> better to switch to the others (code points, I think). But characters
>> are right out.
>
> Indeed, code points are the abstract concept and code units are the
> specific byte sequences that are used for serialisation (FWIW, I'm
> going to try to keep this straight in the future by remembering that
> the Unicode character set is defined as abstract points on planes,
> just like geometry).

Hm, code points still look pretty concrete to me (integers in the
range 0 .. 2**21) and code units don't feel like byte sequences to me
(at least not UTF-16 code units -- in Python at least you can think of
them as integers in the range 0 .. 2**16).

> With narrow builds, code units can currently come into play
> internally, but with PEP 393 everything internal will be working
> directly with code points. Normalisation, combining characters and
> bidi issues may still affect the correctness of unicode comparison and
> slicing (and other text manipulation), but there are limits to how
> much of the underlying complexity we can effectively hide without
> being misleading.

Let's just define a Unicode string to be a sequence of code points and
let libraries deal with the rest. Ok, methods like lower() should
consider characters, but indexing/slicing should refer to code points.
Same for '=='; we can have a library that compares by applying (or
assuming?) certain normalizations. Tom C tells me that case-less
comparison cannot use a.lower() == b.lower(); fine, we can add that
operation to the library too. But this exceeds the scope of PEP 393,
right?

-- 
--Guido van Rossum (python.org/~guido)
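
For readers following along, here is a minimal sketch of the distinctions discussed above, assuming Python 3.3 or later (that is, a build with PEP 393 in place). The helper names utf16_code_units and caseless_equal are hypothetical and made up for illustration; they are not part of any library mentioned in the thread. The caseless comparison relies on str.casefold(), which was only added in Python 3.3, after this message was written, and the normalize/casefold/normalize recipe is one common approach rather than the operation the thread eventually settled on.

```python
import unicodedata


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers (hypothetical helper)."""
    data = s.encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def caseless_equal(a, b):
    """Compare two strings ignoring case and canonical-equivalence differences.

    Hypothetical helper: NFD-normalize, casefold, then NFD-normalize again.
    """
    def key(s):
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())
    return key(a) == key(b)


# One BMP character plus one character outside the BMP (U+1F600).
s = "a\U0001F600"

# Indexing and len() work on code points: two of them here.
code_points = [ord(c) for c in s]   # [0x61, 0x1F600]

# The same string needs three UTF-16 code units, because U+1F600
# is encoded as the surrogate pair 0xD83D 0xDE00.
units = utf16_code_units(s)         # [0x61, 0xD83D, 0xDE00]

# '==' compares code point sequences, so canonically equivalent
# strings can still compare unequal without normalization.
composed = "\u00e9"      # e with acute, single code point
decomposed = "e\u0301"   # 'e' followed by combining acute accent
plain_equal = composed == decomposed                        # False
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))   # True

# a.lower() == b.lower() is not a reliable caseless comparison.
lower_equal = "Straße".lower() == "STRASSE".lower()         # False
fold_equal = caseless_equal("Straße", "STRASSE")            # True
```

On a narrow build of Python 3.2 or earlier, len(s) for the string above would have reported three rather than two, which is the code unit leakage that PEP 393 removes.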