Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> >...
> > > Character
> > >
> > >     Used by itself, means the addressable units of a Python
> > >     Unicode string.
> >
> > Please add: also known as "code unit".
>
> I'm not entirely comfortable with that. As you yourself pointed out, the
> same Python Unicode object can be interpreted as either a series of
> single-width code points *or* as a UTF-16 string where the characters
> are code units. You could also interpret it as a BASE64'd region or an
> XML document... It all depends on how you look at it.

Well, that's what code unit tries to capture too: it's the basic
storage unit used by the implementation for storing characters.
Never mind, it's just a detail...

> > ....
> > > Surrogate pair
> > >
> > >     Two physical characters that represent a single logical
> >
> > Eeek... two code units (or have you ever seen a physical character
> > walking around ;-)
>
> No, that's sort of my point. The user can decide to adopt the convention
> of looking at the two characters as code units or they can ignore that
> interpretation and look at them as two code points. It's all relative,
> man. Dig it? That's why I use the word "convention" below:

Ok.

> > >     character. Part of a convention for representing 32-bit
> > >     code points in terms of two 16-bit code points.
>
> "Surrogates are all in your head. Python doesn't know or care about
> them!"
>
> I'll change this to:
>
> Surrogate pair
>
>     Two Python Unicode characters that represent a single logical
>     Unicode code point. Part of a convention for representing
>     32-bit code points in terms of two 16-bit code points. Python
>     has limited support for reading, writing and constructing strings
>     that use this convention (described below). Otherwise Python
>     ignores the convention.

Good.

> > No need to pass this information to the codec: simply write
> > a new one and give it a clear name, e.g. "ucs-2" will generate
> > errors while "utf-16-le" converts them to surrogates.
>
> That's a good point, but what if I want a UTF-8 codec that doesn't
> generate surrogates? Or even a UCS4 one?

With Walter's patch for callback error handlers, you should be able
to provide handlers which implement whatever you see fit.

I think that codecs should work the same on all platforms and
always apply the needed conversion for the platform in question;
could be wrong though... it's really only a minor issue.

> > Plus perhaps the Mark Davis paper at:
> >
> > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
>
> Okay.
>
> > > Copyright
> > >
> > >     This document has been placed in the public domain.
> >
> > Good work, Paul !
>
> Thanks for your help. You did help me to clarify many things even though
> I argued with you as I was doing it.

Thank you for taking the suggestions into account.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:                                        http://www.lemburg.com/
Python Pages:                             http://www.lemburg.com/python/
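
For readers following the thread, here is a minimal sketch in present-day
Python 3 (not the code under discussion in this message) of the two ideas
above: the surrogate-pair convention for writing one code point above
U+FFFF as two 16-bit values, and a callback error handler for a codec. It
assumes the callback mechanism mentioned corresponds to what the standard
library now exposes as codecs.register_error; the helper names and the
handler name "example.codepoint" are hypothetical and chosen only for
illustration.

```python
import codecs


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_point_notation(error):
    """Hypothetical encode-error callback: spell out unencodable characters."""
    if not isinstance(error, UnicodeEncodeError):
        raise error                         # this sketch handles encoding only
    bad = error.object[error.start:error.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    return replacement, error.end           # resume after the bad span


# Register the callback under an illustrative (made-up) name.
codecs.register_error("example.codepoint", code_point_notation)

if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)
    print(hex(high), hex(low))
    print("G clef: \U0001D11E".encode("ascii", errors="example.codepoint"))
```

Whether the two 16-bit values count as "characters", "code units" or
"code points" is exactly the naming question debated above; the arithmetic
is the same either way.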