On 2008-07-03 19:21, Adam Olsen wrote:
> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>> On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
>>> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>>>> Unicode is full of combining code points - if you break such a sequence,
>>>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>>> In my opinion you are confusing two related, but very separated things
>>> here.
>>> Combining characters have nothing to do with breaking up the encoding of a
>>> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
>>> consist of combining characters then your result is indeed odd looking.
>>>
>>> I never said that nor is that the point I am making.
>> Please remember that lone surrogate pair code points are perfectly
>> valid Unicode code points, nevertheless. Just as a lone combining
>> code point is valid on its own.
>
> That is a big part of these problems. For all practical purposes, a
> surrogate is like a UTF-8 code unit, and must be handled the same way,
> so why the heck do they confuse everybody by saying "oh, it's a code
> point too!"?

You have to take that up with the Unicode consortium :-)

It would have been better not to add surrogates to the standard
at all. To be fair, I don't think that anybody seriously assumed
at the time that more than 16 bits would be needed.

In practice, you do need to be able to build Unicode strings
that contain half a surrogate (i.e. a single code point) or a
combining code point without its anchor code point, so trying
to be smart about detecting surrogates is going to create more
confusion than good, e.g.

    >>> x1 = u'\udbc0'
    >>> x2 = u'\udc00'
    >>> x1
    u'\udbc0'
    >>> x2
    u'\udc00'
    >>> len(x1)
    1
    >>> len(x2)
    1

Having len(x1+x2) == 1 wouldn't be right and would break all sorts
of assumptions you normally make about string concatenation.
Which is why len(x1+x2) gives 2 in both UCS2 and UCS4 builds.

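
For readers on a current interpreter, here is a minimal sketch of the
same point, assuming Python 3.4 or later (the session above is Python 2).
Since Python 3.3 there is no UCS2/UCS4 build distinction any more, so the
helper utf16_units below is a hypothetical stand-in for what len() used
to report on a UCS2 build; it is not part of the original post.

```python
# Sketch for Python 3.4+ (assumption: not the Python 2 build discussed above).
x1 = '\udbc0'  # lone high surrogate
x2 = '\udc00'  # lone low surrogate

pair = x1 + x2
assert len(x1) == 1 and len(x2) == 1
assert len(pair) == 2  # concatenation does not merge the two halves

# Only a trip through a UTF-16 codec combines the halves into one code point.
# 'surrogatepass' lets the encoder accept lone surrogates.
combined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert combined == '\U00100000'
assert len(combined) == 1


def utf16_units(s):
    """Hypothetical helper: count 16-bit code units in s,
    i.e. the length a UCS2 build would have reported."""
    return len(s.encode('utf-16-le', 'surrogatepass')) // 2


print(len(pair), len(combined), utf16_units(combined))
```

The string keeps length 2 after concatenation; the codec, not the string
type, is what knows how to pair surrogates.
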
The fact that u'\U00100000' can map to a length 1 Unicode string
in UCS4 builds and a length 2 string in UCS2 builds is merely
due to the fact that the unicode-escape codec (which converts
the escaped string literal to a Unicode object) does know about
surrogates and, on UCS2 builds, uses a surrogate pair instead of
raising an exception.

Programmers need to be aware of this fact, that's all... just like
they need to be aware of differences between integer and float
division, different behavior of classic and new-style classes, etc.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2008-07-07: EuroPython 2008, Vilnius, Lithuania             3 days to go

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611