On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:
>> Unicode is full of combining code points - if you break such a sequence,
>> the output will be just as wrong; regardless of UCS2 vs. UCS4.
>
> In my opinion you are confusing two related, but very separated things here.
> Combining characters have nothing to do with breaking up the encoding of a
> single codepoint. Sure enough, if you arbitrarily slice up codepoints that
> consist of combining characters then your result is indeed odd looking.
>
> I never said that nor is that the point I am making.

Please remember, though, that lone surrogate code points are perfectly
valid Unicode code points, just as a lone combining code point is valid
on its own.

> Guido points out that Python supports surrogate pairs and says that if
> Python is dealing wrongly with this in the core then it needs to be fixed.
> I am pointing out that given the fact we allow surrogate pairs we deal
> rather simplistically with it in the core. In fact, we do not consider them
> at all. In essence: though we may accept full 21-bit codepoints in the form
> of \U00000000 escape sequences and store them internally as UTF-16 (which I
> still need to verify) we subsequently deal with them programmatically as
> UCS-2, which is plain silly.

Python applies conversion from non-BMP code points to surrogates for UCS2
builds in a few places, and I agree that we should probably do that in a
few more places. However, these are mainly conversion issues of encoded
Unicode representations vs. the internal Unicode storage, where you want
to avoid exceptions in favor of finding a solution that preserves data.

To make it clear: UCS2 builds of Python do not support non-BMP code points
out of the box. A programmer will always have to use a codec to map the
internal storage on these builds to the full Unicode code point range.

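As an illustration of the mapping discussed above, here is a minimal sketch of how a non-BMP code point corresponds to a UTF-16 surrogate pair. It is written for current Python 3, where (since version 3.3) the UCS2/UCS4 build distinction no longer exists, so it only shows the arithmetic and what the UTF-16 codec produces, not the behaviour of a 2008-era UCS2 build. The helper names `to_surrogate_pair` and `utf16_code_units` are hypothetical and not part of Python.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units the UTF-16 codec emits for text."""
    data = text.encode("utf-16-be")        # big-endian, no BOM
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


clef = chr(0x1D11E)                        # MUSICAL SYMBOL G CLEF, non-BMP
pair = to_surrogate_pair(0x1D11E)
units = utf16_code_units(clef)

# The codec and the manual arithmetic agree on the two code units.
print([hex(u) for u in units], [hex(p) for p in pair])
```

Decoding the two code units with the UTF-16 codec yields the original single code point again, which is the codec round trip the message refers to.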
The following codecs support surrogates on UCS2 builds:

 * UTF-8
 * UTF-16
 * UTF-32
 * unicode-escape
 * raw-unicode-escape

> You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
> some form in-between, because that will ultimately lead to more confusion
> due to the difference in results when dealing with Unicode.

Programmers will have to be aware of the fact that on UCS2 builds of
Python non-BMP code points will have to be treated differently than on
UCS4 builds.

I don't see that as a problem. It is in a way similar to 32-bit vs.
64-bit builds of Python or the fact that floating point numbers work
differently depending on the Python platform or compiler being used.

BTW: Have you ever run into any problems with UCS2 vs. UCS4 in practice
that were not easy to solve?

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2008-07-07: EuroPython 2008, Vilnius, Lithuania                3 days to go

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611