Guido van Rossum wrote:
>
> > First, a short one, Mark Hammond's patch for supporting MBCS on
> > Windows.  I trust everyone can handle a little bit of TeX markup?
> >
> > % XXX is this explanation correct?
> > \item When presented with a Unicode filename on Windows, Python will
> > now correctly convert it to a string using the MBCS encoding.
> > Filenames on Windows are a case where Python's choice of ASCII as
> > the default encoding turns out to be an annoyance.
> >
> > This patch also adds \samp{et} as a format sequence to
> > \cfunction{PyArg_ParseTuple}; \samp{et} takes both a parameter and
> > an encoding name, and converts it to the given encoding if the
> > parameter turns out to be a Unicode string, or leaves it alone if
> > it's an 8-bit string, assuming it to already be in the desired
> > encoding.  (This differs from the \samp{es} format character, which
> > assumes that 8-bit strings are in Python's default ASCII encoding
> > and converts them to the specified new encoding.)
> >
> > (Contributed by Mark Hammond with assistance from Marc-Andr\'e
> > Lemburg.)
>
> I learned something here, so I hope this is correct. :-)

The last part is... the rest is for Mark to comment on.

> > Second, the --enable-unicode changes:
> >
> > %======================================================================
> > \section{Unicode Changes}
> >
> > Python's Unicode support has been enhanced a bit in 2.2.  Unicode
> > strings are usually stored as UCS-2, as 16-bit unsigned integers.
> > Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
> > integers, as its internal encoding by supplying
> > \longprogramopt{enable-unicode=ucs4} to the configure script.  When
> > built to use UCS-4, in theory Python could handle Unicode characters
> > from U-00000000 to U-7FFFFFFF.
>
> I think the Unicode folks use U+, not U-,

True.

> and the largest Unicode
> character is "only" U+10FFFF.  (Never mind that the data type can
> handle larger values.)

I wouldn't count on that...
(note that Andrew wrote "could" ;-)

> > Being able to use UCS-4 internally is
> > a necessary step to do that, but it's not the only step, and in Python
> > 2.2alpha1 the work isn't complete yet.  For example, the
> > \function{unichr()} function still only accepts values from 0 to
> > 65535,
>
> Untrue: it supports range(0x110000) (in UCS-2 mode this returns a
> surrogate pair).  Now, maybe that's not what it *should* do...

It should definitely not, unless you want to break code which
assumes that chr() and unichr() always return a single byte/code unit !

This was part of the UCS-4 checkins which I hadn't had time yet
to review. Should I remove the surrogate part for narrow builds ?

> > and there's no \code{\e U} notation for embedding characters
> > greater than 65535 in a Unicode string literal.
>
> Not true either -- correct \U has been part of Python since 2.0.  It
> does the same thing as unichr() described above.

Right. Note that in this case, the handling of surrogates is needed
to make the unicode-escape encoding roundtrip safe.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/
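
For reference, the surrogate-pair arithmetic discussed above (what unichr() would have to produce on a narrow build for values above 65535) can be sketched as follows. This is only an illustration written for present-day Python 3, not the 2.2 code under review; the helper names to_surrogate_pair and from_surrogate_pair are hypothetical and do not come from the patch. The last lines check the unicode-escape round trip for a character above U+FFFF.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any Python API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10400)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x10400).encode("utf-16-be")
assert struct.unpack(">HH", encoded) == (high, low)

# The unicode-escape codec writes the character as a \U escape and
# decodes it back to the same string.
text = chr(0x10400)
escaped = text.encode("unicode_escape")
assert escaped.decode("unicode_escape") == text
```

On a UCS-2 (narrow) build the two code units would be stored separately, which is why code that assumes unichr() returns exactly one code unit can break.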