Mark Hammond wrote:
> > I think his proposal will go a long way towards your toolkit. I hope
> > to hear soon from anybody who disagrees with Marc-Andre's proposal,
>
> No disagreement as such, but a small hole:
>
> From the proposal:
>
>   Internal Argument Parsing:
>   --------------------------
>   ...
>   's': For Unicode objects: auto convert them to the <default encoding>
>        and return a pointer to the object's <defencbuf> buffer.
>
> --
> Excellent - if someone passes a Unicode object, it can be
> auto-converted to a string. This will allow "open()" to accept
> Unicode strings.

Well, almost... it depends on the current value of <default encoding>.
If it is UTF-8 and you only use plain ASCII characters, the above is
indeed true, but UTF-8 goes far beyond ASCII and can need up to 3 bytes
per character for UCS2 code points (and even more for UCS4 ones). With
<default encoding> set to a more exotic encoding, the conversion is
likely to fail.

> However, there doesn't appear to be a reverse. E.g., if my extension
> module interfaces to a library that uses Unicode natively, how can I
> get a Unicode object when the user passes a string? If I had to
> explicitly check for a string, then check for a Unicode on failure, it
> would get messy pretty quickly... Is it not possible to have "U" also
> do a conversion?

"U" is meant to simplify checks for Unicode objects, much like "S". It
returns a reference to the existing object. Because of that,
auto-conversions are not possible: they would create new objects which
would not get properly garbage collected later on.

Another problem is that the Unicode types differ between platforms
(MS VCLIB uses a 16-bit wchar_t, while GLIBC2 uses a 32-bit wchar_t).
Depending on the internal format of Unicode objects, this could mean
calling different conversion APIs.

BTW, I'm still not too sure about the underlying internal format. The
problem here is that Unicode started out as a 2-byte fixed-length
representation (UCS2) but then shifted towards a 4-byte fixed-length
representation known as UCS4. Since 4 bytes per character is a hard
sell to customers, UTF-16 was created to stuff the UCS4 code points
(the Unicode term for character values) into 2 bytes... using a
variable-length encoding.

Platforms that got into the Unicode business early, such as the MS
ones, use UCS2 as wchar_t, while more recent ones (e.g. glibc2 on
Linux) use UCS4 for wchar_t. I haven't yet checked in what ways the
two are compatible (I would suspect the top bytes of a UCS4 value to
be 0 for UCS2 code points), but I would like to hear whether it
wouldn't be a better idea to use UTF-16 as the internal format. The
latter handles most characters in 2 bytes, and conversion to UCS2|4
should be fast. Still, conversion to UCS2 can fail for code points
outside the 16-bit range.

The downside of using UTF-16: it is a variable-length format, so
iterating over it will be slower than iterating over UCS4.

Simply sticking to UCS2 is probably out of the question, since Unicode
3.0 requires UCS4 and we are targeting Unicode 3.0.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
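
[A minimal C sketch of the 's'/'U' distinction discussed above, written
against today's CPython C API for concreteness. The module and function
names are made up for the example; the comments describe the behaviour
as the proposal states it, not a definitive implementation.]

    #include <Python.h>
    #include <string.h>

    static PyObject *
    takes_string(PyObject *self, PyObject *args)
    {
        const char *buf;

        /* 's': per the proposal, a Unicode argument is auto-converted
         * to the <default encoding> behind the scenes, and buf points
         * into the object's internal <defencbuf> buffer, so the caller
         * must not free it.  If the conversion fails (e.g. because the
         * default encoding cannot represent the characters),
         * PyArg_ParseTuple() returns 0 with an exception set. */
        if (!PyArg_ParseTuple(args, "s", &buf))
            return NULL;
        return PyLong_FromSsize_t((Py_ssize_t)strlen(buf));
    }

    static PyObject *
    takes_unicode(PyObject *self, PyObject *args)
    {
        PyObject *uni;

        /* 'U': no conversion takes place -- a non-Unicode argument
         * simply fails the check.  Only a borrowed reference to the
         * existing object is stored in uni, which is why no new object
         * needs to be garbage collected afterwards. */
        if (!PyArg_ParseTuple(args, "U", &uni))
            return NULL;
        return PyLong_FromSsize_t(PyUnicode_GET_LENGTH(uni));
    }

    static PyMethodDef example_methods[] = {
        {"takes_string",  takes_string,  METH_VARARGS, "uses 's'"},
        {"takes_unicode", takes_unicode, METH_VARARGS, "uses 'U'"},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef example_module = {
        PyModuleDef_HEAD_INIT, "example", NULL, -1, example_methods
    };

    PyMODINIT_FUNC
    PyInit_example(void)
    {
        return PyModule_Create(&example_module);
    }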
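[On the UTF-16 point: the surrogate mechanism from Unicode 2.0 is what
makes the 2-byte "stuffing" work. A small, self-contained sketch of the
arithmetic; the helper name is made up:]

    #include <stdio.h>

    /* Encode one code point as UTF-16.  Code points <= 0xFFFF are
     * stored as a single 16-bit unit and are bit-identical to their
     * UCS2/UCS4 values (i.e. the top UCS4 bytes are 0); anything above
     * is split into a high/low surrogate pair. */
    static int
    utf16_encode(unsigned long cp, unsigned short out[2])
    {
        if (cp <= 0xFFFF) {             /* BMP: same value as in UCS2/UCS4 */
            out[0] = (unsigned short)cp;
            return 1;                   /* one 16-bit unit */
        }
        cp -= 0x10000;                  /* 20 bits left to distribute */
        out[0] = (unsigned short)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;                       /* variable length: two units */
    }

    int main(void)
    {
        unsigned short u[2];
        int n = utf16_encode(0x10300UL, u);   /* a code point beyond UCS2 */
        printf("%d unit(s): %04X %04X\n", n, u[0], n == 2 ? u[1] : 0);
        return 0;
    }

[Since BMP code points pass through unchanged, a UCS2 string is the
UCS4 one with the zero top bytes dropped, which is the compatibility
suspected above; the price is exactly the variable length that makes
iteration slower than over UCS4.]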