On 24/08/2011 02:46, Terry Reedy wrote:
> On 8/23/2011 9:21 AM, Victor Stinner wrote:
>> On 23/08/2011 15:06, "Martin v. Löwis" wrote:
>>> Well, things have to be done in order:
>>> 1. the PEP needs to be approved
>>> 2. the performance bottlenecks need to be identified
>>> 3. optimizations should be applied.
>>
>> I would not vote for the PEP if it slows down Python, especially if
>> it's much slower. But Torsten says that it speeds up Python, which
>> is surprising. I have to do my own benchmarks :-)
>
> The current UCS2 Unicode string implementation, by design, quickly
> gives WRONG answers for len(), iteration, indexing, and slicing if a
> string contains any non-BMP (surrogate pair) Unicode characters. That
> may have been excusable when there essentially were no such extended
> chars, and the few there were were almost never used. But now there
> are many more, with more being added to each Unicode edition. They
> include cursive Math letters that are used in English documents
> today. The problem will slowly get worse and Python, at least on
> Windows, will become a language to avoid for dependable Unicode
> document processing. 3.x needs a proper Unicode implementation that
> works for all strings on all builds.

I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two
characters. Ezio recently fixed the str.is*() methods in Python 3.2+.

For len(str): it's a known problem, but if you really care about the
number of *characters* rather than the number of UTF-16 units, it's
easy to implement your own character_length() function. len(str)
returns the number of UTF-16 units instead of the number of characters
for a simple reason: it's faster: O(1), whereas character_length() is
O(n).

> utf16.py, attached to http://bugs.python.org/issue12729
> prototypes a different solution than the PEP for the above problems
> for the 'mostly BMP' case.

I will discuss it in a different post.

Yeah, you can work around the UTF-16 limits using O(n) algorithms.
PEP 393 provides support for the full Unicode charset
(U+0000-U+10FFFF) on all platforms with a small memory footprint and
only O(1) functions.

Note: Java and the Qt library also use UTF-16 strings and have exactly
the same "limitations" for str[n] and len(str).

Victor
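For readers wondering what the character_length() function mentioned
above might look like: here is a minimal sketch (my own illustration,
not from the thread; only the function name comes from the post). It
makes one O(n) pass over the string and counts a UTF-16 surrogate pair
as a single character, which is what a narrow build would need:

```python
def character_length(s):
    """Count Unicode characters (code points) in a string.

    On a narrow (UTF-16) build, a non-BMP character is stored as a
    high surrogate (U+D800-U+DBFF) followed by a low surrogate
    (U+DC00-U+DFFF), so len(s) counts UTF-16 units, not characters.
    This walks the string once -- O(n) -- and skips the low surrogate
    that completes a pair, counting the pair as one character.
    """
    count = 0
    prev_was_high = False
    for ch in s:
        o = ord(ch)
        if prev_was_high and 0xDC00 <= o <= 0xDFFF:
            # Second half of a surrogate pair: already counted.
            prev_was_high = False
            continue
        prev_was_high = 0xD800 <= o <= 0xDBFF
        count += 1
    return count
```

For example, the surrogate pair '\ud800\udc00' (U+10000 on a narrow
build) has len() == 2 but character_length() == 1, while a plain BMP
string gives the same answer from both.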