On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:
> This discussion motivated me to start looking into how well Python
> library itself is prepared to deal with len(chr(i)) = 2. I was not

Good idea!

> surprised to find that textwrap does not handle the issue that well:
>
> >>> len(wrap(' \U00010140' * 80, 20))
> 12
> >>> len(wrap(' \U00000140' * 80, 20))
> 8

How well does textwrap handle composable pairs (letter + accent)? Does
it count two code points as one character space and avoid putting line
breaks between them? I suspect textwrap should be regarded as
(extended?)_ascii_textwrap.

> That module should probably be rewritten to properly implement the
> Unicode line breaking algorithm
> <http://unicode.org/reports/tr14/tr14-22.html>.

Probably a good idea.

> Yet finding a bug in a str object method after a 5 min review was a
> bit discouraging:
>
> >>> 'xyz'.center(20, '\U00010140')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long

Again, what does it do with letter + decorator combinations? It seems
to me that the whole notion that one code point == one printed character
space is broken once one leaves ASCII. Perhaps we need an is_uchar
function to recognize multi-code sequences, including surrogate pairs,
that represent one char for the purpose of character-oriented functions.

> Given the apparent difficulty of writing even basic text processing
> algorithms in presence of surrogate pairs, I wonder how wise it is to
> expose Python users to them. As Wikipedia explains, [1]
>
> """
> Because the most commonly used characters are all in the Basic
> Multilingual Plane, converting between surrogate pairs and the
> original values is often not tested thoroughly. This leads to
> persistent bugs, and potential security holes, even in popular and
> well-reviewed application software.
> """

So we did not test thoroughly enough, and we need to add appropriate
unit tests as the bugs are fixed.

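
To make the is_uchar idea a bit more concrete, here is a rough sketch of what such grouping might look like. The names iter_uchars and uchar_len are hypothetical, not anything in the stdlib, and the rule used (a base code point plus any following combining marks, with a high+low surrogate pair kept together) is only an approximation I am assuming for illustration. It is not the full Unicode text segmentation algorithm and says nothing about display width.

```python
import unicodedata


def _is_high_surrogate(ch):
    # Code units U+D800..U+DBFF open a UTF-16 surrogate pair.
    return '\ud800' <= ch <= '\udbff'


def _is_low_surrogate(ch):
    # Code units U+DC00..U+DFFF close a UTF-16 surrogate pair.
    return '\udc00' <= ch <= '\udfff'


def iter_uchars(s):
    """Yield substrings of s that each approximate one displayed character.

    Hypothetical helper: a new item is attached to the current group if it
    is a combining mark (nonzero canonical combining class) or a low
    surrogate that directly follows a high surrogate.
    """
    cluster = ''
    for ch in s:
        joins_previous = bool(cluster) and (
            unicodedata.combining(ch) != 0
            or (_is_low_surrogate(ch) and _is_high_surrogate(cluster[-1]))
        )
        if joins_previous:
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


def uchar_len(s):
    """Count the approximate displayed characters in s (hypothetical)."""
    return sum(1 for _ in iter_uchars(s))


# 'e' followed by COMBINING ACUTE ACCENT: two code points, one group.
print(len('e\u0301'), uchar_len('e\u0301'))            # 2 1
# U+10140 written as an explicit surrogate pair: two items, one group.
print(len('\ud800\udd40'), uchar_len('\ud800\udd40'))  # 2 1
```

Something like str.center or textwrap could then count groups instead of items of the string, though a real fix would need to follow the Unicode rules rather than this shortcut.
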
-- Terry Jan Reedy