On Wed, Nov 24, 2010 at 1:50 PM, M.-A. Lemburg <mal at egenix.com> wrote:
..
>> add an option for decoders that currently produce surrogate pairs to
>> treat non-BMP characters as errors and handle them according to the
>> user's choice.
>
> But what do you gain by doing this ? You'd lose the round-trip
> safety of those codecs and that's not a good thing.
>

Any non-trivial text processing is likely to be broken in the presence
of surrogates.  Producing them on input just trades a known issue for
an unknown one.  Processing surrogate pairs in Python code is hard.
Software that has to support non-BMP characters will most likely be
written for a wide build and contain subtle bugs when run under a
narrow build.

Note that my latest proposal does not abolish surrogates outright.
Users who want them can still use something like a "surrogateescape"
error handler for non-BMP characters.

> Since we're not going to change the semantics of those APIs,
> it is OK to not support padding with non-BMP code points on
> UCS-2 builds.
>

Well, I think more users are willing to accept slightly misaligned text
in their web-app logs than are willing to cope with

Traceback (most recent call last):
  ...
TypeError: The fill character must be exactly one character long

there.  Yes, letting untrusted users specify the fill character is an
unlikely scenario, but it is quite likely that naive slicing of, or
iteration over, string units would result in

Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

> Supporting such cases would only cause problems:
>
> * if the methods would pad with surrogates, the resulting
>   string would no longer have length n; breaking the
>   assumption that len(str.center(n)) == n
>

I agree, but how is this different from breaking the assumption that
len(chr(i)) == 1?

> * if the methods would pad with half the number of surrogates
>   to make sure that len(str.center(n)) == n, the resulting
>   output to e.g. a terminal would be further off than what
>   you already have with surrogates and combining code points
>   in the original string.
>

I agree again.  As I suggested on the tracker, supporting non-BMP
characters in narrow builds should mean that library functions, given
inputs with the same UCS-4 interpretation, produce outputs with the
same UCS-4 interpretation.

> Perhaps it's time to reconsider a project I once started
> but that never got off the ground:
>
>    http://mail.python.org/pipermail/python-dev/2008-July/080911.html
>
> Here's the pre-PEP:
>
>    http://mail.python.org/pipermail/python-dev/2001-July/015938.html

I agree again, but I feel that exposing code units rather than code
points at the Python string level takes us back to the 2.x days of
mixing bytes and strings.  Let me quote Guido circa 2001 again:

"""
... if we had wanted to use a variable-length internal representation,
we should have picked UTF-8 way back, like Perl did.  Moving to a
UTF-16-based internal representation now will give us all the problems
of the Perl choice without any of the benefits.
"""

I don't understand what has changed since 2001 to make this argument
invalid.

I note that the opinion has been raised on this thread that if we want
a compressed internal representation for strings, we should use UTF-8.
I tend to agree, but UTF-8 has been repeatedly rejected as too hard to
implement.  What makes UTF-16 easier than UTF-8?  Only the fact that
you can ignore bugs longer, in my view.
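
P.S. To make the slicing problem above concrete, here is a sketch of
the kind of session I have in mind on a narrow (UCS-2) build; on a
wide (UCS-4) build len(s) is 1 and the character round-trips cleanly:

>>> s = '\U00010000'      # a non-BMP character
>>> len(s)                # stored as a surrogate pair on a narrow build
2
>>> s[0]                  # naive slicing yields a lone surrogate
'\ud800'
>>> s[0].encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed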
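
The len(chr(i)) == 1 assumption breaks the same way on a narrow build,
and it is exactly what triggers the fill-character error quoted above:

>>> c = chr(0x10000)      # a surrogate pair on a narrow build
>>> len(c)
2
>>> 'abc'.center(7, c)    # the fill "character" is two code units long
Traceback (most recent call last):
  ...
TypeError: The fill character must be exactly one character long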