On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:

> Any non-trivial text processing is likely to be broken in presence of
> surrogates. Producing them on input is just trading known issue for
> an unknown one. Processing surrogate pairs in python code is hard.
> Software that has to support non-BMP characters will most likely be
> written for a wide build and contain subtle bugs when run under a
> narrow build. Note that my latest proposal does not abolish
> surrogates outright. Users who want them can still use something like
> "surrogateescape" error handler for non-BMP characters.

It seems to me that what you are asking for is an alternate, optional,
utf-8-bmp codec that would raise an error, in either direction, for
non-bmp chars. Then, as you suggest, if one is not prepared for
surrogates, they are not allowed.

--
Terry Jan Reedy
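
For illustration only, here is a minimal sketch of what a strict "utf-8-bmp" codec along those lines might look like. It is not an existing codec or anything proposed in this thread: the function names `encode_bmp` and `decode_bmp` are hypothetical, only strict error handling is covered, and the code assumes a Python where `str` is indexed by code point (3.3 or later), so that a non-BMP character shows up as a single item with `ord(ch) > 0xFFFF`.

```python
# Hypothetical sketch of a BMP-only UTF-8 codec; the names below are
# illustrative and do not exist in the standard library.

CODEC_NAME = "utf-8-bmp"      # name taken from the message above
BMP_MAX = 0xFFFF              # last code point of the Basic Multilingual Plane


def encode_bmp(text):
    """Encode text as UTF-8, raising UnicodeEncodeError on non-BMP chars."""
    for index, ch in enumerate(text):
        if ord(ch) > BMP_MAX:
            raise UnicodeEncodeError(
                CODEC_NAME, text, index, index + 1,
                "non-BMP character not allowed",
            )
    # Lone surrogates are still rejected here by the strict UTF-8 encoder.
    return text.encode("utf-8")


def decode_bmp(data):
    """Decode UTF-8 bytes, raising UnicodeDecodeError on non-BMP chars."""
    raw = bytes(data)
    text = raw.decode("utf-8")  # strict: malformed input fails first
    # In valid UTF-8, a byte >= 0xF0 only occurs as the lead byte of a
    # four-byte sequence, and every four-byte sequence is a non-BMP char.
    for offset, byte in enumerate(raw):
        if byte >= 0xF0:
            raise UnicodeDecodeError(
                CODEC_NAME, raw, offset, offset + 4,
                "non-BMP character not allowed",
            )
    return text


if __name__ == "__main__":
    print(encode_bmp("caf\u00e9"))
    try:
        decode_bmp("\U0001F600".encode("utf-8"))
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

With something like this, code that is not prepared for surrogates would fail loudly at the I/O boundary in both directions instead of misbehaving later on a narrow build.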