> The only drawback I can see is if the UTF-8 bytes actually decode to a
> half surrogate. However, half surrogates should really only occur in
> UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
> anyway!

Right: that's the rationale for UTF-8b. Encoding half surrogates
violates parts of the Unicode spec, so UTF-8b is "safe".

> As for handling this case, you could either:
>
> 1. Raise an exception (which is what you're trying to avoid)
>
> or:
>
> 2. Treat it as invalid UTF-8 and map the bytes to half surrogates
> (encoding would produce the original bytes).
>
> I'd prefer option 2.

I hadn't thought of this case, but you are right - they *are* illegal
bytes, after all. Raising an exception would be useless since the whole
point of this codec is to never raise unicode errors.

Regards,
Martin
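
For illustration, option 2 can be sketched roughly as below. This is not the codec under discussion: it assumes that the "surrogateescape" error handler in Python 3 behaves like the UTF-8b scheme described here, and the helper names decode_utf8b and encode_utf8b are hypothetical.

```python
# Sketch only: Python 3's "surrogateescape" error handler stands in for UTF-8b.
# The helper names below are made up for this example.

def decode_utf8b(raw: bytes) -> str:
    """Decode without raising: each undecodable byte 0xNN becomes U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Turn U+DC80..U+DCFF half surrogates back into the bytes 0x80..0xFF."""
    return text.encode("utf-8", errors="surrogateescape")


# The three bytes a naive encoder would emit for the half surrogate U+DCFF.
encoded_half_surrogate = b"\xed\xb3\xbf"

# Option 2: the sequence is treated as invalid UTF-8, so every byte is
# mapped to its own half surrogate instead of raising an exception.
text = decode_utf8b(encoded_half_surrogate)
print([hex(ord(ch)) for ch in text])

# Encoding produces the original bytes again.
print(encode_utf8b(text) == encoded_half_surrogate)
```

Valid UTF-8 passes through unchanged, so only the illegal bytes end up as half surrogates.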