Martin v. Löwis wrote:
>> If the bytes are mapped to single half surrogate codes instead of the
>> normal pairs (low+high), then I can see that decoding could never be
>> ambiguous and encoding could produce the original bytes.
>
> I was confused by Markus Kuhn's original UTF-8b specification. I have
> now changed the PEP to avoid using PUA characters at all.
>
I find the PEP easier to understand now.

In detail I'd say that if a sequence of bytes >=0x80 is found which is not valid UTF-8, then the first byte is mapped to a half surrogate and then decoding is continued from the next byte.

The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway!

As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid)

or:

2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes).

I'd prefer option 2.

Anyway, +1 from me.
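For illustration, here is a minimal sketch of the round trip described above. It assumes a Python 3 interpreter that provides the `surrogateescape` error handler; the helper names `escape_decode` and `escape_encode` are hypothetical and exist only for this example. It shows an invalid byte being mapped to a single half surrogate, and a UTF-8-style encoded half surrogate being handled as in option 2.

```python
def escape_decode(raw: bytes) -> str:
    # Hypothetical helper: decode as UTF-8, mapping each undecodable
    # byte 0xNN (>= 0x80) to the lone half surrogate U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")


def escape_encode(text: str) -> bytes:
    # Hypothetical helper: encode as UTF-8, turning each escaped
    # half surrogate U+DCNN back into the original byte 0xNN.
    return text.encode("utf-8", errors="surrogateescape")


# An invalid byte (0xFF) between ASCII letters becomes U+DCFF,
# and decoding simply continues with the next byte.
raw = b"ab\xffcd"
text = escape_decode(raw)
assert text == "ab\udcffcd"
assert escape_encode(text) == raw

# Bytes that would "decode" to a half surrogate (ED A0 80 for U+D800)
# are treated as invalid UTF-8, so each byte is escaped on its own.
surrogate_bytes = b"\xed\xa0\x80"
escaped = escape_decode(surrogate_bytes)
assert escaped == "\udced\udca0\udc80"
assert escape_encode(escaped) == surrogate_bytes
```

In both cases encoding reproduces the original bytes, which is the property the half-surrogate mapping is meant to guarantee.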