On 28 Aug 2014, at 19:54, Glenn Linderman wrote:

> On 8/28/2014 10:41 AM, R. David Murray wrote:
>> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman
>> <v+python at g.nevcal.com> wrote:
>> [...]
>> Also for cases where the data stream is *supposed* to be in a given
>> encoding, but contains undecodable bytes. Showing the stuff that
>> incorrectly decodes as whatever it decodes to is generally what you
>> want in that case.
>
> Sure, people can learn to recognize mojibake for what it is, and maybe
> even learn to recognize it for what it was intended to be, in limited
> domains. But suppressing/replacing the surrogates doesn't help with
> that... would it not be better to replace the surrogates with an
> escape sequence that shows the original, undecodable, byte value?
> Like \xNN ?

For that we could extend the "backslashreplace" codec error callback, so
that it can be used for decoding too, not just for encoding. I.e.

    b"a\xffb".decode("utf-8", "backslashreplace")

would return

    "a\\xffb"

Servus,
   Walter
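
To make the proposal concrete, here is a minimal sketch of what such a decoding callback might look like when written in Python using codecs.register_error. The handler name "backslashreplace_decode" and the function name are made up for illustration; this is not the actual implementation of the built-in "backslashreplace" handler, only an approximation of the behaviour described above.

```python
import codecs


def backslashreplace_decode(exc):
    """Hypothetical error callback: escape undecodable bytes as \\xNN."""
    if isinstance(exc, UnicodeDecodeError):
        # exc.object is the input bytes; start/end delimit the bad span.
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("\\x%02x" % byte for byte in bad)
        # Return the replacement text and the position to resume decoding.
        return replacement, exc.end
    # This sketch only covers decoding; re-raise anything else.
    raise exc


# Register under an illustrative name so the built-in handler is untouched.
codecs.register_error("backslashreplace_decode", backslashreplace_decode)

result = b"a\xffb".decode("utf-8", "backslashreplace_decode")
print(result)  # a\xffb
```

Registering under a separate name keeps the sketch from shadowing the existing encoding-side "backslashreplace" handler while the idea is being discussed.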