On 8/28/2014 12:30 AM, MRAB wrote:
> On 2014-08-28 05:56, Glenn Linderman wrote:
>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>>> Glenn Linderman writes:
>>> > On 8/26/2014 4:31 AM, MRAB wrote:
>>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>> > >> Nick Coghlan writes:
>>>
>>> > > How about:
>>> > >
>>> > > replace_surrogate_escapes(s, replacement='\uFFFD')
>>> > >
>>> > > If you want them removed, just pass an empty string as the
>>> > > replacement.
>>>
>>> That seems better to me (I had too much C for breakfast, I think).
>>>
>>> > And further, replacement could be a vector of 128 characters, to do
>>> > immediate transcoding,
>>>
>>> Using what encoding?
>>
>> The vector would contain the transcoding. Each lone surrogate would map
>> to a character in the vector.
>>
>>> If you knew that much, why didn't you use
>>> (write, if necessary) an appropriate codec? I can't envision this
>>> being useful.
>>
>> If the data format describes its encoding, possibly containing data from
>> several encodings in various spots, then perhaps it is best read as
>> binary, and processed as binary until those definitions are found.
>>
>> But an alternative would be to read with surrogate escapes, and then
>> when the encoding is determined, to transcode the data. Previously, a
>> proposal was made to reverse the surrogate escapes to the original
>> bytes, and then apply the (now known) appropriate codec. There are not
>> appropriate codecs that can convert directly from surrogate escapes to
>> the desired end result. This technique could be used instead, for
>> single-byte, non-escaped encodings. On the other hand, writing specialty
>> codecs for the purpose would be more general.
>>
> There'll be a surrogate escape if a byte couldn't be decoded, but just
> because a byte could be decoded, it doesn't mean that it's correct.
>
> If you picked the wrong encoding, the other codepoints could be wrong
> too.

Aha! Thanks for pointing out the flaw in my reasoning.
But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.
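
To make the point concrete, here is a minimal sketch in Python. Note that replace_surrogate_escapes is not an existing standard-library function; this is only a hypothetical implementation using the signature proposed in the quoted text, and transcode_escapes is an equally hypothetical illustration of the 128-character vector idea. The sample strings are made up for the example.

```python
import re

# The "surrogateescape" error handler maps each undecodable byte
# 0x80..0xFF to a lone surrogate U+DC80..U+DCFF.
_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Hypothetical helper; not part of the standard library."""
    # A function is used as the repl so backslashes in `replacement`
    # are never interpreted as regex escapes.
    return _ESCAPES.sub(lambda m: replacement, s)


def transcode_escapes(s, table):
    """Hypothetical: map each escape to table[byte - 0x80] (128 entries)."""
    if len(table) != 128:
        raise ValueError('table must have exactly 128 entries')
    return _ESCAPES.sub(lambda m: table[ord(m.group()) - 0xDC80], s)


# Latin-1 bytes read as UTF-8: the 0xE9 byte cannot be decoded and is escaped.
latin1_bytes = 'caf\xe9 au lait'.encode('latin-1')
escaped = latin1_bytes.decode('utf-8', 'surrogateescape')

cleaned = replace_surrogate_escapes(escaped)        # escape -> U+FFFD
stripped = replace_surrogate_escapes(escaped, '')   # escape removed

# Reversing the escapes to bytes, then applying the now-known codec.
recovered = escaped.encode('utf-8', 'surrogateescape').decode('latin-1')

# The 128-entry vector idea, here with a Latin-1 table.
latin1_table = bytes(range(0x80, 0x100)).decode('latin-1')
transcoded = transcode_escapes(escaped, latin1_table)

# MRAB's point: UTF-8 bytes read as Latin-1 decode "successfully", so
# there are no escapes to replace, yet the text is still wrong.
utf8_bytes = 'caf\xe9 au lait'.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1', 'surrogateescape')
unchanged = replace_surrogate_escapes(mojibake)
```

In the first case the escape marks exactly the byte that failed to decode, so it can be replaced, removed, or transcoded. In the last case every byte decoded, so the function has nothing to act on and the mis-decoded text passes through untouched.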