Oren Tirosh <oren-py-d@hishome.net> writes:

> > If you look at the patch, you see that it precisely does what you
> > propose to do: add a callback to the charmap codec:
> >
> > - it deletes charmap_decoding_error
> > - it adds state to feed the callback function
> > - it replaces the old call to charmap_decoding_error by
>
> But it's NOT an error. It's new encoding functionality.

What is not an error? The handling? Certainly: the error and the
error handler are different things; error handlers are not errors.

"ignore" and "replace" are not errors either; they are also new
encoding functionality. That is the very nature of handlers: they add
functionality.

> The real problem was some missing functionality in codecs. Here are two
> approaches to solve the problem:
>
> 1. Add the missing functionality.

That is not feasible, since you want that functionality also for
codecs you haven't heard of.

> 2. Keep the old, limited functionality, let it fail, catch the error,
> re-use an argument originally intended for an error handling strategy to
> shoehorn a callback that can implement the missing functionality, add a new
> name-based registry to overcome the fact that the argument must be a string.

That is possible, but inefficient. It is also the approach that people
use today, and the reason for this PEP to exist. The current
UnicodeError does not report any detail on the state that the codec
was in.

> Since this approach is conceptually stuck on treating it as an error it
> actually creates and discards a new exception object for each character
> converted via this path.

It's worse: if you find that the entire string cannot be encoded, you
typically have two choices:

- you perform a binary search. That may cause log n exceptions.

- you encode every character on its own. That reduces the number of
  exceptions to the number of unencodable characters, but it also means
  that the output is wrong for some encodings: you will always get the
  shift-in/shift-out sequences that your encoding may specify.

On decoding, this is worse still: feeding a byte at a time may fail
altogether if you happen to break a multibyte character, whereas
feeding the entire string happily consumes long sequences of
characters and only runs into a single problem byte.

Regards,
Martin
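
The per-character fallback described in the last paragraphs can be
sketched roughly like this (an illustration in present-day Python, not
code from the thread; encode_with_fallback is a made-up name):

    def encode_with_fallback(text, encoding, substitute=b"?"):
        # Fast path: try to encode the whole string at once.
        try:
            return text.encode(encoding)
        except UnicodeError:
            pass
        # Slow path: encode one character at a time. This pays one
        # exception per unencodable character and, for stateful
        # encodings, emits spurious shift-in/shift-out sequences
        # around every character.
        pieces = []
        for ch in text:
            try:
                pieces.append(ch.encode(encoding))
            except UnicodeError:
                pieces.append(substitute)
        return b"".join(pieces)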
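
The name-based registry under discussion is, roughly, what PEP 293
eventually shipped as codecs.register_error. A sketch of how a callback
plugs into the errors argument (again an illustration; the handler name
is made up):

    import codecs

    def xml_charref(exc):
        # The handler receives the exception, which carries the object
        # being encoded and the failing range, and returns a
        # replacement string plus the position at which to resume.
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        refs = "".join("&#%d;" % ord(ch)
                       for ch in exc.object[exc.start:exc.end])
        return refs, exc.end

    codecs.register_error("xml-charref", xml_charref)
    "Grüße".encode("ascii", "xml-charref")   # b'Gr&#252;&#223;e'

The built-in "xmlcharrefreplace" handler added by the PEP does
essentially this, without any exception being raised per character at
the call site.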