M.-A. Lemburg wrote:

>>> [...]
>>> def decode_stateful(data, state=None):
>>>     ... decode and modify state ...
>>>     return (decoded_data, length_consumed, state)
>>
>> Another option might be that the decode function changes
>> the state object in place.
>
> Good idea. But that's totally up to the implementor.

>>> where the object type and contents of the state variable
>>> is defined per codec (e.g. could be a tuple, just a single
>>> integer or some other special object).
>>
>> If a tuple is passed and returned this makes it possible
>> from Python code to mangle the state. IMHO this should be
>> avoided if possible.

>>> Otherwise we'll end up having different interface
>>> signatures for all codecs, and extending them to accommodate
>>> future enhancements will become unfeasible without
>>> introducing yet another set of APIs.
>>
>> We already have slightly different decoding functions:
>> utf_16_ex_decode() takes additional arguments.
>
> Right - it was a step in the wrong direction. Let's not
> use a different path for the future.

utf_16_ex_decode() serves a purpose: it helps implement the UTF-16
decoder, which has to switch to UTF-16-BE or UTF-16-LE according to
the BOM, so utf_16_ex_decode() needs a way to communicate that back
to the caller.

>>> Let's discuss this some more and implement it for Python 2.5.
>>> For Python 2.4, I think we can get away with what we already
>>> have:
>>
>> [...]
>>
>> OK, I've updated the patch.

>>> [...]
>>> The buffer logic should only be used for streams
>>> that do not support the interface to push back already
>>> read bytes (e.g. .unread()).
>>>
>>> From a design perspective, keeping read data inside the
>>> codec is the wrong thing to do, simply because it leaves
>>> the input stream in an undefined state in case of an error
>>> and there's no way to correlate the stream's read position
>>> to the location of the error.
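To make the utf_16_ex_decode() point concrete, here is what the extra
return value looks like in practice (shown with Python 3 byte-string
syntax; the thread discusses Python 2.4, but the tuple shape of the
API is the same):

```python
import codecs

# utf_16_ex_decode() returns a 3-tuple instead of the usual 2-tuple:
# (decoded_text, bytes_consumed, byteorder), where byteorder is
# -1 for little-endian, 1 for big-endian, 0 if still undetermined.
# This is how the BOM decision gets communicated back to the caller.
result = codecs.utf_16_ex_decode(b"\xff\xfeA\x00")
print(result)  # -> ('A', 4, -1): LE BOM detected and consumed
```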
>>>
>>> With a pushback method on the stream, all the stream
>>> data will be stored on the stream, not the codec, so
>>> the above would no longer be a problem.
>>
>> On the other hand this requires a special stream. Data
>> already read is part of the codec state, so why not
>> put it into the codec?
>
> Ideally, the codec should not store data,

I consider the remaining undecoded bytes to be part of the codec
state once they have been read from the stream.

> but only
> reference it. It's better to keep things well
> separated which is why I think we need the .unread()
> interface and eventually a queue interface to support
> the feeding operation.

>>> However, we can always add the .unread() support to the
>>> stream codecs at a later stage, so it's probably ok
>>> to default to the buffer logic for Python 2.4.
>>
>> OK.

>>>> That still leaves the issue
>>>> of the last read operation, which I'm tempted to leave unresolved
>>>> for Python 2.4. No matter what the solution is, it would likely
>>>> require changes to all codecs, which is not good.
>>>
>>> We could have a method on the codec which checks whether
>>> the codec buffer or the stream still has pending data
>>> left. Using this method is an application-scope consideration,
>>> not a codec issue.
>>
>> But this means that the normal error handling can't be used
>> for those trailing bytes.
>
> Right, but then: missing data (which usually causes the trailing
> bytes) is really something for the application to deal with,
> e.g. by requesting more data from the user, another application
> or trying to work around the problem in some way. I don't think
> that the codec error handler can practically cover these
> possibilities.

But in many cases the user might want to use "ignore" or "replace"
error handling.

Bye,
Walter Dörwald
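[A quick illustration of that last point (Python 3 syntax; the error
handlers themselves already exist in 2.4): trailing bytes from a
truncated multi-byte sequence are exactly the kind of error the
standard "ignore"/"replace" handlers already know how to deal with,
which is why routing them around the error machinery loses something.]

```python
# Drop the final byte of the 3-byte UTF-8 sequence for U+20AC,
# simulating a stream that ended mid-character.
truncated = "\u20ac".encode("utf-8")[:-1]   # b'\xe2\x82'

# The normal error handlers cover the trailing incomplete bytes:
print(truncated.decode("utf-8", "replace"))  # one U+FFFD replacement char
print(truncated.decode("utf-8", "ignore"))   # empty string
```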