M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
>
>> Walter Dörwald wrote:
>>
>>> OK, let's come up with a patch that fixes the incomplete byte
>>> sequences problem and then discuss non-stream APIs.
>>>
>>> So, what should the next step be?
>>
>> I think your first patch should be taken as a basis for that.
>
> We do need a way to communicate state between the codec
> and Python.
>
> However, I don't like the way that the patch
> implements this state handling: I think we should use a
> generic "state" object here which is passed to the stateful
> codec and returned together with the standard return values
> on output:
>
> def decode_stateful(data, state=None):
>     ... decode and modify state ...
>     return (decoded_data, length_consumed, state)

Another option might be that the decode function changes
the state object in place.

> where the object type and contents of the state variable
> is defined per codec (e.g. could be a tuple, just a single
> integer or some other special object).

If a tuple is passed and returned, this makes it possible
for Python code to mangle the state. IMHO this should be
avoided if possible.

> Otherwise we'll end up having different interface
> signatures for all codecs and extending them to accommodate
> for future enhancement will become infeasible without
> introducing yet another set of APIs.

We already have slightly different decoding functions:
utf_16_ex_decode() takes additional arguments.

> Let's discuss this some more and implement it for Python 2.5.
> For Python 2.4, I think we can get away with what we already
> have:
> [...]

OK, I've updated the patch.

> [...]
> The buffer logic should only be used for streams
> that do not support the interface to push back already
> read bytes (e.g. .unread()).
>
> From a design perspective, keeping read data inside the
> codec is the wrong thing to do, simply because it leaves
> the input stream in an undefined state in case of an error
> and there's no way to correlate the stream's read position
> to the location of the error.
>
> With a pushback method on the stream, all the stream
> data will be stored on the stream, not the codec, so
> the above would no longer be a problem.

On the other hand, this requires a special stream. Data
already read is part of the codec state, so why not
put it into the codec?

> However, we can always add the .unread() support to the
> stream codecs at a later stage, so it's probably ok
> to default to the buffer logic for Python 2.4.

OK.

>> That still leaves the issue
>> of the last read operation, which I'm tempted to leave unresolved
>> for Python 2.4. No matter what the solution is, it would likely
>> require changes to all codecs, which is not good.
>
> We could have a method on the codec which checks whether
> the codec buffer or the stream still has pending data
> left. Using this method is an application scope consideration,
> not a codec issue.

But this means that the normal error handling can't be used
for those trailing bytes.

Bye,
   Walter Dörwald
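
For illustration, the decode_stateful() signature quoted above could be sketched roughly as follows for UTF-8. This is only a sketch under stated assumptions, not the code from the patch: the state is assumed to be the undecoded tail bytes of the previous call, the "final" flag is an addition that is not part of the quoted signature, and since leftover bytes move into the state, the reported consumed length is assumed to be the whole input.

```python
import codecs


def decode_stateful(data, state=None, final=False):
    """Hypothetical stateful UTF-8 decode, modelled on the quoted signature.

    state: bytes left undecoded by the previous call (assumption).
    final: if True, an incomplete trailing sequence is an error
           (assumption; not part of the quoted signature).
    """
    pending = state or b""
    buf = pending + data
    # codecs.utf_8_decode stops before an incomplete trailing sequence
    # when final is False and reports how many bytes it decoded.
    decoded, used = codecs.utf_8_decode(buf, "strict", final)
    # Everything not decoded yet is carried over in the returned state.
    return decoded, len(data), buf[used:]


# "ä" is encoded as b"\xc3\xa4"; feed it split across two calls.
text1, _, st = decode_stateful(b"\xc3")
text2, _, st2 = decode_stateful(b"\xa4b", st)
```

With final=True and a non-empty state, the same call raises UnicodeDecodeError, so trailing bytes go through the normal error handling only if the caller signals the end of input.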