On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald <walter at livinglogic.de> wrote:
> Hye-Shik Chang wrote:
>
> > On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> > <walter at livinglogic.de> wrote:
> >
> >>Pythons unicode machinery currently has problems when decoding
> >>incomplete input.
> >>
> >>When codecs.StreamReader.read() encounters a decoding error it
> >>reads more bytes from the input stream and retries decoding.
> >>This is broken for two reasons:
> >>1) The error might be due to a malformed byte sequence in the input,
> >>   a problem that can't be fixed by reading more bytes.
> >>2) There may be no more bytes available at this time. Once more
> >>   data is available decoding can't continue because bytes from
> >>   the input stream have already been read and thrown away.
> >>(sio.DecodingInputFilter has the same problems)
> >
> > StreamReaders and -Writers from CJK codecs are not suffering from
> > this problems because they have internal buffer for keeping states
> > and incomplete bytes of a sequence. In fact, CJK codecs has its
> > own implementation for UTF-8 and UTF-16 on base of its multibytecodec
> > system. It provides a "working" StreamReader/Writer already. :)
>
> Seems you had the same problems with the builtin stream readers! ;)
>
> BTW, how do you solve the problem that incomplete byte sequences
> are retained in the middle of a stream, but should generate errors
> at the end?
>

Rough pseudo code here (the real implementation is written in C in
CJKCodecs):

    class StreamReader:
        pending = ''  # incomplete bytes kept from the previous read

        def read(self, size=-1):
            outputbuffer = ''  # decoded output so far
            while True:
                r = fp.read(size)
                if self.pending:
                    r = self.pending + r
                    self.pending = ''

                if r:
                    try:
                        outputbuffer = r.decode('utf-8')
                    except MBERR_TOOFEW:  # incomplete multibyte sequence
                        pass
                    except MBERR_ILLSEQ:  # illegal sequence
                        raise UnicodeDecodeError, "illseq"

                if not r or size == -1:  # end of the stream
                    if r was not fully consumed for the output:
                        raise UnicodeDecodeError, "toofew"

                if r was not fully consumed for the output:
                    self.pending = the unconsumed remainder of r

                if (size == -1 or               # read everything in one go
                    len(outputbuffer) > 0 or    # output buffer isn't empty
                    original length of r == 0): # the end of the stream
                    break

                size = 1  # read 1 byte on the next try

            return outputbuffer

CJKCodecs' multibytecodec structure has distinct internal error codes
for "illegal sequence" and "incomplete sequence". Each internal
codec also receives a flag that indicates whether an immediate flush is
needed at that point (at the end of a stream, and for the simple decode
functions).

Regards,
Hye-Shik
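
For readers who want something runnable, the buffering idea in the pseudo
code above can be sketched in present-day Python. This is only an
illustration under stated assumptions, not the CJKCodecs C code: the class
name BufferingUTF8Reader is hypothetical, and the pending bytes are held by
the standard library's codecs.getincrementaldecoder() rather than by a
hand-written buffer. Its final argument plays the role of the flush flag
described above. An empty read is treated as the end of the stream, as in
the pseudo code.

```python
import codecs
import io


class BufferingUTF8Reader:
    """Hypothetical reader that retains incomplete sequences between reads."""

    def __init__(self, stream, errors="strict"):
        self.stream = stream
        # The incremental decoder keeps any trailing incomplete bytes itself.
        self.decoder = codecs.getincrementaldecoder("utf-8")(errors)

    def read(self, size=-1):
        while True:
            data = self.stream.read(size)
            # Assumption: an empty read, or a read-everything call, means
            # the end of the stream, so leftover bytes must become an error.
            at_end = size < 0 or not data
            # Illegal sequences raise UnicodeDecodeError right away;
            # incomplete ones raise only when final=True.
            text = self.decoder.decode(data, final=at_end)
            if text or at_end:
                return text
            size = 1  # nothing decoded yet: fetch one more byte and retry


# A three-byte character arrives one byte at a time.
reader = BufferingUTF8Reader(io.BytesIO("\u20ac".encode("utf-8")))
print(repr(reader.read(1)))
```

A stream that ends in the middle of a sequence raises UnicodeDecodeError on
the read that reaches the end, and an illegal byte raises it immediately,
which mirrors the "toofew" and "illseq" cases.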