Hye-Shik Chang wrote:

> [...]
>>BTW, how do you solve the problem that incomplete byte sequences
>>are retained in the middle of a stream, but should generate errors
>>at the end?
>
> Rough pseudo code here: (it's written in C in CJKCodecs)
>
> class StreamReader:
>
>     pending = '' # incomplete
>
>     def read(self, size=-1):
>         while True:
>             r = fp.read(size)
>             if self.pending:
>                 r = self.pending + r
>                 self.pending = ''
>
>             if r:
>                 try:
>                     outputbuffer = r.decode('utf-8')
>                 except MBERR_TOOFEW: # incomplete multibyte sequence
>                     pass
>                 except MBERR_ILLSEQ: # illegal sequence
>                     raise UnicodeDecodeError, "illseq"
>
>             if not r or size == -1: # end of the stream
>                 if r have not consumed up for the output:
>                     raise UnicodeDecodeError, "toofew"

Here's the problem: I'd like the streamreader to be able to
continue even when there is no input available *now*.
Perhaps there should be an additional argument to read()
named final? If final is true, the stream reader makes sure
that all pending bytes have been used up.

>             if r have not consumed up for the output:
>                 self.pending = remainders of r
>
>             if (size == -1 or # one time read up
>                 len(outputbuffer) > 0 or # output buffer isn't empty
>                 original length of r == 0): # the end of the stream
>                 break
>
>             size = 1 # read 1 byte in next try
>
>         return outputbuffer
>
>
> CJKcodecs' multibytecodec structure has distinguished internal error
> codes for "illegal sequence" and "incomplete sequence". And each
> internal codecs receive a flag that indicates if immediate flush
> is needed at time. (for the end of streams and simple decode functions)

Bye,
   Walter Dörwald
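
A minimal Python sketch of the `final` argument proposed above. It assumes the standard library's incremental decoder interface (`codecs.getincrementaldecoder`) rather than CJKCodecs' C implementation. The class name `FinalAwareReader` and the way input arrives in two pieces are made up for illustration.

```python
import codecs
import io


class FinalAwareReader:
    """Hypothetical reader whose read() takes a `final` flag."""

    def __init__(self, stream, encoding="utf-8", errors="strict"):
        self.stream = stream
        # The incremental decoder retains incomplete byte sequences
        # between calls, so no separate `pending` attribute is needed.
        self.decoder = codecs.getincrementaldecoder(encoding)(errors)

    def read(self, size=-1, final=False):
        data = self.stream.read(size)
        if data is None:
            # Some non-blocking streams return None when nothing is
            # available right now; treat that as "no new bytes".
            data = b""
        # final=False: a trailing partial sequence is kept for later.
        # final=True: a trailing partial sequence raises UnicodeDecodeError.
        return self.decoder.decode(data, final)


euro = "\u20ac".encode("utf-8")  # EURO SIGN, three bytes in UTF-8

# Case 1: input arrives in two pieces; the reader continues across the gap.
reader = FinalAwareReader(io.BytesIO(euro[:2]))
first = reader.read()                 # nothing decodable yet, no error
reader.stream = io.BytesIO(euro[2:])  # simulate more input arriving later
second = reader.read(final=True)      # sequence is now complete

# Case 2: the stream really ends in the middle of a sequence.
truncated = FinalAwareReader(io.BytesIO(euro[:2]))
try:
    truncated.read(final=True)
    truncated_error = False
except UnicodeDecodeError:
    truncated_error = True
```

With `final` left false the reader can be called again whenever more bytes show up; only the caller knows when the stream is really finished, so only the caller passes `final=True`.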