Showing content from https://mail.python.org/pipermail/python-dev/2004-July/046494.html below:

[Python-Dev] Decoding incomplete unicode

Walter Dörwald walter at livinglogic.de
Wed Jul 28 17:13:17 CEST 2004
Hye-Shik Chang wrote:

> [...]
>>BTW, how do you solve the problem that incomplete byte sequences
>>are retained in the middle of a stream, but should generate errors
>>at the end?
> 
> Rough pseudo code here: (it's written in C in CJKCodecs)
> 
> class StreamReader:
> 
>     pending = '' # incomplete 
> 
>     def read(self, size=-1):
>         while True:
>             r = fp.read(size)
>             if self.pending:
>                 r = self.pending + r
>                 self.pending = ''
> 
>             if r:
>                 try:
>                     outputbuffer = r.decode('utf-8')
>                 except MBERR_TOOFEW: # incomplete multibyte sequence
>                     pass
>                 except MBERR_ILLSEQ: # illegal sequence
>                     raise UnicodeDecodeError, "illseq"
> 
>             if not r or size == -1: # end of the stream
>                 if r has not been fully consumed for the output:
>                     raise UnicodeDecodeError, "toofew"

Here's the problem: I'd like the streamreader to be able
to continue even when there is no input available *now*.
Perhaps there should be an additional argument to read()
named final? If final is true, the stream reader makes
sure that all pending bytes have been used up.
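
A minimal sketch of that idea, assuming a current Python 3 where the
incremental decoders in the codecs module accept a final argument.
FinalAwareReader and its read() signature are hypothetical names used
only for illustration; this is not the CJKCodecs implementation:

```python
import codecs
import io


class FinalAwareReader:
    """Hypothetical reader that keeps incomplete trailing bytes
    between calls unless the caller passes final=True."""

    def __init__(self, stream, encoding="utf-8"):
        self.stream = stream
        # The incremental decoder buffers an incomplete multibyte
        # sequence internally and prepends it to the next input.
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, size=-1, final=False):
        data = self.stream.read(size)
        # final=False: incomplete trailing bytes are retained.
        # final=True: leftover bytes raise UnicodeDecodeError.
        return self.decoder.decode(data, final=final)
```

With final left false, a read() that ends in the middle of a character
returns only the complete characters and waits for more input. Passing
final=True on the last call turns any leftover bytes into an error.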

>             if r has not been fully consumed for the output:
>                 self.pending = remainders of r
> 
>             if (size == -1 or               # one time read up
>                 len(outputbuffer) > 0 or    # output buffer isn't empty
>                 original length of r == 0): # the end of the stream
>                     break
> 
>             size = 1 # read 1 byte in next try
> 
>         return outputbuffer
> 
> 
> CJKcodecs' multibytecodec structure has distinct internal error
> codes for "illegal sequence" and "incomplete sequence".  And each
> internal codec receives a flag that indicates whether an immediate
> flush is needed at that time.  (for the end of streams and simple
> decode functions)
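
The distinction between the two error kinds can be approximated in
pure Python. The sketch below assumes a current Python 3 and uses the
standard incremental decoder rather than CJKCodecs' internal error
codes; classify() is a hypothetical helper, not part of any library:

```python
import codecs


def classify(data, encoding="utf-8"):
    """Return 'ok', 'incomplete', or 'illegal' for a byte string.

    Hypothetical helper for illustration only.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False: a truncated trailing sequence is buffered,
        # while a genuinely invalid sequence still raises.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "illegal"
    # The first element of getstate() holds bytes that have been
    # accepted but not yet decoded.
    pending = decoder.getstate()[0]
    return "incomplete" if pending else "ok"
```

Here an "incomplete" result plays the role of the too-few-bytes case:
it is harmless in the middle of a stream and becomes an error only
when the caller signals the end of input.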

Bye,
    Walter Dörwald

