Tue Apr 5 22:05:14 CEST 2005 · https://mail.python.org/pipermail/python-dev/2005-April/052535.html

Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

This can be improved, of course: If the first byte is "a", it most
definitely is *not* a UTF-8 signature.

So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being \xef, we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.
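
To make the first-byte rule concrete, here is a minimal Python sketch. The helper names bytes_needed_for_bom_check and strip_bom are hypothetical and only illustrate the reasoning above; they are not part of any StreamReader implementation.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf", the UTF-8 encoding of U+FEFF


def bytes_needed_for_bom_check(first_byte: int) -> int:
    """Hypothetical helper: bytes to buffer before the BOM question is settled."""
    # Only a lead byte of 0xEF can begin the signature. That byte also begins
    # a three-byte UTF-8 sequence, so three bytes are required either way.
    return len(BOM) if first_byte == BOM[0] else 1


def strip_bom(buffered: bytes):
    """Return the payload without a leading BOM, or None if undecided."""
    if not buffered:
        return None  # nothing seen yet, so nothing can be decided
    if len(buffered) < bytes_needed_for_bom_check(buffered[0]):
        return None  # starts with 0xEF but fewer than three bytes so far
    if buffered.startswith(BOM):
        return buffered[len(BOM):]
    return buffered
```

With this rule a stream holding only "ab" is passed through as soon as the "a" arrives, and buffering happens only when the first byte is \xef.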

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.

Shouldn't an empty read from the underlying stream be taken
as an EOF?
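
One way to picture the two ideas together is the sketch below: a read loop that treats an empty read as EOF and passes that on to the decoder as a final flag. It assumes a blocking binary stream, where an empty read really does mean end of file, and it uses the standard library's incremental "utf-8-sig" decoder, whose decode method accepts a final argument. The function name read_text and the chunk_size parameter are made up for illustration.

```python
import codecs
import io


def read_text(stream, chunk_size=1):
    """Hypothetical helper: decode a blocking binary stream chunk by chunk."""
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        # Assumption: on a blocking stream, an empty read signals EOF.
        at_eof = not chunk
        # final=True tells the decoder to handle whatever it still buffers.
        parts.append(decoder.decode(chunk, final=at_eof))
        if at_eof:
            break
    return "".join(parts)


short_file = read_text(io.BytesIO(b"ab"))
with_signature = read_text(io.BytesIO(codecs.BOM_UTF8 + b"ab"))
```

Both calls should yield the same two characters, with the signature skipped in the second case. On a non-blocking stream an empty read would not be a reliable EOF marker, which is where an explicit final argument would still matter.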

Regards,
Martin

[Python-Dev] Unicode byte order mark decoding