[Python-Dev] PEP 263 - default encoding

Martin v. Loewis martin@v.loewis.de
18 Mar 2002 09:04:05 +0100
"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> The parser accepts programs encoded in unicode.  We provide some
> hooks to help you get from encodings convenient for your environment
> to Unicode, and some sample implementations of things to hang on the
> hooks.  But if there are problems with non-unicode files, they're
> your problems."

I still can't see how this is different from what the PEP
says. "encoded in Unicode" is, of course, a weak statement, since
Unicode is not an encoding (UTF-8 would be). With the PEP, people can
write source code in different encodings, but any problems they get
are their problems.

>   o There may be some audiences who are poorly served (Mr. Suzuki).

In phase two of the PEP, I don't think there will be large audiences
that are poorly served. If someone wants to write Python source in a
then-unsupported encoding, they can write "hooks" to support it. For
importing modules, e.g., they just need to redefine __import__.
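
A rough sketch of such a hook in today's Python (the names, the flat
module lookup, and the fallback are all made up for illustration; this
is not the PEP's mechanism):

    import builtins
    import sys
    import types

    _real_import = builtins.__import__

    def import_with_encoding(name, *args, encoding="iso-2022-jp", **kwargs):
        # Hypothetical hook: if <name>.py exists next to us, decode it from
        # the custom encoding and exec it; otherwise defer to the real import.
        if name in sys.modules:
            return sys.modules[name]
        try:
            with open(name + ".py", "rb") as f:
                source = f.read().decode(encoding)
        except OSError:
            return _real_import(name, *args, **kwargs)
        module = types.ModuleType(name)
        exec(compile(source, name + ".py", "exec"), module.__dict__)
        sys.modules[name] = module
        return module

    builtins.__import__ = import_with_encoding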

>   o I think it will definitely tend to encourage use of national/
>     platform encodings rather than UTF-8 in source.  It may be hard to
>     get this sun to set.

It is traditional Python policy not to take sides in political
debates. If this sun does not set, what is the problem?

>   o I think it makes it hard to implement helper tools (eg python-mode).

Harder than with those hooks? That's hard to believe. I assume you
primarily care about editors here. Editors either support multiple
encodings, or they don't. If they don't, you had best write your source
code in the encoding your editor supports and declare that encoding for
Python. If they do support different encodings, they may already
correctly recognize the declared encoding. If not, you may need to add
an additional declaration. Off-hand, I can't think of any editor where
this might be necessary.
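
For illustration, a declaration of the kind the PEP describes, in the
two comment styles that Emacs and vim already recognize (placed on the
first or second line of the file):

    # -*- coding: iso-8859-2 -*-
    # vim: set fileencoding=iso-8859-2 :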

>     Guido> I think even Mr. Suzuki isn't thinking of using UTF-16 in
>     Guido> his Unicode literals.  He currently sets UTF-16 as the
>     Guido> default encoding for data that he presumably reads from a
>     Guido> file.
>
> Well, I'm not a native Japanese.  But I have often edited English
> strings that occur in swaths of unrecognizable octets that would be
> Japanese if I had the terminal encoding set correctly.  I have also
> cut and pasted encoded Japanese into "binary" buffers.
>
> And how is he going to use regexps or formatting sugar without literal
> UTF-16 strings?

In stage 1 of the implementation, he can use either UTF-8 or EUC-JP in
Unicode literals. In stage 2, he can also use Shift_JIS and
iso-2022-jp.
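
The choice affects only the bytes in the file, not the value of the
literal. A small check, assuming nothing beyond the standard codecs:

    text = u"日本語"
    euc = text.encode("euc-jp")
    sjis = text.encode("shift_jis")
    assert euc != sjis                                       # different bytes on disk
    assert euc.decode("euc-jp") == sjis.decode("shift_jis") == text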

> Right, as long as by "work" you mean "it's formally undefined but
> 8-bit clean stuff just passes through."  The problem is that people
> often do unclean things, like type ALT 1 8 5 to insert an 0xB9 octet,
> which the editor assumes is intended to be š in a Latin-2 locale.
> However, if that file (which the user knows contains no Latin-2 at
> all) is read in a Latin-2 locale, and translated to Unicode, the byte
> value changes (in fact, it's no longer a byte value).  What's a parser
> to do?<wink>

I'm not sure I can follow this example. If you put byte 185 into a
Python source code file, and you declare the file as Latin-2, what
does that have to do with the locale? PEP 263 never mentions use of
the locale for anything.
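
The decoding is fully determined by the declaration. A quick check in
today's Python (iso-8859-2 being the standard codec name for Latin-2):

    raw = bytes([185])                 # the 0xB9 octet from the example
    print(raw.decode("iso-8859-2"))    # 'š' (U+0161), regardless of any locale setting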

> This can be made safe by not decoding the contents of ordinary string
> literals, but that requires that the parser has to do the lexing, you
> can't delegate it to a general-purpose codec.

Why is that? If the declared encoding of the file is Latin-2, the
parser will convert it into Unicode, then parse it, then reconvert
byte strings into Latin-2.
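
A rough sketch of that pipeline (the helper names and the module-level
constant are invented; this is not the actual tokenizer):

    DECLARED = "iso-8859-2"

    def load_source(path):
        # Whole file -> Unicode text, using the declared encoding.
        with open(path, "rb") as f:
            return f.read().decode(DECLARED)

    def byte_string_value(literal_text):
        # Plain (non-Unicode) string literal -> bytes in the declared encoding.
        return literal_text.encode(DECLARED)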

> Bingo.  And files which until that point embedded arbitrary binary
> (ie, not representing characters) stop working, quite possibly.

Breakage won't be silent, though. People will get a warning in phase
1, so they will know to declare an encoding.

If they have truly binary data in their source files (which I believe
is rare), they are advised to convert that data to \x escapes.
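
That is, rather than pasting the raw bytes into the literal, spell them
out; the file's encoding then no longer matters for that data (in
today's Python this would be a bytes literal):

    MAGIC = b"\xb9\x00\xff"   # explicit byte values, independent of the source encoding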

> This probably mostly works, based on mule experience.  But it requires
> the parser to have carnal knowledge of coding systems.  Isn't it
> preferable to insist on UTF-8 here, since it's simply changing the
> representation from one or two bytes back to constant-width one,
> without changing values?

It is no extra effort to support arbitrary encodings, compared to
supporting UTF-8 only. The parser just calls into the codec library,
and either gets an error or a Unicode string.
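
In outline, with today's codecs module (decode_source is an assumed
helper, not a real parser entry point):

    import codecs

    def decode_source(raw_bytes, declared_encoding):
        # Look the codec up by its declared name and decode; any problem
        # surfaces as the codec's own error.
        return codecs.lookup(declared_encoding).decode(raw_bytes)[0]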

> Also, you'd have to prohibit encodings using ISO 2022 control sequences,
> as there are always many legal ways to encode the same text (there is
> no requirement that a mode-switching sequence actually be followed by
> any text before switching to a different mode), and there's no way to
> distinguish except to record the original input.

That is indeed a problem - those byte strings would have different
values at run-time. I expect that most users will accept the problem,
since the strings still have their original "meaning".
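
For example, assuming the decoder tolerates a redundant mode switch
(which ISO 2022 itself permits):

    raw = b"\x1b$B\x1b(Bspam"            # switch into JIS X 0208 and straight back out
    text = raw.decode("iso-2022-jp")      # -> "spam"
    print(text.encode("iso-2022-jp"))     # -> b"spam", not the bytes the file contained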

Regards,
Martin


