Source: https://mail.python.org/pipermail/python-dev/2002-March/021459.html

[Python-Dev] PEP 263 considered faulty (for some Japanese)
Martin v. Loewis <martin@v.loewis.de>
19 Mar 2002 21:29:12 +0100
"SUZUKI Hisao" <suzuki@acm.org> writes:

> > And TextEdit cannot save as UTF-8?
> 
> It can.  But doing so suffers from "mojibake".

You mean, it won't read it back in properly? Is that because it won't
auto-detect the encoding, or does it not even support opening files as
UTF-8? Could it be told to write a UTF-8 signature into the file?
Would that help autodetection?
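The "UTF-8 signature" mentioned here is the byte order mark U+FEFF encoded in UTF-8, i.e. the three bytes EF BB BF at the start of a file. A minimal sketch of the kind of autodetection being asked about (modern Python, for illustration only):

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    # The UTF-8 "signature" is the BOM, U+FEFF, encoded as EF BB BF.
    return data.startswith(codecs.BOM_UTF8)

# A hypothetical source file that starts with the signature:
sample = codecs.BOM_UTF8 + b"print('hello')"
print(has_utf8_signature(sample))             # True
print(has_utf8_signature(b"print('hello')"))  # False
```

An editor that writes this signature gives any later reader an unambiguous hint that the file is UTF-8.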

> Anyway, until the stage2 comes true, you can write Japanese
> python files only in either EUC-JP or UTF-8 unless you hack up
> the interpreter, thus Python remains unsatisfactory to many
> present Japanese till the day of UTF-8.  We should either hurry
> up or wait still.

I expect that the localization patches that circulate now will
continue to apply (perhaps with minimal modifications) after stage 1
is implemented. If the patches are enhanced to do the "right thing"
(i.e. properly take into consideration the declared encoding, to
determine the end of a string), people won't notice the difference
compared to a full stage 2 implementation.
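The "end of a string" problem is concrete in encodings such as Shift_JIS, where the second byte of some characters is 0x5C, the ASCII backslash; a tokenizer that scans raw bytes without knowing the declared encoding misreads that byte as an escape character. A small illustration in modern Python:

```python
# '表' (U+8868) encodes in Shift_JIS as the two bytes 0x95 0x5C.
# The second byte equals the ASCII backslash, so a byte-level scanner
# that ignores the declared encoding would treat it as an escape.
ch = '\u8868'
encoded = ch.encode('shift_jis')
print(encoded.hex())     # 955c
print(b'\\' in encoded)  # True
```

This is exactly why a patch must honor the declared encoding before deciding where a string literal ends.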

> As for UTF-16 with BOM, any text outside Unicode literals should
> be translated into UTF-8 (not UTF-16).  It is the sole logical
> consequence in that UTF-8 is strictly ASCII-compatible and able
> to map all the characters in Unicode naturally.  

Well, no. If UTF-16 is supported as an input encoding in stage 2, it
will follow the same process as any other input encoding: the byte
string literals will be converted back to UTF-16. Any patch that
special-cases UTF-16 will be rejected.


> You will write
> source codes in UTF-16 as follows:
> 
> 	s = '<characters>'
> 	...
> 	u = unicode(s, 'utf-8')  # not utf-16!

No, that won't work. Instead, you *should* write

u = u'<characters>'

No need to call a function.
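The same contrast can still be shown directly in today's Python, where the u'' prefix remains legal:

```python
# The quoted post's approach: decode an encoded byte string by hand.
data = 'h\u00e9llo'.encode('utf-8')
u1 = data.decode('utf-8')

# The approach Martin recommends: a Unicode literal, no function call.
u2 = u'h\u00e9llo'
print(u1 == u2)  # True
```

With a proper coding declaration the interpreter does the decoding itself, so the literal form is both shorter and encoding-safe.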


> N.B. one should write a binary (not character, but, say, image
> or audio) data literal as follows:
> 
> 	b = '\x89\xAB\xCD\xEF'

I completely agree. Binary data should use hex escapes. That will make
an interesting challenge for any stage 2 implementation, BTW: \x89
shall denote byte 0x89 no matter what the input encoding was. So you
cannot convert \x89 to a Unicode character and expect conversion back
to the input encoding to do the right thing. Instead, you must apply
the conversion to the source encoding only for the unescaped
characters.
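An illustration with a modern bytes literal (the principle is the same): hex escapes pin down exact byte values, independent of the file's encoding.

```python
# Hex escapes denote fixed byte values, regardless of how the
# source file itself is encoded.
b = b'\x89\xab\xcd\xef'
print(list(b))  # [137, 171, 205, 239]

# By contrast, the *unescaped* characters of a literal are what the
# tokenizer must decode from the source encoding and re-encode.
```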

People have proposed introducing b'' strings for binary data, so that
'plain' strings could be switched to denote Unicode strings at some
point, but that is a matter for a different PEP.
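As it happened, b'' literals did later enter the language: PEP 3112 defined them for Python 3, and the prefix is also accepted from Python 2.6 on. For illustration:

```python
# A bytes literal, as later standardized: a sequence of raw byte values,
# never subject to the source file's character encoding.
b = b'\x89\xab\xcd\xef'
print(type(b) is bytes)  # True
print(len(b))            # 4
```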

Regards,
Martin


