[Python-Dev] email package status in 3.X

P.J. Eby pje at telecommunity.com
Mon Jun 21 20:46:57 CEST 2010

At 02:58 AM 6/22/2010 +0900, Stephen J. Turnbull wrote:
>Nick alluded to The One Obvious Way as a change in architecture.
>
>Specifically: Decode all bytes to typed objects (str, images, audio,
>structured objects) at input.  Do no manipulations on bytes ever
>except decode and encode (both to text, and to special-purpose objects
>such as images) in a program that does I/O.

This ignores the existence of use cases where what you have is text 
that can't be properly encoded in unicode.  I know, it's a hard thing 
to wrap one's head around, since on the surface it sounds like 
unicode is the programmer's savior.  Unfortunately, real-world text 
data exists which cannot be safely roundtripped to unicode, and must 
be handled in "bytes with encoding" form for certain operations.
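
A rough sketch of the problem (the byte values and codec names here are 
just illustrative): once you decode with a best-guess codec and a lossy 
error handler, the original bytes are gone for good.

    raw = b"Subject: \x87\x40\x87\x41"                 # NEC circled digits: valid cp932, not plain shift_jis
    text = raw.decode("shift_jis", errors="replace")   # wrong guess -> U+FFFD replacement characters
    back = text.encode("shift_jis", errors="replace")  # re-encode -> b"?" where the real bytes used to be
    print(back == raw)                                 # False: the round trip destroyed information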

I personally do not have to deal with this *particular* use case any 
more -- I haven't been at NTT/Verio for six years now.  But I do know 
it exists for e.g. Asian language email handling, which is where I 
first encountered it.  At the time (this *may* have changed), many 
popular email clients did not actually support unicode, so you 
couldn't necessarily just send off an email in UTF-8.  It drove us 
nuts on the project where this was involved (an i18n of an existing 
Python app), and I think we had to compromise a bit in some fashion 
(because we couldn't really avoid unicode roundtripping due to 
database issues), but the use case does actually exist.

My current needs are simpler, thank goodness.  ;-)  However, they 
*do* involve situations where I'm dealing with *other* 
encoding-restricted legacy systems, such as software for interfacing 
with the US Postal Service that only works with a restricted subset 
of latin1, while receiving mangled ASCII from an ecommerce provider, 
and storing things in what's effectively a latin-1 database.  Being 
able to easily assert what kind of bytes I've got would actually let 
me catch errors sooner, *if* those assertions were being checked when 
different kinds of strings or bytes were being combined (i.e., at 
coercion time).
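
Just as a sketch of the kind of thing I mean (the names are made up, not 
a proposal for an actual API), imagine bytes tagged with the encoding 
they're known to be in, with the check happening when values get combined:

    class ebytes(bytes):
        """Bytes tagged with the encoding they are known to be in."""
        def __new__(cls, data, encoding):
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self
        def __add__(self, other):
            enc = getattr(other, "encoding", None)
            if enc is not None and enc != self.encoding:
                raise ValueError("mixing %s data with %s data" % (self.encoding, enc))
            return ebytes(bytes(self) + bytes(other), self.encoding)

    address = ebytes(b"123 MAIN ST", "latin-1")   # from the USPS interface
    order = ebytes(b"caf\xc3\xa9", "utf-8")       # from the ecommerce provider
    address + order   # raises ValueError instead of silently producing mojibake

That ValueError at concatenation time is exactly the kind of early 
failure I'm talking about.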


>Yes, this is tedious if you live in an ASCII world, compared to using
>bytes as characters.  However, it works for the rest of us, which the
>old style doesn't.

I'm not trying to go back to the old style -- ideally, I want 
something that would actually improve on the "it's not really 
unicode" use cases above if it were available in 2.x.

I don't want to be "encoding agnostic" or "encoding implicit" -- I 
want to make it possible to be even *more* explicit and restrictive 
than it is currently possible to be in either 2.x OR 3.x.  It's just 
that 3.x affords greater opportunity for doing this, and is an ideal 
place to make the switch -- i.e., at a point where you now have to 
get explicit about your encodings, anyway!


>As for "Think Carefully About It Every Time", that is required only in
>Porting Programs That Mix Operation On Bytes With Operation On Str.
>If you write programs from scratch, however, the decode-process-encode
>paradigm quickly becomes second nature.

Which works if and only if your outputs are truly unicode-able.  If 
you work with legacy systems (e.g. those Asian email clients and US 
postal software), you are really working with a *character set*, not 
unicode, and so putting your data in unicode form is actually *wrong* 
-- an expedient lie.
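
The constraint is easy to hit even with perfectly well-formed text (a 
made-up one-liner, but you get the idea):

    name = "Łukasz"          # perfectly valid unicode
    name.encode("utf-8")     # b'\xc5\x81ukasz' -- no problem
    name.encode("latin-1")   # UnicodeEncodeError: the legacy target charset can't represent it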

Heresy, I know, but there you go.  ;-)
