[Python-Dev] bytes / unicode

P.J. Eby pje at telecommunity.com
Fri Jun 25 15:07:46 CEST 2010
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:
>P.J. Eby writes:
>
>  > This doesn't have to be in the functions; it can be in the
>  > *types*.  Mixed-type string operations have to do type checking and
>  > upcasting already, but if the protocol were open, you could make an
>  > encoded-bytes type that would handle the error checking.
>
>Don't you realize that "encoded-bytes" is equivalent to use of a very
>limited profile of ISO 2022 coding extensions?  Such as Emacs/MULE
>internal encoding or TRON code?  It has been tried.  It does not work.
>
>I understand how types can do such checking; my point is that the
>encoded-bytes type doesn't have enough information to do it in the
>cases where you think it is better than converting to str.  There are
>*no useful operations* that can be done on two encoded-bytes with
>different encodings unless you know the ultimate target codec.

I do know the ultimate target codec -- that's the point.

IOW, I want to be able to do all my operations by passing 
target-encoded strings to polymorphic functions.  Then, the moment 
something creeps in that won't encode to the target codec, I'll be 
able to track down the hole in the legacy code that's letting the 
bad data through.
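
A minimal sketch of this idea, using hypothetical names (no such
type exists in the stdlib): a bytes subclass that carries its target
codec and validates anything mixed into it, so unencodable data blows
up at the point it enters rather than at output time.

    class ebytes(bytes):
        # Hypothetical "encoded bytes": bytes plus a declared codec.
        def __new__(cls, data, encoding):
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self

        def __add__(self, other):
            if isinstance(other, str):
                # Raises UnicodeEncodeError *here*, at the source of
                # the bad data, if `other` won't fit the target codec.
                other = other.encode(self.encoding)
            elif isinstance(other, ebytes) and other.encoding != self.encoding:
                raise ValueError("mixed encodings: %s + %s"
                                 % (self.encoding, other.encoding))
            return ebytes(bytes(self) + bytes(other), self.encoding)

    s = ebytes("header: ".encode("shift_jis"), "shift_jis")
    s + "日本語"   # fine: encodable in shift_jis
    s + "café"     # UnicodeEncodeError, right where it creeps in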


>   The
>only sensible way to define the concatenation of ('ascii', 'English')
>with ('euc-jp','日本語') is something like ('ascii', 'English',
>'euc-jp','日本語'), and *not* ('euc-jp','English日本語'), because you
>don't know that the ultimate target codec is 'euc-jp'-compatible.
>Worse, you need to build in all the information about which codecs are
>mutually compatible into the encoded-bytes type.  For example, if the
>ultimate target is known to be 'shift_jis', it's trivially compatible
>with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't
>have.
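
A sketch of the scheme that quote describes, again with hypothetical
names: with no target codec known yet, "concatenation" can only
accumulate (encoding, data) segments, and the real join is deferred
until a target is finally chosen.

    segments = [("ascii", b"English"),
                ("euc-jp", "日本語".encode("euc-jp"))]

    def flatten(segments, target):
        # Decode each chunk with its declared codec and re-encode for
        # the target; raises UnicodeEncodeError if any segment can't
        # be represented in the target codec.
        return b"".join(data.decode(enc).encode(target)
                        for enc, data in segments)

    flatten(segments, "shift_jis")    # works: both segments fit
    flatten(segments, "iso-8859-15")  # latin-9: UnicodeEncodeError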

The interaction won't be with other encoded bytes; it'll be with 
other *unicode* strings: ones coming from other code, and literals 
embedded in the stdlib.



>No, the problem is not with the Unicode, it is with the code that
>allows characters not encodable with the target codec.

And precisely which code that is may be very difficult to find, 
unless I can identify it at the first point where it enters (and 
corrupts) my output data.  When dealing with a large code base, 
this is a nontrivial problem.

