Fri Jun 25 23:02:05 CEST 2010 · https://mail.python.org/pipermail/python-dev/2010-June/101080.html

On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz
<glyph at twistedmatrix.com> wrote:
>
> On Jun 24, 2010, at 4:59 PM, Guido van Rossum wrote:
>
> > Regarding the proposal of a String ABC, I hope this isn't going to
> > become a backdoor to reintroduce the Python 2 madness of allowing
> > equivalency between text and bytes for *some* strings of bytes and not
> > others.
>
> For my part, what I want out of a string ABC is simply the ability to do
> application-specific optimizations.
>
> There are many applications where all input and output is text, but _must_
> be UTF-8.  Even GTK uses UTF-8 as its native text representation, so
> "output" could just be display.
>
> Right now, in Python 3, the only way to be "correct" about this is to copy
> every byte of input into 4 bytes of output, then copy each code point *back*
> into a single byte of output.  If all your application does is rewrite the
> occasional XML attribute, for example, this cost can be significant, if not
> overwhelming.
>
> I'd like a version of 'decode' which would give me a type that was, in every
> respect, unicode, and responded to all protocols exactly as other
> unicode objects (or "str objects", if you prefer py3 nomenclature ;-)) do,
> but wouldn't actually copy any of that memory unless it really needed to
> (for example, to pass to a C API that expected native wide characters), and
> that would hold on to the original bytes so that it could produce them on
> demand if encoded to the same encoding again. So, as others in this thread
> have mentioned, the 'ABC' really implies some stuff about C APIs as well.
>
> I'm not sure about the exact performance impact of such a class, which is
> why I'd like the ability to implement it *outside* of the stdlib and see how
> it works on a project, and return with a proposal along with some data.
>
> There are also different ways to implement this, and other optimizations
> (like ropes) which might be better.
>
> You can almost do this today, but the lack of things like the hypothetical
> "__rcontains__" does make it impossible to be totally transparent about it.
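
A rough sketch of the kind of wrapper described above, to make the
"__rcontains__" gap concrete. The class name LazyUTF8Text and its methods
are made up for illustration and are not an existing API, and the sketch
deliberately skips validation of the incoming bytes.

```python
import codecs


class LazyUTF8Text:
    """Hypothetical wrapper: holds UTF-8 bytes and decodes only on demand."""

    def __init__(self, raw: bytes):
        self._raw = raw      # original bytes, kept so they can be handed back
        self._text = None    # decoded str, built lazily

    def _decoded(self) -> str:
        if self._text is None:
            self._text = self._raw.decode("utf-8")  # the copy happens here
        return self._text

    def encode(self, encoding: str = "utf-8", errors: str = "strict") -> bytes:
        # Same encoding as the source: return the original bytes untouched.
        if codecs.lookup(encoding).name == "utf-8":
            return self._raw
        return self._decoded().encode(encoding, errors)

    def __str__(self) -> str:
        return self._decoded()

    def __contains__(self, item) -> bool:
        # Handles "needle in wrapper".
        return str(item) in self._decoded()


raw = "<a href='x'>caf\u00e9</a>".encode("utf-8")
wrapped = LazyUTF8Text(raw)

same_bytes = wrapped.encode("utf-8") is raw    # no decode, no copy
left_ok = "href" in wrapped                    # dispatches to __contains__

try:
    LazyUTF8Text(b"href") in "<a href='x'>"    # str.__contains__ gets the wrapper
    right_ok = True
except TypeError:
    right_ok = False                           # no reflected hook to fall back on
```

The last check fails because the built-in str decides membership on its
own and rejects a left operand that is not a str; there is no reflected
method the wrapper could implement to take part in that test.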

But you'd still have to validate it, right? You wouldn't want to go on
using what you thought was wrapped UTF-8 if it wasn't actually valid
UTF-8 (or you'd be worse off than in Python 2). So you're really just
worried about space consumption. I'd like to see a lot of hard memory
profiling data before I got overly worried about that.
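
A minimal sketch of both points, assuming a hypothetical helper name
is_valid_utf8 (not a stdlib function): validating in chunks so that no
full-size str is built, and a crude size comparison as a starting point
for the kind of measurement asked for. The sample data is made up.

```python
import codecs
import sys


def is_valid_utf8(raw: bytes, chunk_size: int = 64 * 1024) -> bool:
    """Check that raw is well-formed UTF-8 without keeping the decoded text."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for start in range(0, len(raw), chunk_size):
            # Each decoded chunk is discarded immediately; the decoder
            # carries incomplete multi-byte sequences over to the next call.
            decoder.decode(raw[start:start + chunk_size])
        # final=True makes a sequence truncated at the very end an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


sample = ("<item name='caf\u00e9'/>" * 1000).encode("utf-8")
text = sample.decode("utf-8")

# Object sizes in bytes; the numbers depend on the interpreter version
# and build, so measure on the target platform rather than assuming.
print("bytes object:", sys.getsizeof(sample))
print("str object:  ", sys.getsizeof(text))
```

sys.getsizeof only reports the size of each object in isolation; peak
memory during a decode/encode round trip would need a real profiler.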

-- 
--Guido van Rossum (python.org/~guido)


[Python-Dev] thoughts on the bytes/string discussion