Sat Jun 26 11:34:56 CEST 2010 · https://mail.python.org/pipermail/python-dev/2010-June/101108.html

Ian Bicking, 26.06.2010 00:26:
> On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum wrote:
>> On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz wrote:
>>> I'd like a version of 'decode' which would give me a type that was, in
>>> every respect, unicode, and responded to all protocols exactly as other
>>> unicode objects (or "str objects", if you prefer py3 nomenclature ;-))
>>> do, but wouldn't actually copy any of that memory unless it really needed
>>> to (for example, to pass to a C API that expected native wide characters),
>>> and that would hold on to the original bytes so that it could produce them
>>> on demand if encoded to the same encoding again. So, as others in this
>>> thread have mentioned, the 'ABC' really implies some stuff about C APIs as
>>> well.

Well, there's the buffer API, so you can already create something that 
refers to an existing C buffer. However, with respect to a string, you will 
have to make sure the underlying buffer doesn't get freed while the string 
is still in use. That will be hard and sometimes impossible to do at the 
C-API level, even if the string is allowed to keep a reference to something 
that holds the buffer.
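
A minimal Python-level sketch of that lifetime problem, assuming CPython's `bytearray` and `memoryview` as stand-ins for a C buffer and for the object that refers to it. It only shows the kind of guard that would be needed; it says nothing about how a string type would implement it at the C level.

```python
# Stand-in for a buffer owned by some other component.
data = bytearray(b"hello world")

# A zero-copy view onto that buffer (uses the buffer protocol).
view = memoryview(data)

try:
    # Resizing could move or free the underlying memory, so CPython
    # refuses to do it while the view is still exported.
    data.extend(b"!")
    resized_while_viewed = True
except BufferError:
    resized_while_viewed = False

# Once the view is released, the owner may change the buffer again.
view.release()
data.extend(b"!")
```

The guard works here because `bytearray` tracks its exports. A C library that frees its own `char*` buffers has no such bookkeeping unless it is added explicitly.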

At least in lxml, such a feature would be completely worthless, as text is 
never held by any ref-counted Python wrapper object. It's only part of the 
XML tree, which is allowed to change at (more or less) any time, so the 
underlying char* buffer could just get freed without further notice. Adding 
a guard against that would likely have a larger impact on the performance 
than the decoding operations.

>>> I'm not sure about the exact performance impact of such a class, which is
>>> why I'd like the ability to implement it *outside* of the stdlib and see
>>> how it works on a project, and return with a proposal along with some data.
>>> There are also different ways to implement this, and other optimizations
>>> (like ropes) which might be better.
>>> You can almost do this today, but the lack of things like the hypothetical
>>> "__rcontains__" does make it impossible to be totally transparent about it.
>>
>> But you'd still have to validate it, right? You wouldn't want to go on
>> using what you thought was wrapped UTF-8 if it wasn't actually valid
>> UTF-8 (or you'd be worse off than in Python 2). So you're really just
>> worried about space consumption. I'd like to see a lot of hard memory
>> profiling data before I got overly worried about that.
>
> It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically
> benchmarked ElementTree with all-unicode and sometimes-ascii-bytes, and
> found that using Python 2 strs in some cases provided notable advantages.  I
> know Stefan copied ElementTree in this regard in lxml, maybe he also did a
> benchmark or knows of one?

Actually, bytes vs. unicode doesn't make that big a difference in Py2 for 
lxml. ElementTree is a lot older, so I guess it made a larger difference 
when its code was written (and I even think I recall seeing numbers for 
lxml where it seemed to make a notable difference).

In lxml, text content is stored in the C tree of libxml2 as UTF-8 encoded 
char* text. On request, lxml creates a string object from it and returns 
it. In Py2, it checks for plain ASCII content first and returns a byte 
string for that. Only non-ASCII strings are returned as decoded unicode 
strings. In Py3, it always returns unicode strings.
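
The behaviour described above can be sketched roughly as follows. This is not lxml's actual code: `text_from_utf8` and its `py2_style` flag are hypothetical names, and the sketch is written for Python 3 (3.7 or later for `bytes.isascii`), where `bytes` plays the role of the Py2 byte string.

```python
def text_from_utf8(raw: bytes, py2_style: bool = True):
    """Hypothetical helper: build a string object from UTF-8 encoded data.

    With py2_style=True, plain ASCII content is returned as a byte string
    and only non-ASCII content is decoded. With py2_style=False, the
    result is always a decoded unicode string.
    """
    if py2_style and raw.isascii():
        return raw                  # plain ASCII: no decoding step
    return raw.decode("utf-8")      # everything else: decode to unicode


ascii_result = text_from_utf8(b"item-42")
non_ascii_result = text_from_utf8("caf\u00e9".encode("utf-8"))
py3_result = text_from_utf8(b"item-42", py2_style=False)
```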

When I run a little benchmark on lxml in Py2.6.5 that just reads some short 
text content from an Element object, I only see a tiny difference between 
unicode strings and byte strings. The gap obviously increases when the text 
gets longer, e.g. when I serialise the complete text content of an XML 
document to either a byte string or a unicode string. But even for 
documents in the megabyte range we are still talking about single 
milliseconds here, and the difference stays well below 10%. It's seriously 
hard to make that the performance bottleneck in an XML application.
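
A comparison of this kind can be approximated with the standard library alone. The sketch below is not the benchmark referred to above: it assumes a 1 MB plain ASCII payload held in a `bytearray` (a stand-in for a C buffer) and times copying it into a byte string against decoding it into a unicode string. Absolute numbers depend on the machine and the Python version.

```python
import timeit

# Assumed test data: about 1 MB of plain ASCII text.
buf = bytearray(b"x" * 1_000_000)

runs = 20

# Create a new byte string from the buffer (a plain copy).
copy_s = timeit.timeit(lambda: bytes(buf), number=runs) / runs

# Create a new unicode string from the buffer (copy plus UTF-8 decoding).
decode_s = timeit.timeit(lambda: buf.decode("utf-8"), number=runs) / runs

print(f"copy:   {copy_s * 1e3:.3f} ms per call")
print(f"decode: {decode_s * 1e3:.3f} ms per call")
```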

Also, since the string objects are only instantiated on request, memory 
isn't an issue either. That's different for (c)ElementTree again, where 
string content is stored as Python objects. Four times the size even for 
plain ASCII strings (e.g. numbers, IDs or even trailing whitespace!) can 
well become a problem there, and can easily dominate the overall size of 
the in-memory tree. Plain ASCII content is surprisingly common in XML 
documents.
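
The factor of four can be illustrated under one assumption that the message does not spell out: that unicode strings are stored with four bytes per character, as in a wide (UCS4) Py2 build. The sketch compares only the payload size of the two encodings and ignores per-object overhead.

```python
# Typical plain ASCII content found in XML documents.
samples = ["12345", "id-0042", "\n    "]

for text in samples:
    narrow = text.encode("ascii")       # 1 byte per character
    wide = text.encode("utf-32-le")     # 4 bytes per character, no BOM
    print(repr(text), len(narrow), len(wide))

# Ratio of wide to narrow payload size for each sample.
ratios = [len(s.encode("utf-32-le")) // len(s.encode("ascii")) for s in samples]
```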

Stefan

[Python-Dev] thoughts on the bytes/string discussion