Showing content from http://mail.python.org/pipermail/python-dev/attachments/20110128/d8200947/attachment-0001.html below:
Pardon me for this drive-by posting, but this thread smells a lot like this old thread (don't be afraid to read it all, there are some good points in there; not directed at you Martin, but at all readers/posters in this thread)...<div>
<br></div><div><a href="http://mail.python.org/pipermail/python-3000/2006-September/003795.html">http://mail.python.org/pipermail/python-3000/2006-September/003795.html</a></div><div><br></div><div>I'm not averse to faster and/or more memory-efficient Unicode representations (I would be quite happy with them, actually). I do see the usefulness of having non-UTF-8 representations, and caching them is a good idea, though I wonder whether that caching is better done by Python itself or by the application.</div>
<div><br></div><div>The evil side of me says that we should just provide an API available in Python/C for "give me the representation of unicode string X using the 2byte/4byte code points", and have it just return the appropriate array.array() value (useful for passing to other APIs, or for those who need to do manual manipulation of code-points), or whatever structure is deemed to be appropriate.</div>
<div><br></div><div>The less evil side of me says that going with what the PEP offers isn't a bad idea, and might just be a good idea.</div><div><br></div><div>I'll defer my vote to Martin.</div><div><br></div><div>
Regards,</div><div> - Josiah</div><div><br><div class="gmail_quote">On Mon, Jan 24, 2011 at 12:17 PM, "Martin v. Löwis" <span dir="ltr"><<a href="mailto:martin@v.loewis.de">martin@v.loewis.de</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">I have been thinking about Unicode representation for some time now.<br>
This was triggered, on the one hand, by discussions with Glyph Lefkowitz<br>
(who complained that his server app consumes too much memory), and Carl<br>
Friedrich Bolz (who profiled Python applications to determine that<br>
Unicode strings are among the top consumers of memory in Python).<br>
On the other hand, this was triggered by the discussion on supporting<br>
surrogates in the library better.<br>
<br>
I'd like to propose PEP 393, which takes a different approach,<br>
addressing both problems simultaneously: by getting a flexible<br>
representation (one that can be either 1, 2, or 4 bytes), we can<br>
support the full range of Unicode on all systems, but still use<br>
only one byte per character for strings that are pure ASCII (which<br>
will be the majority of strings for the majority of users).<br>
<br>
You'll find the PEP at<br>
<br>
<a href="http://www.python.org/dev/peps/pep-0393/" target="_blank">http://www.python.org/dev/peps/pep-0393/</a><br>
<br>
For convenience, I include it below.<br>
<br>
Regards,<br>
Martin<br>
<br>
PEP: 393<br>
Title: Flexible String Representation<br>
Version: $Revision: 88168 $<br>
Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $<br>
Author: Martin v. Löwis <<a href="mailto:martin@v.loewis.de">martin@v.loewis.de</a>><br>
Status: Draft<br>
Type: Standards Track<br>
Content-Type: text/x-rst<br>
Created: 24-Jan-2010<br>
Python-Version: 3.3<br>
Post-History:<br>
<br>
Abstract<br>
========<br>
<br>
The Unicode string type is changed to support multiple internal<br>
representations, depending on the character with the largest Unicode<br>
ordinal (1, 2, or 4 bytes). This will allow a space-efficient<br>
representation in common cases, but give access to full UCS-4 on all<br>
systems. For compatibility with existing APIs, several representations<br>
may exist in parallel; over time, this compatibility should be phased<br>
out.<br>
<br>
Rationale<br>
=========<br>
<br>
There are two classes of complaints about the current implementation<br>
of the unicode type: on systems only supporting UTF-16, users complain<br>
that non-BMP characters are not properly supported. On systems using<br>
UCS-4 internally (and also sometimes on systems using UCS-2), there is<br>
a complaint that Unicode strings take up too much memory - especially<br>
compared to Python 2.x, where the same code would often use ASCII<br>
strings (i.e. ASCII-encoded byte strings). With the proposed approach,<br>
ASCII-only Unicode strings will again use only one byte per character,<br>
while still allowing efficient indexing of strings containing non-BMP<br>
characters (as strings containing them will use 4 bytes per<br>
character).<br>
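<br>
The memory argument can be made concrete with a small sketch. The function<br>
names below are hypothetical and not part of the proposal; the thresholds<br>
assume the 1-byte form covers Latin-1, as described in the Specification,<br>
and the sizes count only the character payload plus its null terminator,<br>
not the object header.<br>
<br>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes per code point needed for a string whose
   largest code point is max_char. */
static size_t sketch_char_width(uint32_t max_char)
{
    assert(max_char <= 0x10FFFF);   /* upper bound of the Unicode range */
    if (max_char <= 0xFF)
        return 1;                   /* fits the 1-byte (Latin-1) form */
    if (max_char <= 0xFFFF)
        return 2;                   /* fits the 2-byte (UCS-2) form */
    return 4;                       /* needs the 4-byte (UCS-4) form */
}

/* Hypothetical helper: payload size in bytes for `length` code points,
   including one null terminator in the same width. */
static size_t sketch_payload_bytes(size_t length, uint32_t max_char)
{
    return (length + 1) * sketch_char_width(max_char);
}
```
<br>
Under these assumptions a 10-character ASCII string needs 11 payload bytes,<br>
whereas the same string stored with 4 bytes per character needs 44.<br>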
<br>
One problem with the approach is support for existing applications<br>
(e.g. extension modules). For compatibility, redundant representations<br>
may be computed. Applications are encouraged to phase out reliance on<br>
a specific internal representation if possible. As interaction with<br>
other libraries will often require some sort of internal<br>
representation, the specification chooses UTF-8 as the recommended way<br>
of exposing strings to C code.<br>
<br>
For many strings (e.g. ASCII), multiple representations may actually<br>
share memory (e.g. the shortest form may be shared with the UTF-8 form<br>
if all characters are ASCII). With such sharing, the overhead of<br>
compatibility representations is reduced.<br>
<br>
Specification<br>
=============<br>
<br>
The Unicode object structure is changed to this definition::<br>
<br>
  typedef struct {<br>
      PyObject_HEAD<br>
      Py_ssize_t length;<br>
      void *str;<br>
      Py_hash_t hash;<br>
      int state;<br>
      Py_ssize_t utf8_length;<br>
      void *utf8;<br>
      Py_ssize_t wstr_length;<br>
      void *wstr;<br>
  } PyUnicodeObject;<br>
<br>
These fields have the following interpretations:<br>
<br>
- length: number of code points in the string (result of sq_length)<br>
- str: shortest-form representation of the unicode string; the lower<br>
two bits of the pointer indicate the specific form:<br>
01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);<br>
00 => null pointer<br>
<br>
The string is null-terminated (in its respective representation).<br>
- hash, state: same as in Python 3.2<br>
- utf8_length, utf8: UTF-8 representation (null-terminated)<br>
- wstr_length, wstr: representation in platform's wchar_t<br>
(null-terminated). If wchar_t is 16-bit, this form may use surrogate<br>
pairs (in which case wstr_length differs from length).<br>
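<br>
The tagging of the str pointer can be sketched as follows. This is only an<br>
illustration: the names are hypothetical, the tag values are assumed to match<br>
the PyUnicode_Kind constants (1, 2, 3), and the buffer is assumed to be at<br>
least 4-byte aligned so that its two low bits are free.<br>
<br>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values for the two low bits of the str pointer. */
enum {
    SKETCH_KIND_NULL  = 0,   /* 00: null pointer */
    SKETCH_KIND_1BYTE = 1,   /* 01: Latin-1 */
    SKETCH_KIND_2BYTE = 2,   /* 10: UCS-2 */
    SKETCH_KIND_4BYTE = 3    /* 11: UCS-4 */
};

#define SKETCH_TAG_MASK ((uintptr_t)3)

/* Combine a buffer pointer and a kind into one tagged pointer. */
static void *sketch_tag(void *data, unsigned kind)
{
    uintptr_t bits = (uintptr_t)data;
    assert((bits & SKETCH_TAG_MASK) == 0);   /* low bits must be free */
    assert(kind <= SKETCH_KIND_4BYTE);
    return (void *)(bits | kind);
}

/* Recover the kind from the two low bits. */
static unsigned sketch_kind(const void *tagged)
{
    return (unsigned)((uintptr_t)tagged & SKETCH_TAG_MASK);
}

/* Recover the real buffer address by masking out the tag. */
static void *sketch_data(const void *tagged)
{
    return (void *)((uintptr_t)tagged & ~SKETCH_TAG_MASK);
}
```
<br>
A tagged pointer must always be passed through the masking step before it is<br>
dereferenced; a plain null pointer naturally reports kind 0.<br>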
<br>
All three representations are optional, although the str form is<br>
considered the canonical representation which can be absent only<br>
while the string is being created.<br>
<br>
The Py_UNICODE type is still supported but deprecated. It is always<br>
defined as a typedef for wchar_t, so the wstr representation can double<br>
as Py_UNICODE representation.<br>
<br>
The str and utf8 pointers point to the same memory if the string uses<br>
only ASCII characters (using only Latin-1 is not sufficient). The str<br>
and wstr pointers point to the same memory if the string happens to<br>
fit exactly to the wchar_t type of the platform (i.e. uses some<br>
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some<br>
non-BMP characters if sizeof(wchar_t) is 4).<br>
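<br>
The sharing rules can be illustrated with a short sketch (hypothetical helper<br>
names, not proposed API). Latin-1 is not sufficient for sharing with the UTF-8<br>
form because every character in the range 0x80-0xFF occupies two bytes in<br>
UTF-8, so the two buffers are no longer byte-for-byte identical.<br>
<br>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: size in bytes of the UTF-8 encoding of a
   Latin-1 buffer (terminator not counted). */
static size_t sketch_utf8_size_of_latin1(const unsigned char *s, size_t length)
{
    size_t total = 0;
    for (size_t i = 0; i < length; i++)
        total += (s[i] < 0x80) ? 1 : 2;   /* non-ASCII Latin-1 needs 2 bytes */
    return total;
}

/* The 1-byte form can double as the UTF-8 form only when every
   character is ASCII, i.e. when both encodings have the same size. */
static bool sketch_can_share_utf8(const unsigned char *s, size_t length)
{
    return sketch_utf8_size_of_latin1(s, length) == length;
}

/* The canonical form can double as the wstr form only when its
   character width equals the platform's wchar_t width. */
static bool sketch_can_share_wstr(size_t str_char_width)
{
    return str_char_width == sizeof(wchar_t);
}
```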
<br>
If the string is created directly with the canonical representation<br>
(see below), this representation doesn't take a separate memory block,<br>
but is allocated right after the PyUnicodeObject struct.<br>
<br>
String Creation<br>
---------------<br>
<br>
The recommended way to create a Unicode object is to use the function<br>
PyUnicode_New::<br>
<br>
PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);<br>
<br>
Both parameters must denote the eventual size/range of the string.<br>
In particular, codecs using this API must compute both the number of<br>
characters and the maximum character in advance. A string is<br>
allocated according to the specified size and character range and is<br>
null-terminated; the actual characters in it may be uninitialized.<br>
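<br>
For a codec, computing both values in advance amounts to a pre-scan of the<br>
input. A minimal sketch for UTF-8 input follows; the function name is<br>
hypothetical, and the input is assumed to be well-formed UTF-8 (no validation<br>
is performed, which a real decoder could not skip).<br>
<br>
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-scan: count the code points in a well-formed UTF-8
   buffer and find the largest one. */
static void sketch_scan_utf8(const unsigned char *s, size_t size,
                             size_t *count, uint32_t *max_char)
{
    size_t i = 0;
    *count = 0;
    *max_char = 0;
    while (i < size) {
        unsigned char lead = s[i++];
        uint32_t cp;
        size_t extra;   /* number of continuation bytes that follow */
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        /* Each continuation byte contributes its low six bits. */
        for (; extra > 0 && i < size; extra--)
            cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);
        if (cp > *max_char)
            *max_char = cp;
        (*count)++;
    }
}
```
<br>
The two results would then be passed as the size and maxchar arguments of<br>
PyUnicode_New, and a second pass over the input would fill in the characters.<br>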
<br>
PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported<br>
for processing UTF-8 input; the input is decoded, and the UTF-8<br>
representation is not yet set for the string.<br>
<br>
PyUnicode_FromUnicode remains supported but is deprecated. If the<br>
Py_UNICODE pointer is non-null, the str representation is set. If the<br>
pointer is NULL, a properly-sized wstr representation is allocated,<br>
which can be modified until PyUnicode_Finalize() is called (explicitly<br>
or implicitly). Resizing a Unicode string remains possible until it<br>
is finalized.<br>
<br>
PyUnicode_Finalize() converts a string containing only a wstr<br>
representation into the canonical representation. Unless wstr and str<br>
can share the memory, the wstr representation is discarded after the<br>
conversion.<br>
<br>
String Access<br>
-------------<br>
<br>
The canonical representation can be accessed using two macros,<br>
PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the<br>
values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE<br>
(3). PyUnicode_Data gives the void pointer to the data, masking out<br>
the pointer kind. All these functions call PyUnicode_Finalize<br>
in case the canonical representation hasn't been computed yet.<br>
<br>
A new function PyUnicode_AsUTF8 is provided to access the UTF-8<br>
representation. It is thus identical to the existing<br>
_PyUnicode_AsString, which is removed. The function will compute the<br>
utf8 representation when first called. Since this representation will<br>
consume memory until the string object is released, applications<br>
should use the existing PyUnicode_AsUTF8String where possible<br>
(which generates a new string object every time). APIs that implicitly<br>
convert a string to a char* (such as the ParseTuple functions) will<br>
use this function to compute a conversion.<br>
<br>
PyUnicode_AsUnicode is deprecated; it computes the wstr representation<br>
on first use.<br>
<br>
String Operations<br>
-----------------<br>
<br>
Various convenience functions will be provided to deal with the<br>
canonical representation, in particular with respect to concatenation<br>
and slicing.<br>
<br>
Stable ABI<br>
----------<br>
<br>
None of the functions in this PEP become part of the stable ABI.<br>
<br>
Copyright<br>
=========<br>
<br>
This document has been placed in the public domain.<br>
</blockquote></div><br></div>