Showing content from http://mail.python.org/pipermail/python-dev/attachments/20110128/d8200947/attachment-0001.html below:

Pardon me for this drive-by posting, but this thread smells a lot like this old thread (don't be afraid to read it all, there are some good points in there; not directed at you Martin, but at all readers/posters in this thread)...<div>
 </div><div><a href="http://mail.python.org/pipermail/python-3000/2006-September/003795.html">http://mail.python.org/pipermail/python-3000/2006-September/003795.html</a></div><div> </div><div>I'm not averse to faster and/or more memory-efficient Unicode representations (I would be quite happy with them, actually). I do see the usefulness of having non-UTF-8 representations, and caching them is a good idea, though I wonder whether that is "good for Python itself to cache" or "good for the application to cache".</div>
<div> </div><div>The evil side of me says that we should just provide an API available in Python/C for "give me the representation of unicode string X using the 2byte/4byte code points", and have it just return the appropriate array.array() value (useful for passing to other APIs, or for those who need to do manual manipulation of code-points), or whatever structure is deemed to be appropriate.</div>
<div> </div><div>The less evil side of me says that going with what the PEP offers isn't a bad idea, and might just be a good idea.</div><div> </div><div>I'll defer my vote to Martin.</div><div> </div><div>
Regards,</div><div> - Josiah</div><div> <div class="gmail_quote">On Mon, Jan 24, 2011 at 12:17 PM, "Martin v. Löwis" <<a href="mailto:martin@v.loewis.de">martin@v.loewis.de</a>> wrote: 
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">I have been thinking about Unicode representation for some time now. 
This was triggered, on the one hand, by discussions with Glyph Lefkowitz 
(who complained that his server app consumes too much memory), and Carl 
Friedrich Bolz (who profiled Python applications to determine that 
Unicode strings are among the top consumers of memory in Python). 
On the other hand, this was triggered by the discussion on supporting 
surrogates in the library better. 
 
I'd like to propose PEP 393, which takes a different approach, 
addressing both problems simultaneously: by getting a flexible 
representation (one that can be either 1, 2, or 4 bytes), we can 
support the full range of Unicode on all systems, but still use 
only one byte per character for strings that are pure ASCII (which 
will be the majority of strings for the majority of users). 
 
You'll find the PEP at 
 
<a href="http://www.python.org/dev/peps/pep-0393/" target="_blank">http://www.python.org/dev/peps/pep-0393/</a> 
 
For convenience, I include it below. 
 
Regards, 
Martin 
 
PEP: 393 
Title: Flexible String Representation 
Version: $Revision: 88168 $ 
Last-Modified: $Date: 2011-01-24 21:14:21 +0100 (Mo, 24. Jan 2011) $ 
Author: Martin v. Löwis <<a href="mailto:martin@v.loewis.de">martin@v.loewis.de</a>> 
Status: Draft 
Type: Standards Track 
Content-Type: text/x-rst 
Created: 24-Jan-2010 
Python-Version: 3.3 
Post-History: 
 
Abstract 
======== 
 
The Unicode string type is changed to support multiple internal 
representations, depending on the character with the largest Unicode 
ordinal (1, 2, or 4 bytes). This will allow a space-efficient 
representation in common cases, but give access to full UCS-4 on all 
systems. For compatibility with existing APIs, several representations 
may exist in parallel; over time, this compatibility should be phased 
out. 
 
Rationale 
========= 
 
There are two classes of complaints about the current implementation 
of the unicode type: on systems only supporting UTF-16, users complain 
that non-BMP characters are not properly supported. On systems using 
UCS-4 internally (and also sometimes on systems using UCS-2), there is 
a complaint that Unicode strings take up too much memory - especially 
compared to Python 2.x, where the same code would often use ASCII 
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while strings containing non-BMP characters can still be indexed
efficiently (they will use 4 bytes per character).
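
To make the memory argument concrete, here is a minimal sketch of how the
per-character width could follow from the largest code point in a string.
The helper names char_width and payload_bytes are hypothetical and are not
part of the PEP; the sketch only counts the character payload plus the null
terminator and ignores the object header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes per character for a string whose largest
   code point is maxchar (1 for Latin-1, 2 for the BMP, 4 otherwise). */
static size_t char_width(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);   /* upper limit of the Unicode range */
    if (maxchar < 0x100)
        return 1;
    if (maxchar < 0x10000)
        return 2;
    return 4;
}

/* Payload size in bytes for n characters plus one null terminator. */
static size_t payload_bytes(size_t n, uint32_t maxchar)
{
    return (n + 1) * char_width(maxchar);
}
```

Under these assumptions a five-character ASCII string needs 6 bytes of
payload, while the same length with a single non-BMP character needs 24.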
 
One problem with the approach is support for existing applications 
(e.g. extension modules). For compatibility, redundant representations 
may be computed. Applications are encouraged to phase out reliance on 
a specific internal representation if possible. As interaction with 
other libraries will often require some sort of internal 
representation, the specification chooses UTF-8 as the recommended way 
of exposing strings to C code. 
 
For many strings (e.g. ASCII), multiple representations may actually 
share memory (e.g. the shortest form may be shared with the UTF-8 form 
if all characters are ASCII). With such sharing, the overhead of 
compatibility representations is reduced. 
 
Specification 
============= 
 
The Unicode object structure is changed to this definition:: 
 
 typedef struct { 
 PyObject_HEAD 
 Py_ssize_t length; 
 void *str; 
 Py_hash_t hash; 
 int state; 
 Py_ssize_t utf8_length; 
 void *utf8; 
 Py_ssize_t wstr_length; 
 void *wstr; 
 } PyUnicodeObject; 
 
These fields have the following interpretations: 
 
- length: number of code points in the string (result of sq_length) 
- str: shortest-form representation of the unicode string; the lower 
 two bits of the pointer indicate the specific form: 
 01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4); 
 00 => null pointer 
 
 The string is null-terminated (in its respective representation). 
- hash, state: same as in Python 3.2 
- utf8_length, utf8: UTF-8 representation (null-terminated) 
- wstr_length, wstr: representation in platform's wchar_t 
 (null-terminated). If wchar_t is 16-bit, this form may use surrogate 
 pairs (in which case wstr_length differs from length). 
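
The idea of storing the form in the lower two bits of the str pointer can
be sketched as follows. This is an illustration only, assuming the tag
values 01, 10 and 11 for the 1-, 2- and 4-byte forms and a data buffer that
is at least 4-byte aligned; the FORM_* constants and the three helper
functions are hypothetical names, not API from the PEP.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed tag values; 0 is reserved for the null pointer. */
enum { FORM_NONE = 0, FORM_1BYTE = 1, FORM_2BYTE = 2, FORM_4BYTE = 3 };

/* Store the form in the two low bits of an aligned data pointer. */
static void *tag_pointer(void *data, unsigned form)
{
    uintptr_t bits = (uintptr_t)data;
    assert((bits & 3u) == 0);          /* low bits must be free */
    assert(form >= 1 && form <= 3);
    return (void *)(bits | form);
}

/* Recover the form from a tagged pointer. */
static unsigned pointer_form(const void *tagged)
{
    return (unsigned)((uintptr_t)tagged & 3u);
}

/* Recover the real data address by masking out the tag bits. */
static void *pointer_data(const void *tagged)
{
    return (void *)((uintptr_t)tagged & ~(uintptr_t)3);
}
```

The sketch relies on the common behaviour that converting a pointer to
uintptr_t and back preserves the address, and on malloc returning memory
aligned to at least 4 bytes.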
 
All three representations are optional, although the str form is 
considered the canonical representation which can be absent only 
while the string is being created. 
 
The Py_UNICODE type is still supported but deprecated. It is always 
defined as a typedef for wchar_t, so the wstr representation can double 
as Py_UNICODE representation. 
 
The str and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient). The str 
and wstr pointers point to the same memory if the string happens to 
fit exactly to the wchar_t type of the platform (i.e. uses some 
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some 
non-BMP characters if sizeof(wchar_t) is 4). 
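
The two sharing conditions above can be written as predicates on the
largest code point in the string. This is a sketch under the stated rules,
not code from the PEP: the function names are hypothetical, and the size of
wchar_t is passed as a parameter so that both the 2-byte and the 4-byte
case can be exercised on any platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bytes of the shortest form that can hold maxchar. */
static size_t canonical_width(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);
    return maxchar < 0x100 ? 1 : maxchar < 0x10000 ? 2 : 4;
}

/* str and utf8 can share memory only for pure ASCII: Latin-1 characters
   at or above 0x80 take two bytes in UTF-8 but one in the str form. */
static bool shares_with_utf8(uint32_t maxchar)
{
    return maxchar < 0x80;
}

/* str and wstr can share memory only if the shortest form has exactly
   the width of wchar_t (wchar_size would be sizeof(wchar_t)). */
static bool shares_with_wstr(uint32_t maxchar, size_t wchar_size)
{
    return canonical_width(maxchar) == wchar_size;
}
```

For example, a string whose largest character is U+20AC would share str
and wstr where wchar_t has 2 bytes, but not where it has 4.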
 
If the string is created directly with the canonical representation 
(see below), this representation doesn't take a separate memory block, 
but is allocated right after the PyUnicodeObject struct. 
 
String Creation 
--------------- 
 
The recommended way to create a Unicode object is to use the function 
PyUnicode_New:: 
 
 PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar); 
 
Both parameters must denote the eventual size and range of the string. 
In particular, codecs using this API must compute both the number of 
characters and the maximum character in advance. A string is 
allocated according to the specified size and character range and is 
null-terminated; the actual characters in it may be uninitialized. 
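
As an illustration of the pre-computation a codec would need, the sketch
below scans UTF-8 input once to obtain the two values that PyUnicode_New
expects. It assumes the input is already known to be well-formed UTF-8 and
performs no validation; utf8_scan is a hypothetical helper, not a function
proposed by the PEP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the code points in n bytes of well-formed UTF-8 and find the
   largest one. Malformed input is not detected by this sketch. */
static void utf8_scan(const unsigned char *s, size_t n,
                      size_t *count, uint32_t *maxchar)
{
    size_t i = 0;
    *count = 0;
    *maxchar = 0;
    while (i < n) {
        unsigned char lead = s[i];
        uint32_t ch;
        size_t len;
        /* The lead byte determines the sequence length. */
        if (lead < 0x80)      { ch = lead;        len = 1; }
        else if (lead < 0xE0) { ch = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { ch = lead & 0x0F; len = 3; }
        else                  { ch = lead & 0x07; len = 4; }
        assert(i + len <= n);          /* sequence must not be truncated */
        /* Each continuation byte contributes six payload bits. */
        for (size_t k = 1; k < len; k++)
            ch = (ch << 6) | (s[i + k] & 0x3F);
        if (ch > *maxchar)
            *maxchar = ch;
        (*count)++;
        i += len;
    }
}
```

A decoder built this way would make two passes: one to call utf8_scan, and
a second to fill the string returned by PyUnicode_New(count, maxchar).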
 
PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported 
for processing UTF-8 input; the input is decoded, and the UTF-8 
representation is not yet set for the string. 
 
PyUnicode_FromUnicode remains supported but is deprecated. If the 
Py_UNICODE pointer is non-null, the str representation is set. If the 
pointer is NULL, a properly-sized wstr representation is allocated, 
which can be modified until PyUnicode_Finalize() is called (explicitly 
or implicitly). Resizing a Unicode string remains possible until it 
is finalized. 
 
PyUnicode_Finalize() converts a string containing only a wstr 
representation into the canonical representation. Unless wstr and str 
can share the memory, the wstr representation is discarded after the 
conversion. 
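
The conversion step can be pictured as a narrowing copy. The sketch below
is not the proposed implementation: it assumes a 4-byte wide
representation, modelled here as a uint32_t array, so that no surrogate
handling is needed, and narrow_ucs4 is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Copy n code points into the narrowest buffer that can hold them.
   Returns a null-terminated buffer allocated with calloc (the caller
   frees it) and stores the element width (1, 2 or 4) in *width.
   Returns NULL if the allocation fails. */
static void *narrow_ucs4(const uint32_t *src, size_t n, size_t *width)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++)
        if (src[i] > maxchar)
            maxchar = src[i];
    assert(maxchar <= 0x10FFFF);

    *width = maxchar < 0x100 ? 1 : maxchar < 0x10000 ? 2 : 4;

    /* calloc zero-fills, so the terminator is already in place. */
    void *dst = calloc(n + 1, *width);
    if (dst == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        if (*width == 1)
            ((uint8_t *)dst)[i] = (uint8_t)src[i];
        else if (*width == 2)
            ((uint16_t *)dst)[i] = (uint16_t)src[i];
        else
            ((uint32_t *)dst)[i] = src[i];
    }
    return dst;
}
```

When the computed width equals the source width, a real implementation
could keep the original buffer instead of copying, which corresponds to
the memory-sharing case mentioned above.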
 
String Access 
------------- 
 
The canonical representation can be accessed using two macros, 
PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the 
values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE 
(3). PyUnicode_Data gives the void pointer to the data, masking out 
the kind bits. Both macros call PyUnicode_Finalize in case the 
canonical representation hasn't been computed yet. 
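
Code that consumes the canonical representation would typically dispatch
on the kind before indexing the data pointer. The following is a sketch of
that pattern with stand-in names: KIND_1BYTE, KIND_2BYTE, KIND_4BYTE and
read_char are hypothetical and merely mirror the values 1, 2 and 3 given
above; the kind and data arguments stand for the results of PyUnicode_Kind
and PyUnicode_Data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kind values described in the text. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Read the code point at index from a buffer of the given kind.
   The caller is responsible for keeping index within bounds. */
static uint32_t read_char(unsigned kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE:
        return ((const uint8_t *)data)[index];
    case KIND_2BYTE:
        return ((const uint16_t *)data)[index];
    case KIND_4BYTE:
        return ((const uint32_t *)data)[index];
    default:
        assert(!"unknown kind");
        return 0;
    }
}
```

Because every form stores exactly one code point per element, indexing
stays a constant-time operation regardless of the kind.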
 
A new function PyUnicode_AsUTF8 is provided to access the UTF-8 
representation. It is thus identical to the existing 
_PyUnicode_AsString, which is removed. The function will compute the 
utf8 representation when first called. Since this representation will 
consume memory until the string object is released, applications 
should use the existing PyUnicode_AsUTF8String where possible 
(which generates a new string object every time). APIs that implicitly 
convert a string to a char* (such as the ParseTuple functions) will 
use this function to compute the conversion. 
 
PyUnicode_AsUnicode is deprecated; it computes the wstr representation 
on first use. 
 
String Operations 
----------------- 
 
Various convenience functions will be provided to deal with the 
canonical representation, in particular with respect to concatenation 
and slicing. 
 
Stable ABI 
---------- 
 
None of the functions in this PEP become part of the stable ABI. 
 
Copyright 
========= 
 
This document has been placed in the public domain. 
</blockquote></div> </div>
