Neil Hodgson wrote:
>
> Paul Prescod:
> <PEP: 261>
>
> The problem I have with this PEP is that it is a compile time option
> which makes it hard to work with both 32 bit and 16 bit strings in one
> program. Can not the 32 bit string type be introduced as an additional type?

The two solutions are not mutually exclusive. If you (or someone)
supplies a 32-bit type and Guido accepts it, then the compile option
might fall into disuse. But this solution was chosen because it is much
less work. Really though, I think that having 16-bit and 32-bit types is
extra confusion for very little gain. I would much rather have a single
space-efficient type that hid the details of its implementation. But
nobody has volunteered to code it and Guido might not accept it even if
someone did.

>...
> This wasn't usefully true in the past for DBCS strings and is not the
> right way to think of either narrow or wide strings now. The idea that
> strings are arrays of characters gets in the way of dealing with many
> encodings and is the primary difficulty in localising software for Japanese.

The whole benefit of moving to 32-bit character strings is to allow
people to think of strings as arrays of characters. Forcing them to
consider variable-length encodings is precisely what we are trying to
avoid.

> Iteration through the code units in a string is a problem waiting to bite
> you and string APIs should encourage behaviour which is correct when faced
> with variable width characters, both DBCS and UTF style. Iteration over
> variable width characters should be performed in a way that preserves the
> integrity of the characters.

On wide Python builds there is no such thing as variable width Unicode
characters. It doesn't make sense to combine two 32-bit characters to
get a 64-bit one. On narrow Python builds you might want to treat a
surrogate pair as a single character but I would strongly advise against
it. If you want wide characters, move to a wide build. Even if a narrow
build is more space efficient, you'll lose a ton of performance
emulating wide characters in Python code.

> ... M.-A. Lemburg's proposed set of iterators could
> be extended to indicate encoding "for c in s.asCharacters('utf-8')" and to
> provide for the various intended string uses such as "for c in
> s.inVisualOrder()" reversing the receipt of right-to-left substrings.

A floor wax and a dessert topping. <0.5 wink>

I don't think that the average Python programmer would want
s.asCharacters('utf-8') when they already have s.decode('utf-8'). We
decided a long time ago that the model for standard users would be
fixed-length (1!), abstract characters. That's the way Python's Unicode
subsystem has always worked.

--
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook
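
The narrow versus wide distinction discussed above can be illustrated with a rough sketch. It assumes a modern Python 3 interpreter, where strings are always indexed by code point, and it only simulates what a narrow build would have stored by encoding to UTF-16. The helper name utf16_code_units is hypothetical and does not come from the thread or from PEP 261.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units that a UTF-16 (narrow-style) storage
    of `text` would contain. Hypothetical helper, for illustration only."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# U+10000 is the first code point outside the Basic Multilingual Plane.
s = "a\U00010000b"

# Wide-style view: one fixed-width element per character.
code_points = [ord(c) for c in s]

# Narrow-style view: the non-BMP character becomes a surrogate pair,
# so iterating over the units sees four elements instead of three.
code_units = utf16_code_units(s)

print(len(code_points), [hex(cp) for cp in code_points])
print(len(code_units), [hex(cu) for cu in code_units])
```

In the wide-style view each character is one element, which is the "strings as arrays of characters" model. In the narrow-style view, code that wants whole characters has to recombine 0xD800 and 0xDC00 itself.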