On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:

> >>>>> "Gareth" == Gareth McCaughan <gmccaughan at synaptics-uk.com> writes:
>
>     Gareth> That said, I strongly agree that all textual data should
>     Gareth> be Unicode as far as the developer is concerned; but, at
>     Gareth> least in the USA :-), it makes sense to have an optimized
>     Gareth> representation that saves space for ASCII-only text, just
>     Gareth> as we have an optimized representation for small integers.
>
> This is _not at all_ obvious.  As MAL just pointed out, if efficiency
> is a goal, text algorithms often need to be different for operations
> on texts that are dense in an 8-bit character space, vs texts that
> are sparse in a 16-bit or 20-bit character space.  Note that that is
> what </F> is talking about too; he points to SRE and ElementTree.

I hope you aren't expecting me to disagree.

> When viewed from that point of view, the subtext to </F>'s comment is
> "I don't want to separately maintain 8-bit versions of new text
> facilities to support my non-Unicode applications, I want to impose
> that burden on the authors of text-handling PEPs."  That may very
> well be the best thing for Python; as </F> has done a lot of Unicode
> implementation for Python, he's in a good position to make such
> judgements.  But the development costs MAL refers to are bigger than
> you are estimating, and will continue as long as that policy does.

How do you know what I am estimating?

> While I'm very sympathetic to </F>'s view that there's more than one
> way to skin a cat, and a good cat-handling design should account for
> that, and conceding his expertise, none-the-less I don't think that
> Python really wants to _maintain_ more than one text-processing
> system by default.  Of course if you restrict yourself to the class
> of ASCII-only strings, you can do better, and of course that is a
> huge class of strings.  But that, as such, is important only to
> efficiency fanatics.

No, it's important to ... well, people to whom efficiency matters.
There's no need for them to be fanatics.

> The question is, how often are people going to notice that when they
> have pure ASCII they get a 100% speedup, or that they actually can
> just suck that 3GB ASCII file into their 4GB memory, rather than
> buffering it as 3 (or 6) 2GB Unicode strings?  Compare how often
> people are going to notice that a new facility "just works" for
> Japanese or Hindi.

Why is that the question, rather than "how often are people going to
benefit from getting a 100% speedup when they have pure ASCII"?  Or
even "how often are people going to try out Python on an application
that uses pure-ASCII strings, and decide to use some other language
that seems to do the job much faster"?

> I just don't see the former being worth the extra effort, while the
> latter makes the "this or that" choice clear.  If a single
> representation is enough, it had better be Unicode-based, and the
> others can be supported in libraries (which turn binary blobs into
> non-standard text objects with appropriate methods) as the need
> arises.

No question that if a single representation is enough then it had
better be Unicode.

-- 
g
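
[Archival postscript: CPython did eventually adopt the kind of
width-optimized representation Gareth argues for here, via PEP 393
(the "flexible string representation") in Python 3.3.  The sketch
below, runnable on any CPython 3.3+ build, shows the space effect
being debated; the printed byte counts are approximate, since
per-object header overhead varies by interpreter version and
platform.]

    import sys

    # Under PEP 393, pure-ASCII text is stored at one byte per
    # character, so this string costs roughly its character count
    # plus a small fixed header.
    ascii_text = "hello" * 1000              # 5,000 ASCII characters

    # A single character outside Latin-1 widens the whole string to
    # two bytes per character (four, if any character falls outside
    # the Basic Multilingual Plane).
    wide_text = "\u20ac" + "hello" * 1000    # EURO SIGN + 5,000 chars

    print(sys.getsizeof(ascii_text))   # ~5,049 bytes on a 64-bit build
    print(sys.getsizeof(wide_text))    # ~10,076 bytes: roughly double

The same mechanism is what lets a pure-ASCII file occupy memory at
close to its on-disk size when read as text, rather than doubling or
quadrupling -- essentially the 3GB-file scenario discussed above.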