I'm busy absorbing all the great feedback I got on the Unicode how-to. Thanks to all who've responded. As Aahz has said, the best way to learn something isn't to ask questions, it's to post an incorrect program (or in this case, text).

After reading Simon's Perl/Unicode course notes and Marc-André's Python/Unicode EuroPython slides, I formed a simple, seemingly obvious, hypothesis: When considering just ASCII data, plain Python strings should be more space efficient than Unicode strings.

I compared ps output for two interactive sessions. In the first, I executed this statement at the interpreter prompt:

    l = [u"abc%d"%i for i in xrange(1000000)]

In the second I executed this similar statement:

    l = ["abc%d"%i for i in xrange(1000000)]

ps showed that the interpreter consumed 57MB or so of virtual memory for the list of Unicode strings case, and a whopping 152MB for the list of plain strings case. Just to be sure I wasn't dreaming, I repeated the crude experiment. Same result.

I then looked at the typedefs for Unicode and string objects. The sizes of the two structs are approximately the same. There's certainly not a factor of three difference in the per-object overhead. I expect the raw Unicode buffer to refer to a chunk of memory that is roughly two times the size of the plain string version of the bytes because the internal representation is (I seem to recall from MAL's notes) UCS2.

It seemed the only thing that might be a problem was string interning, so based on the comment in stringobject.h about interning strings that "look like" Python identifiers, I tried one more time with strings that didn't look like identifiers (the automatic string interning would only happen for string literals anyway, right?):

    l = ["-bc%d"%i for i in xrange(1000000)]

Same result.

I ran these tests with a two-week-old build from CVS. I just tried it with a build from today using xrange(100000) and got similar, though obviously smaller, virtual memory sizes.

I must be missing something obvious, but what is it? Something about pymalloc?

Skip
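
P.S. For anyone who wants to repeat the comparison without relying on ps, here is a rough sketch. It assumes an interpreter that provides sys.getsizeof, which the builds discussed above do not, so it illustrates the measurement rather than reproducing these numbers. The helper name total_size is made up for this example. On a modern Python 3, bytes stands in for plain strings and str for Unicode strings. The figure it reports is only the sum of the object sizes, not the allocator overhead or fragmentation that ps would also show.

```python
import sys

def total_size(strings):
    """Shallow size of the list plus the shallow size of each element.

    This counts what the objects themselves report, not what the
    allocator actually reserved from the operating system.
    """
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in strings)

n = 100000

# Byte strings: the closest stand-in for "plain" strings.
plain = [("abc%d" % i).encode("ascii") for i in range(n)]

# Unicode strings with the same ASCII content.
text = [u"abc%d" % i for i in range(n)]

print("plain strings:   %d bytes" % total_size(plain))
print("unicode strings: %d bytes" % total_size(text))
```

If the two totals come out close while ps still shows a large gap, that would point at the allocator rather than at the objects themselves.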