Tim Peters wrote:
>
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
>
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage.  Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
>
> > Second, here is an emulation using strings and <default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> >     s = '%s %i abcäöü'  # a Latin-1 encoded string
> >     t = (u,3)
>
> What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
> string?  How does the following know the difference?

u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.

> >     # Convert Latin-1 s to a <default encoding> string via Unicode
> >     s1 = unicode(s,'latin-1').encode()
> >
> >     # The '%s' will now add u in <default encoding>
> >     s2 = s1 % t
> >
> >     # Finally, convert the <default encoding> encoded string to Unicode
> >     u1 = unicode(s2)
>
> I don't expect this actually works:  for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u.  I think % formatting has to know
> the truth of what you're doing.

Hmm, guess you're right... format parameters should indeed refer
to the number of characters rather than the number of encoding bytes.

This means a new PyUnicode_Format() implementation mapping
Unicode format objects to Unicode objects.

> > Note that .encode() defaults to the current setting of
> > <default encoding>.
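
The %4s objection above can be reproduced with a small sketch. It
assumes a present-day Python 3 interpreter (3.5 or later, where bytes
objects support %-formatting), not the interpreter under discussion
here: bytes stands in for the 8-bit encoded string and str for the
Unicode object. All variable names are illustrative only.

```python
# Assumption: Python 3.5+, where str is Unicode and bytes supports
# %-formatting. bytes plays the role of the 8-bit encoded string.
text = '\u00e4\u00f6'            # 'äö': 2 characters
encoded = text.encode('utf-8')   # 4 bytes, since each character needs 2

# Width counted in characters: two spaces of padding are added.
padded_chars = '%4s' % text

# Width counted in bytes: the 4 bytes already fill the field, so no
# padding is added and the decoded result is only 2 characters long.
padded_bytes = (b'%4s' % encoded).decode('utf-8')

# Precision counted in bytes: '.2' keeps 2 bytes, i.e. 1 character.
truncated_bytes = (b'%.2s' % encoded).decode('utf-8')

print(len(padded_chars), len(padded_bytes), len(truncated_bytes))
# prints: 4 2 1
```

The same format parameters give different results depending on whether
they count characters or encoding bytes, which is the reason for wanting
the formatting to operate on Unicode directly.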
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> >     u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')
>
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.

... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

    u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
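
For comparison with the mixed-argument question above, here is a sketch
of what a present-day Python 3 interpreter does with it. This is an
assumption-laden illustration, not a statement about how
PyUnicode_Format() behaves in the interpreter discussed in this thread:
in Python 3 an unprefixed literal is already Unicode, so a bytes object
stands in for the 8-bit "abc". The names are illustrative only.

```python
# Assumption: Python 3, where 'abc' is Unicode and b'abc' is the
# closest analogue of the 8-bit string in the example above.
fmt = '%s %s'
unicode_arg = 'abc'
bytes_arg = b'abc'

# '%s' calls str() on each argument; for bytes that yields its repr,
# so the bytes are not silently decoded.
result = fmt % (unicode_arg, bytes_arg)

# Decoding explicitly with a named encoding gives the plain text.
explicit = fmt % (unicode_arg, bytes_arg.decode('ascii'))

print(result)    # prints: abc b'abc'
print(explicit)  # prints: abc abc
```

In that interpreter the caller has to name the encoding, so no separate
format marker is involved in this sketch.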