> Ok, let's remove the buffer API from unicode(). Should it still be
> maintained for unicode(obj, encoding, errors) ? I think so, yes.

> Hmm, perhaps we do need a __unicode__/tp_unicode slot after all.
> It would certainly help clarify the communication between the
> interpreter and the object.

Would you settle for a __unicode__ method but no tp_unicode slot?
It's easy enough to define a C method named __unicode__ if the need
arises. This should always be tried first, not just for classic
instances. Adding a slot is a bit painful now that there are so many
new slots already (adding it to the end means you have to add tons of
zeros; adding it to the middle means I have to edit every file).

> > To convert one of these to Unicode given an encoding, shouldn't
> > their decode() method be used?
>
> Right... perhaps we don't need __unicode__ after all: the .decode()
> method already provides this functionality (on strings at least).

So maybe we should deprecate unicode(obj, encoding[, error]) and
recommend obj.decode(encoding[, error]) instead. But this means that
objects with a buffer API but no decode() method cannot efficiently
be decoded. That's what unicode(obj, encoding[, error]) was good for.
To decide, we need to know how useful it is in practice to be able to
decode buffers -- I doubt it is very useful, since most types
supporting the buffer API are not text but raw data like
memory-mapped files, arrays, and PIL images.

> > Really, this is such an incredible morass of APIs that I wonder
> > if we shouldn't start over... There are altogether too many
> > places in the code where PyUnicode_Check() is used. I wish there
> > was a better way...
>
> Ideally, we'd need a new base class for strings and then have 8-bit
> and Unicode be subclasses of this base class. There are several
> problems with this approach though; one certainly being the
> different memory allocation mechanisms used (strings store the
> value in the object, Unicode references an external buffer), the
> other being their different nature: strings don't carry
> meta-information, while Unicode is in many ways restricted in use.

I've thought of defining an abstract base class "string" from which
both str and unicode derive. Since str and unicode don't share
representation, they shouldn't share implementation, but they could
still share interface. Certainly conceptually this is how we think
of strings.

Useless thought: the string class would have unbound methods that are
almost the same as the functions defined in the string module, e.g.
string.split(s) and string.strip(s) could be made to call s.split()
and s.strip(), just like the module. The class could have data
attributes for string.whitespace etc. But string.join() would have a
different signature: the class method is join(s, list) while the
function is join(list, s). So we can't quite make the module an alias
for the class. :-(

> I would like to boil this down to one API if possible which then
> implements unicode(obj) and unicode(obj, encoding, errors) -- if no
> encoding is given, the semantics of PyObject_Str() are closely
> followed; with an encoding, the semantics of
> PyUnicode_FromEncodedObject() as it was before are used (with the
> buffer interface logic removed).

I would actually recommend using two different C level APIs:
PyObject_Unicode() to implement unicode(obj), which should follow
str(obj), and PyUnicode_FromEncodedObject() to implement
unicode(obj, encoding[, error]), which should use the buffer API on
obj.
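[A minimal Python 2 sketch of the trade-off discussed above; this is
illustrative, not from the original message. The two spellings are
equivalent for strings, but an object that only supports the buffer
API has no decode() method, so only the unicode(obj, encoding) form
can decode it directly.]

    import array

    s = '\xc3\xa9'                          # UTF-8 encoding of u'\xe9'

    # Equivalent for str objects; the proposal is to recommend the
    # second spelling over the first:
    assert unicode(s, 'utf-8') == u'\xe9'
    assert s.decode('utf-8') == u'\xe9'

    # An array supports the buffer API but has no decode() method:
    a = array.array('c', s)
    assert unicode(a, 'utf-8') == u'\xe9'   # decoded via the buffer API
    # a.decode('utf-8')                     # would raise AttributeError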
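[A concrete Python 2 illustration of the string-module signature
mismatch mentioned above; again illustrative, not from the message.
split and strip line up with the unbound-method idea, but join's
arguments are swapped between the function and the method.]

    import string

    words = ['a', 'b', 'c']

    # These pairs line up:
    assert string.split('a b c') == 'a b c'.split()
    assert string.strip('  hi  ') == '  hi  '.strip()

    # join does not: the function is join(list, sep), the method is
    # sep.join(list):
    assert string.join(words, '-') == 'a-b-c'
    assert '-'.join(words) == 'a-b-c'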
> In a first step, I'd use tp_str/__str__ for unicode(obj) as well.
> Later we can add a tp_unicode/__unicode__ lookup before trying
> tp_str/__str__ as a fallback.

I would add __unicode__ support without tp_unicode right away. I
would use tp_str without even looking at __str__.

> If this sounds reasonable, I'll give it a go...

Yes.

--Guido van Rossum (home page: http://www.python.org/~guido/)
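[A rough Python 2 sketch of the lookup order proposed above; the
function name is illustrative, and the real implementation would live
in PyObject_Unicode() at the C level. Note that str() here is only an
approximation of the fallback: at the C level it would read the
type's tp_str slot directly rather than consulting __str__.]

    def object_unicode(obj):
        # Prefer a __unicode__ hook (classic and new-style instances
        # alike), then fall back to str() semantics.
        hook = getattr(obj, '__unicode__', None)
        if hook is not None:
            return hook()
        return unicode(str(obj))   # decoded with the default encoding

    class Euro:
        def __unicode__(self):
            return u'\u20ac'
        def __str__(self):
            return 'EUR'

    assert object_unicode(Euro()) == u'\u20ac'   # __unicode__ wins
    del Euro.__unicode__
    assert object_unicode(Euro()) == u'EUR'      # falls back to str()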