The buffer interface is one of the most misunderstood parts of Python. I believe that if it were PEPped today, it would have a hard time getting accepted in its current form. There are also two different parts that are commonly referred to by this name: the "buffer API", which is a C-only API, and the "buffer object", which has both a C API and a Python API. Both were largely proposed, implemented and extended by others, and I have to admit that I'm still uneasy with defending them, especially the buffer object. Both are extremely implementation-dependent (in JPython, neither makes much sense).

The Buffer API
--------------

The C-only buffer API was originally intended to allow efficient binary I/O from and (in some cases) to large objects that have a relatively well-understood underlying memory representation. Examples of such objects include strings, array module arrays, memory-mapped files, NumPy arrays, and PIL objects. It was created with the desire to avoid an expensive memory-copy operation when reading or writing large arrays.

For example, if you have an array object containing several million double precision floating point numbers, and you want to dump it to a file, you might prefer to do the I/O directly from the array's memory buffer rather than first copying it to a string. (You lose portability of the data, but that's often not a problem the user cares about in these cases.)

An alternative solution for this particular problem was considered: object types in need of this kind of efficient I/O could define their own I/O methods, thereby allowing them to hide their internal representation. This was implemented in some cases (e.g. the array module has read() and write() methods) but rejected, because a simple-minded implementation of this approach would not work with "file-like" objects (e.g. StringIO files).
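The core idea is easiest to see from the Python side, using today's descendants of this machinery; this is a sketch assuming a modern Python (the array module and binary files still honor the buffer interface), not the classic C-only API itself:

```python
import array
import tempfile

# An array of doubles -- a small stand-in for the "several million
# doubles" case described above.
a = array.array("d", [0.1 * i for i in range(1000)])

with tempfile.TemporaryFile() as f:
    # write() consults the buffer interface, so the bytes come
    # straight out of the array's internal memory -- no intermediate
    # string/bytes copy is built.
    f.write(a)

    # The writable variant: readinto() fills the target object's own
    # memory buffer directly.
    f.seek(0)
    b = array.array("d", bytes(len(a) * a.itemsize))
    f.readinto(b)

assert a == b
```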
It was deemed important that file-like objects would not place restrictions on the kind of objects that could interact with them (compared to real file objects). A possible solution would have been to require that each object implementing its own read and write methods should support both efficient I/O to/from "real" file objects and fall-back I/O to/from "file-like" objects. The fall-back I/O would have to convert the object's data to a string object which would then be passed to the write() method of the file-like object. This approach was rejected because it would make it impossible to implement an alternative file object that would be as efficient as the real file object, since large object I/O would be using the inefficient fallback interface.

To address these issues, we decided to define an interface that would let I/O operations ask the objects where their data bytes are in memory, so that the I/O can go directly to/from the memory allocated by the object. This is the classic buffer API. It has a read-only and a writable variant -- the writable variant is for mutable objects that will allow I/O directly into them.

Because we expected that some objects might have an internal representation distributed over a (small) number of separately allocated pieces of memory, we also added the getsegcount() API. All objects that I know support the buffer API return a segment count of 1, and most places that use the buffer API give up if the segment count is larger; so this may be considered an unnecessary generalization (and source of complexity).

The buffer API has found significant use in a way that wasn't originally intended: as a sort of informal common base class for string-like objects in situations where a char[] or char* type must be passed (in a read-only fashion) to C code. This is in fact the most common use of the buffer API now, and appears to be the reason why the segment count must typically be 1.
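The shape of the classic API can be mimicked from Python for illustration. The real thing is a C-level table of function pointers (PyBufferProcs); the method names below are invented stand-ins, not an actual interface:

```python
# A toy rendering of the classic buffer API shape: a read-only getter,
# a writable getter, and a segment count.  All names here are
# illustrative only.
class ClassicBufferLike:
    def __init__(self, data: bytearray):
        self._data = data

    def getsegcount(self):
        # Every known implementation returns 1, as the text notes.
        return 1

    def getreadbuffer(self, segment):
        # Hand out a read-only view of the underlying memory.
        assert segment == 0
        return memoryview(self._data).toreadonly()

    def getwritebuffer(self, segment):
        # The writable variant, for I/O directly into the object.
        assert segment == 0
        return memoryview(self._data)


obj = ClassicBufferLike(bytearray(b"hello"))
obj.getwritebuffer(0)[:5] = b"HELLO"      # write straight into the object
assert bytes(obj.getreadbuffer(0)) == b"HELLO"
```

Note that both getters expose the object's own memory; nothing is copied, which is the whole point of the design.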
In connection with this, the buffer API has grown a distinction between character and binary buffers (on the read-only end only). This may have been a mistake; it was intended to help with Unicode but it ended up not being used.

The Buffer Object
-----------------

The buffer object has a much less clear reason for its existence. When Greg Stein first proposed it, he wrote:

    The intent of this type is to expose a string-like interface from
    an object that supports the buffer interface (without making a
    copy). In addition, it is intended to support slices of the target
    object.

    My eventual goal here is to tweak the file object to support
    memory mapping and the buffer interface. The buffer object can
    then return slices of the file without making a new copy. Next
    step: change marshal.c, ceval.c, and compile.c to support a buffer
    for the co_code attribute. Net result is that copies of code
    streams don't need to be copied onto the heap, but can be left in
    an mmap'd file or a frozen file. I'm hoping there will be some
    perf gains (time and memory).

    Even without some of the co_code work, enabling mmap'd files and
    buffers onto them should be very useful. I can probably rattle off
    a good number of other uses for the buffer type.

I don't think that any of these benefits have been realized yet, and altogether I think that the buffer object causes a lot of confusion. The buffer *API* doesn't guarantee enough about the lifetime of the pointers for the buffer *object* to be able to safely preserve those pointers, even if the buffer object holds on to the base object. (The C-level buffer API informally guarantees that the data remains valid only until you do anything to the base object; this is usually fine as long as you don't release the global interpreter lock.)

The buffer object's approach to implementing the various sequence operations is strange: sometimes it behaves like a string, sometimes it doesn't. E.g.
a slice returns a new string object unless it happens to address the whole buffer, in which case it returns a reference to the existing buffer object. It would seem more logical that a subslice would return a new buffer object. Concatenation and repetition of buffer objects are likewise implemented inconsistently; it would have been more consistent with the intended purpose if these weren't supported at all (i.e. if none of the buffer object operations would allocate new memory except for buffer object headers).

I would have concluded that the buffer object is entirely useless, if it weren't for some very light use that is being made of it by the Unicode machinery. I can't quite tell whether that was done just because it was convenient, or whether that shows there is a real need.

What Now?
---------

I'm not convinced that we need the buffer object at all. For example, the mmap module defines a sequence object, so it doesn't seem to need the buffer object to help it support slices.

Regarding the buffer API, it's clearly useful, although I'm not convinced that it needs the multiple segment count option or the char vs. binary buffer distinction, given that we're not using this for Unicode objects as we originally planned. I also feel that it would be helpful if there were an explicit way to lock and unlock the data, so that a file object can release the global interpreter lock while it is doing the I/O. But that's not a high priority (and there are no *actual* problems caused by the lack of such an API -- just *theoretical* ones).

For Python 3000, I think I'd like to rethink this whole mess. Perhaps byte buffers and character strings should be different beasts, and maybe character strings could have Unicode and 8-bit subclasses (and maybe other subclasses that explicitly know about their encoding). And maybe we'd have a real file base class. And so on.

What to do in the short run?
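Before answering that, the mmap point above is easy to see concretely; a short sketch, assuming a modern Python (the file contents here are arbitrary):

```python
import mmap
import tempfile

# mmap objects implement the sequence protocol themselves, so slicing
# needs no helper object -- and the mapping is mutable in place.
with tempfile.TemporaryFile() as f:
    f.write(b"hello, buffer world")
    f.flush()
    with mmap.mmap(f.fileno(), 0) as m:
        word = bytes(m[7:13])     # a slice, straight off the mapping
        m[0:5] = b"HELLO"         # same-length in-place mutation
        head = bytes(m[:5])

assert word == b"buffer"
assert head == b"HELLO"
```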
I'm still for severely simplifying the buffer object (ripping out the unused operations) and deprecating it.

--Guido van Rossum (home page: http://www.python.org/~guido/)