Martin v. Löwis wrote:

>> We do need to extend the API between the stream codec
>> and the encode/decode functions, no doubt about that.
>> However, this is an extension that is well hidden from
>> the user of the codec and won't break code.
>
> So you agree to the part of Walter's change that introduces
> new C functions (PyUnicode_DecodeUTF7Stateful etc)?
>
> I think most of the patch can be discarded: there is no
> need for .encode and .decode to take an additional argument.

But then a file that contains the two bytes 0x61, 0xc3
will never generate an error when read via a UTF-8 reader.
The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method
to the StreamReader that checks whether all bytes have been
consumed. Maybe this should be done by StreamReader.close()?

> It is only necessary that the StreamReader and StreamWriter
> are stateful, and that only for a selected subset of codecs.
>
> Marc-Andre, if the original patch (diff.txt) was applied:
> What *specific* change in that patch would break code?
> What *specific* code (C or Python) would break under that
> change?
>
> I believe the original patch can be applied as-is, and
> does not cause any breakage.

The first version has a broken implementation of the
UTF-7 decoder. When decoding the byte sequence "+-" in
two calls to decode() (i.e. pass "+" in one call and "-"
in the next), no character got generated, because inShift
(as a flag) couldn't remember whether characters were
encountered between the "+" and the "-". Now inShift
counts the number of characters (and the shortcut for a
"+-" sequence appearing together has been removed).

> It also introduces a change
> between the codec and the encode/decode functions that is
> well hidden from the user of the codec.

Would a version of the patch without a final argument
but with a feed() method be accepted?
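
To make the two decoding cases above concrete (the dangling 0xc3 and the "+-" split across two calls), here is a minimal sketch. It uses codecs.getincrementaldecoder() from current Python, which is not the API proposed in this patch; the helper name decode_chunks is made up for illustration. The explicit final=True call plays the role of the "have all bytes been consumed?" check.

```python
import codecs


def decode_chunks(encoding, chunks):
    """Feed byte chunks to a stateful decoder.

    Returns (text, error): error is the UnicodeDecodeError raised by the
    final check if undecoded bytes are left over, otherwise None.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    # Intermediate calls keep incomplete sequences buffered in the decoder.
    pieces = [decoder.decode(chunk) for chunk in chunks]
    try:
        # final=True tells the decoder that no more input will follow.
        pieces.append(decoder.decode(b"", final=True))
        error = None
    except UnicodeDecodeError as exc:
        error = exc
    return "".join(pieces), error


# 0x61 is "a"; 0xc3 starts a two-byte sequence that never completes.
text, error = decode_chunks("utf-8", [b"\x61\xc3"])
print(repr(text), error)

# "+-" split over two calls: the decoder has to remember the "+".
print(decode_chunks("utf-7", [b"+", b"-"]))
```

Without the final call the first example would return "a" and silently drop the trailing byte, which is the problem described above.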

I'm imagining implementing an XML parser that uses Python's
unicode machinery and supports the
xml.sax.xmlreader.IncrementalParser interface.

With a feed() method in the stream reader this is rather simple:

    init()
    {
        PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
        self.reader = PyObject_CallObject(reader, NULL);
    }

    int feed(char *bytes)
    {
        parse(PyObject_CallMethod(self.reader, "feed", "s", bytes));
    }

The feed method itself is rather simple (see the second
version of the patch).

Without the feed() method, we need the following:

1) A StreamQueue class that
   a) supports writing at one end and reading at the other end
   b) has a method for pushing back unused bytes to be returned
      in the next call to read()

2) A StreamQueueWrapper class that
   a) gets passed the StreamReader factory in the constructor,
      creates a StreamQueue instance, puts it into an attribute
      and passes it to the StreamReader factory (which must also
      be put into an attribute).
   b) has a feed() method that calls write() on the stream queue
      and read() on the stream reader and returns the result

Then the C implementation of the parser looks something like this:

    init()
    {
        PyObject *module = PyImport_ImportModule("whatever");
        PyObject *wclass = PyObject_GetAttrString(module, "StreamQueueWrapper");
        PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL);
        self.wrapper = PyObject_CallFunctionObjArgs(wclass, reader, NULL);
    }

    int feed(char *bytes)
    {
        parse(PyObject_CallMethod(self.wrapper, "feed", "s", bytes));
    }

I find this neither easier to implement nor easier to explain.

Bye,
   Walter Dörwald
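
For reference, the StreamQueue / StreamQueueWrapper pair described in points 1) and 2) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the method name pushback() is invented here, and the sketch runs against the StreamReader in current Python, which keeps undecoded bytes in its own buffer, so pushback() is shown but never called.

```python
import codecs


class StreamQueue:
    """Byte FIFO: write() appends at one end, read() consumes from the other."""

    def __init__(self):
        self._buffer = b""

    def write(self, data):
        self._buffer += data

    def read(self, size=-1):
        # A negative size means "everything currently queued".
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def pushback(self, data):
        # Hypothetical name: return unused bytes so the next read() sees them first.
        self._buffer = data + self._buffer


class StreamQueueWrapper:
    """Adds a feed() method on top of a StreamReader factory."""

    def __init__(self, reader_factory):
        self.queue = StreamQueue()
        self.reader_factory = reader_factory
        self.reader = reader_factory(self.queue)

    def feed(self, data):
        # Queue the new bytes, then decode whatever is complete so far.
        self.queue.write(data)
        return self.reader.read()


wrapper = StreamQueueWrapper(codecs.getreader("utf-8"))
first = wrapper.feed(b"a\xc3")   # the incomplete 0xc3 is held back
second = wrapper.feed(b"\xa4")   # completes the two-byte sequence
print(repr(first), repr(second))
```

Even in this reduced form it takes two extra classes to get what a single feed() method on the stream reader would provide.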