I'm back from vacation. Comments on the thread and a list of open issues
are below.

Guido van Rossum wrote:

> M.-A. Lemburg wrote:
> > Walter has written a pretty good test suite for the patch
> > and I have a good feeling about it. I'd like Walter to check
> > it into CVS and then see whether the alpha tests bring up any
> > quirks. The patch only touches the codecs and adds some new
> > exceptions. There are no other changes involved.
> >
> > I think that together with PEP 263 (source code encoding) this
> > is a great step forward in Python's i18n capabilities.
> >
> > BTW, the test script contains some examples of how to put the
> > error callbacks to use:
> >
> > http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=27815&aid=432401
>
> Sounds like a plan then.

Does this mean we can check in the patch? Documentation is still
missing, and encoding-specific decoding tests should be added to the
test script.

Has anybody except me and Marc-André tried the patch? On anything
other than Linux/Intel? With UCS2 and UCS4?

Martin v. Loewis wrote:

> If you look at the large blocks of new code, you find that it is in
>
> - charmap_encoding_error, which insists on implementing known error
>   handling algorithms inline,

This is done for performance reasons.

> - the default error handlers, of which at least
>   PyCodec_XMLCharRefReplaceErrors should be pure-Python

The PyCodec_XMLCharRefReplaceErrors functionality is independent of
the rest, so moving this to Python won't reduce complexity that much.
And it will slow down "xmlcharrefreplace" handling for those codecs
that don't implement it inline.

> - PyCodec_BackslashReplaceErrors, likewise,
>
> - the UnicodeError exception methods (which could be omitted, IMO).

Those methods were implemented so that we can easily move to new style
exceptions. The exception attributes can then be members of the C
struct and the accessor functions can be simple macros.

I guess some of the methods could be removed by moving duplicate ones
to the base class UnicodeError, but this would break backwards
compatibility.

Oren Tirosh wrote:

> Some of my reservations about PEP 293:
>
> It overloads the meaning of the error handling argument in an unintuitive
> way. It gets to the point where it's much more than just error handling -
> it's actually extending the functionality of the codec.
>
> Why implement yet another name-based registry? There must be a simpler way
> to do it.

The registry is name-based because this is required by the current
C API. Passing the error handler directly as a function object would
be simpler, but this can't be done, as it would require vast changes
to the C API (an old version of the patch did that.) And this way we
gain the benefit of implementing well-known error handling names
inline.

It is "yet another" registry exactly because encoding and error
handling are completely orthogonal (at least for encoding). If you add
a new error handler, all codecs can use it (as long as they are aware
of the new error handling way), and if you define a new codec, it will
work with all existing error handlers.

> Generating an exception for each character that isn't handled by simple
> lookup probably adds quite a lot of overhead.

1. All encoders try to collect runs of unencodable characters to
   minimize the number of calls to the callback.

2. The PEP explicitly states that the codec is allowed to reuse the
   exception object. All codecs do this, so the exception object will
   only be created once (at most; when no error occurs, no exception
   object will be created).

The exception object is just a quick way to pass information between
the codec and the error handler, and it could become even faster as
soon as we get new style exceptions.

> What are the use cases? Maybe a simple extension to charmap would be enough
> for all the practical cases?

Not all codecs are charmap based.

Open issues:

1. For each error handler two Python function objects are created:
   one in the registry and a different one in the codecs module.
   This means that e.g.
      codecs.lookup_error("replace") != codecs.replace_errors
   We can fix that by making the name of the Python function object
   globally visible, or by changing the codecs init function to do
   a lookup and use the result, or simply by removing
   codecs.replace_errors.

2. Currently charmap encoding uses a safe way of reallocating string
   storage, which tests available space on each output. This slows
   charmap encoding down a bit. This should probably be changed back
   to the old way: test available space only for output strings
   longer than one character.

3. Error reporting logic in the exception attribute setters/getters
   may be non-standard. What is the standard way to report errors
   for C functions that don't return object pointers?
      ==0 for error and !=0 for success
   or
      ==0 for success and !=0 for error
   PyArg_ParseTuple returns true on success, PyObject_SetAttr returns
   true on failure; which one is the exception and which one the rule?

4. Assigning to an attribute of an exception object does not change
   the appropriate entry in the args attribute. Is this worth
   changing?

5. UTF-7 decoding does not yet take full advantage of the machinery:
   when an unterminated shift sequence is encountered (e.g. "+xxx"),
   the faulty byte sequence has already been emitted.

Bye,
   Walter Dörwald
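
For illustration, here is a minimal sketch of the name-based error
callback registry discussed above. It assumes the codecs.register_error
and codecs.lookup_error functions as they ship in current Python 3; the
patch under discussion may have differed in detail. The handler name
"hexreplace" and the function hex_replace are hypothetical and exist
only for this example.

```python
import codecs


def hex_replace(exc):
    """Hypothetical handler: render unencodable characters as <U+XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors; re-raise anything else.
        raise exc
    # exc.object is the input string; exc.start and exc.end delimit the
    # unencodable span reported by the codec.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# Register the callback under a name, so any codec can refer to it
# through the ordinary "errors" argument.
codecs.register_error("hexreplace", hex_replace)

encoded = "a\u20acb".encode("ascii", "hexreplace")

# A built-in handler name is used through the same argument.
builtin = "a\u20acb".encode("ascii", "xmlcharrefreplace")

# Looking the name up again returns the registered callable.
same_handler = codecs.lookup_error("hexreplace") is hex_replace
```

Because the handler is looked up by name, the same callback works with
any codec that supports the callback mechanism, which is the
orthogonality argument made above.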