Here's a first try at a PEP for making Python's codec error handling customizable. This has been briefly discussed before on the I18N-SIG mailing list. A sample implementation of the proposed feature is available as a SourceForge patch.

Bye,
   Walter Dörwald

---------------------------------------------------------------------------
PEP: ???
Title: Codec error handling callbacks
Version: $Revision$
Last-Modified: $Date$
Author: walter@livinglogic.de (Walter Dörwald)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: ??-???-2001
Post-History: 04-Oct-2001


Abstract

    This PEP aims at extending Python's fixed codec error handling
    schemes with a more flexible callback-based approach. Python
    currently uses a fixed set of error handling schemes for codec
    errors. This PEP describes a mechanism that allows Python to use
    function callbacks as error handlers. With these more flexible
    error handlers it is possible to add new functionality to existing
    codecs, e.g. by providing fallback solutions or different
    encodings for cases where the standard codec mapping does not
    apply.


Problem

    A typical case is a character encoding that does not support the
    full range of Unicode characters. For these cases many high-level
    protocols support a way of escaping a Unicode character (e.g.
    Python itself supports the \x, \u and \U conventions, XML supports
    character references via &#xxxx; etc.).
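    (As a side note, not part of the original proposal: the \x/\u
    escaping convention mentioned above can be illustrated with the
    standard library's unicode_escape codec; the snippet below uses
    modern Python 3 syntax.)

```python
# Python's own \x/\u escaping convention, as produced by the stdlib
# "unicode_escape" codec: ASCII characters pass through unchanged,
# other characters become \xXX or \uXXXX escapes.
escaped = "a\u00e4\u20ac".encode("unicode_escape")
```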
    When implementing such an encoding algorithm, a problem with the
    current implementation of the encode method of Unicode objects
    becomes apparent: to determine which characters are unencodable by
    a certain encoding, every single character has to be tried, because
    encode does not provide any information about the location of the
    error(s). So

        # (1)
        us = u"xxx"
        s = us.encode(encoding)

    has to be replaced by

        # (2)
        us = u"xxx"
        v = []
        for c in us:
            try:
                v.append(c.encode(encoding))
            except UnicodeError:
                v.append("&#%d;" % ord(c))
        s = "".join(v)

    This slows down encoding dramatically, as the loop through the
    string is now done in Python code and no longer in C code.

    Furthermore, this solution poses problems with stateful encodings.
    For example, UTF-16 uses a Byte Order Mark at the start of the
    encoded byte string to specify the byte order. Using (2) with
    UTF-16 results in an 8-bit string with a BOM between every
    character. To work around this problem, a stream writer - which
    keeps state between calls to the encoding function - has to be
    used:

        # (3)
        us = u"xxx"
        import codecs, cStringIO as StringIO
        writer = codecs.lookup(encoding)[3]
        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"&#%d;" % ord(c))
        s = v.getvalue()

    To compare the speed of (1) and (3), the following test script was
    used:

        # (4)
        import time
        us = u"äa"*1000000
        encoding = "ascii"
        import codecs, cStringIO as StringIO

        t1 = time.time()
        s1 = us.encode(encoding, "replace")
        t2 = time.time()

        writer = codecs.lookup(encoding)[3]
        v = StringIO.StringIO()
        uv = writer(v)
        for c in us:
            try:
                uv.write(c)
            except UnicodeError:
                uv.write(u"?")
        s2 = v.getvalue()
        t3 = time.time()

        assert(s1==s2)
        print "1:", t2-t1
        print "2:", t3-t2
        print "factor:", (t3-t2)/(t2-t1)

    On Linux with Python 2.1 this gives the following output:

        1: 0.395456075668
        2: 126.909595966
        factor: 320.919575586

    i.e. (3) is 320 times slower than (1).


Solution 1

    Add a position attribute to UnicodeError instances.
    When the encoder encounters an unencodable character, it raises an
    exception that specifies the position in the Unicode object where
    the unencodable character occurs. The application can then encode
    the Unicode object up to the offending character, replace the
    offending character with something appropriate, and retry encoding
    with the rest of the Unicode string until encoding is finished. A
    stream writer will write everything up to the offending character
    and then raise an exception, so the application only has to
    replace the offending character and retry with the rest of the
    string.

    Advantage

        Requires only minor changes to all encoders/stream writers.

    Disadvantages

        As the encoder has to be called multiple times, this won't
        work with stateful encodings, so a stream writer has to be
        used in all cases. If unencodable characters occur very often,
        Solution 1 will probably not be much faster than (3). E.g. for
        the string u"a"*10000+u"ä" all characters but the last one
        will have to be encoded twice when using an encoder (but only
        once when using a stream writer). This solution is specific to
        encoding and can't easily be extended to decoding.


Solution 2

    Add additional error handling names for every needed replacement
    scheme (e.g. "xmlcharrefreplace" for "&#%d;" or "escapereplace"
    for "\\x%02x" / "\\u%04x" / "\\U%08x").

    Advantages

        Requires only minor changes to all encoders/stream writers. As
        the replacement scheme is implemented directly in the encoder,
        this should be the fastest solution.

    Disadvantages

        The available replacement schemes are hardwired. Additional
        replacement schemes require that all encoders/decoders be
        changed again. This is especially problematic for decoding,
        where users might want to implement various error handlers for
        handling broken text files.
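    (To make the retry loop of Solution 1 concrete: modern Python's
    UnicodeEncodeError already carries the error position as its start
    attribute, which plays the role of the proposed position
    attribute. The sketch below uses Python 3 syntax; the function
    name encode_with_fallback is invented for this illustration.)

```python
# Sketch of the retry loop from Solution 1: encode up to the offending
# character, substitute an XML character reference, and continue with
# the rest of the string. UnicodeEncodeError.start stands in for the
# position attribute the solution proposes.
def encode_with_fallback(us, encoding):
    parts = []
    pos = 0
    while pos < len(us):
        try:
            parts.append(us[pos:].encode(encoding))
            break
        except UnicodeEncodeError as e:
            # e.start is relative to the slice us[pos:]
            parts.append(us[pos:pos + e.start].encode(encoding))
            parts.append(b"&#%d;" % ord(us[pos + e.start]))
            pos += e.start + 1
    return b"".join(parts)
```

    Note how every retry re-encodes the characters before the next
    error a second time, which is exactly the double-encoding cost
    described in the Disadvantages above.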
Solution 3

    The errors argument is no longer a string, but a callback
    function: this callback function gets passed the original Unicode
    object and the position of the unencodable character and returns a
    new Unicode object that should be encoded instead of the
    unencodable one. (Note that this requires that the encoder *must*
    be able to encode what is returned from the handler. If not, a
    normal UnicodeError will be raised.) Example code could look like
    this:

        us = u"xxx"
        def xmlescape(uni, pos):
            return u"&#%d;" % ord(uni[pos])
        s = us.encode(encoding, xmlescape)

    Advantages

        This makes the error handling completely customizable. The
        error handler may choose to raise an exception or do any kind
        of replacement required. The interface between the encoder and
        the error handler can be designed in a way that this mechanism
        can be used for decoding too: the original 8-bit string is
        passed to the error handler, and the error handler returns a
        replacement Unicode object and a resynchronization position
        where decoding should continue.

    Disadvantages

        This solution requires changes to every C function that has a
        "const char *errors" argument. E.g.

            PyUnicode_EncodeLatin1(
                const Py_UNICODE *p,
                int size,
                const char *errors)

        has to be changed to

            PyUnicode_EncodeLatin1(
                const Py_UNICODE *p,
                int size,
                PyObject *errors)

        (To provide backwards compatibility, the PyUnicode_EncodeLatin1
        signature remains the same, and a new function
        PyUnicode_EncodeLatin1Ex is introduced with the new signature.
        PyUnicode_EncodeLatin1 simply calls the new function.)


Solution 4

    The errors argument is still a string, but this string is used to
    look up an error handling callback function from a callback
    registry.

    Advantages

        No changes to the Python C API are required. Well-known error
        handling schemes could be implemented directly in the
        encoder/decoder for maximum speed. Like Solution 3, this can
        be done for encoders and decoders.
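    (A registry of exactly this kind is available in today's Python as
    codecs.register_error: a handler registered under a name receives
    the UnicodeEncodeError and returns a (replacement, resume
    position) pair, much like the interface discussed here. The sketch
    below uses Python 3 syntax; the handler name "xmlescape_demo" is
    invented for this example.)

```python
import codecs

# A named error handler looked up from a registry: it receives the
# exception and returns (replacement string, position to resume at).
def xmlescape(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    refs = "".join("&#%d;" % ord(c) for c in exc.object[exc.start:exc.end])
    return (refs, exc.end)

# Register under an invented name, then select it via the errors string.
codecs.register_error("xmlescape_demo", xmlescape)
encoded = "a\u00e4b".encode("ascii", "xmlescape_demo")
```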
    Disadvantages

        Currently all encoding/decoding functions have arguments

            const Py_UNICODE *p, int size

        or

            const char *p, int size

        to specify the Unicode characters/8-bit characters to be
        encoded/decoded. In case of an error, a new Unicode or str
        object has to be created from this information and passed to
        the error handler. This has to be done either for every
        occurring error, or once on the first error, in which case the
        resulting object must be kept until the end of the function in
        case another error occurs. Most functions that call the codec
        functions work with Unicode/str objects anyway, so they have
        to extract the const Py_UNICODE */int arguments and pass them
        to the codec functions, which then have to reconstruct the
        object in case of an error.


Sample implementation

    A sample implementation is available on SourceForge [1]. This
    patch implements a combination of Solutions 3 and 4, i.e. it is
    possible to pass both functions *and* registered callback names to
    unicode.encode. The interface between the encoder and the handler
    has been extended to support the same interface for encoding and
    decoding.

    The encoder/decoder passes a tuple to the callback with the
    following entries:

    0: the name of the encoding
    1: the original Unicode object/the original 8-bit string
    2: the position of the unencodable character/byte
    3: the reason why the character/byte can't be encoded/decoded, as
       a string (e.g. "character not in range(128)" for encoding, or
       "unexpected end of data", "invalid code byte" etc. for
       decoding)
    4: an additional object that can be used to pass state information
       to the callback. All implemented encoders/decoders currently
       pass None for this.

    The callback must return a tuple with the following entries:

    0: a Unicode object which will be encoded instead of the offending
       character (for encoders);
       a Unicode object that will be used as a replacement for the
       undecodable bytes in the decoded Unicode object (for decoders)
    1: a position where the encoding/decoding process should continue
       after processing the replacement string

    The patch includes several preregistered encode error handler
    schemes with the following names:

        "strict", "ignore", "replace", "xmlcharrefreplace",
        "escapereplace"

    and decode error handlers with the names:

        "strict", "ignore", "replace"

    The patch also applies the C API change discussed in the
    Disadvantages of Solution 4: the parameters

        const Py_UNICODE *p, int size

    have been replaced by

        PyObject *p

    so that all functions get the Unicode object directly as a
    PyObject * and can pass it along to the error callback. For
    further details see the patch on SourceForge.


References

    [1] Python patch #432401 "unicode encoding error callbacks"
        http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=432401

    [2] Previous discussions on this topic on the I18N-SIG mailing list
        http://mail.python.org/pipermail/i18n-sig/2000-December/000587.html
        http://mail.python.org/pipermail/i18n-sig/2001-June/001173.html


Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: