Martin v. Loewis wrote:
> Walter Dörwald <walter@livinglogic.de> writes:
>
>>Output is as follows:
>>1790000 chars, 2.330% unenc
>>ignore: 0.022 (factor=1.000)
>>xmlcharrefreplace: 0.044 (factor=1.962)
>>xml2: 0.267 (factor=12.003)
>>xml3: 0.723 (factor=32.506)
>>workaround: 5.151 (factor=231.702)
>>i.e. a 1.7MB string with 2.3% unencodable characters was
>>encoded.
>
> Those numbers are impressive. Can you please add
>
> def xml4(exc):
>     if isinstance(exc, UnicodeEncodeError):
>         if exc.end-exc.start == 1:
>             return u"&#"+str(ord(exc.object[exc.start]))+u";"
>         else:
>             r = []
>             for c in exc.object[exc.start:exc.end]:
>                 r.extend([u"&#", str(ord(c)), u";"])
>             return u"".join(r)
>     else:
>         raise TypeError("don't know how to handle %r" % exc)
>
> and report how that performs (assuming I made no error)?

You must return a tuple (replacement, new input position);
otherwise the code is correct. I tried it and two new versions:

def xml5(exc):
    if isinstance(exc, UnicodeEncodeError):
        return (u"&#%d;" % ord(exc.object[exc.start]), exc.start+1)
    else:
        raise TypeError("don't know how to handle %r" % exc)

def xml6(exc):
    if isinstance(exc, UnicodeEncodeError):
        return (u"&#" + str(ord(exc.object[exc.start])) + u";", exc.start+1)
    else:
        raise TypeError("don't know how to handle %r" % exc)

Here are the results:

1790000 chars, 2.330% unenc
ignore: 0.022 (factor=1.000)
xmlcharrefreplace: 0.042 (factor=1.935)
xml2: 0.264 (factor=12.084)
xml3: 0.733 (factor=33.529)
xml4: 0.504 (factor=23.057)
xml5: 0.474 (factor=21.649)
xml6: 0.481 (factor=22.010)
workaround: 5.138 (factor=234.862)

>>Using a callback instead of the inline implementation is a factor of
>>12 slower than ignore.
>
> For the purpose of comparing C and Python, this isn't relevant, is it?
> Only the C version of xmlcharrefreplace and a Python version should be
> compared.

I was just too lazy to code this. ;)

Python is a factor of 2.7 slower than the C callback (xml3 vs. xml2),
or 1.9 for your version (xml4 vs. xml2).
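
For reference, the tuple contract mentioned above (replacement, new input position) could be sketched roughly as below. This is only an illustration, not the benchmark code: the registration name "xml4-demo" and the sample string are assumptions made up for this example, and the handler is a tuple-returning variant of xml4 written against the codecs.register_error() interface as found in current Python.

```python
import codecs

def xml4(exc):
    # Replace the whole unencodable run exc.object[exc.start:exc.end]
    # with numeric character references.
    if isinstance(exc, UnicodeEncodeError):
        r = []
        for c in exc.object[exc.start:exc.end]:
            r.extend([u"&#", str(ord(c)), u";"])
        # Return (replacement, position at which encoding resumes).
        return (u"".join(r), exc.end)
    else:
        raise TypeError("don't know how to handle %r" % exc)

# "xml4-demo" is a hypothetical name chosen for this sketch.
codecs.register_error("xml4-demo", xml4)

sample = u"a\u20acb\u4e2d\u6587"  # made-up test string
encoded = sample.encode("ascii", "xml4-demo")
builtin = sample.encode("ascii", "xmlcharrefreplace")
```

Since the handler covers the full range start..end, it is called once per run of unencodable characters rather than once per character, and it should produce the same bytes as the built-in xmlcharrefreplace handler.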
>>It can't really be fixed for codecs implemented in Python. For codecs
>>that use the C functions we could add the functionality that e.g.
>>PyUnicodeEncodeError_SetReason(exc) sets exc.reason and exc.args[3],
>>but AFAICT it can't be done easily for Python where attribute assignment
>>directly goes to the instance dict.
>
> You could add methods into the class set_reason etc, which error
> handler authors would have to use.
>
> Again, these methods could be added through Python code, so no C code
> would be necessary to implement them.
>
> You could even implement a setattr method in Python - although you'd
> have to search this from C while initializing the class.

For me this sounds much more complicated than the current C functions,
especially for using them from C, which most codecs probably will.

Bye,
   Walter Dörwald