On 25.03.12 20:01, Antoine Pitrou wrote:
> The general problem with decoding is that you don't know up front what
> width (1, 2 or 4 bytes) is required for the result. The solution is
> either to compute the width in a first pass (and decode in a second
> pass), or decode in a single pass and enlarge the result on the fly
> when needed. Both incur a slowdown compared to a single-size
> representation.

We can significantly reduce the number of checks by using the same trick that is used for fast checking of surrogate characters. While all characters are < U+0100, we know that the result is a 1-byte string (ASCII while all characters are < U+0080). When we meet a character >= U+0100, then while all remaining characters are < U+10000, we know that the result is a 2-byte string. As soon as we meet the first character >= U+10000, we work with a 4-byte string. So there are several fast loops, and the transition to the next loop occurs after a failure in the previous one; only the already-decoded prefix has to be widened at each transition. (A rough sketch of this cascade is at the end of this message.)

> It's probably a measurement error on your part.

Anyone can test:

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop

The results are fairly stable (±0.1 µsec) from run to run. It looks like a funny thing: the all-ASCII input is decoded about twice as slowly as the non-ASCII one.
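
To make the cascade concrete, here is a minimal C sketch of the three fast loops with on-the-fly widening. This is an illustration only, not CPython's actual code: the Str type, the function name and the widen-by-copy steps are invented for the example, and the input is assumed to be already-decoded code points so that only the width logic is shown.

/* Sketch only, not CPython's implementation.  Input: an array of
 * code points.  Output: the narrowest representation (1, 2 or 4
 * bytes per character), found with three cascading fast loops. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int kind;     /* bytes per character: 1, 2 or 4 */
    size_t len;
    void *data;
} Str;            /* hypothetical type, stands in for the real object */

static Str *build_narrowest(const uint32_t *src, size_t len)
{
    Str *s = malloc(sizeof(Str));
    size_t i = 0, j;

    if (s == NULL)
        return NULL;
    s->kind = 1;
    s->len = len;
    s->data = malloc(len ? len : 1);   /* optimistic: 1 byte/char */
    if (s->data == NULL) { free(s); return NULL; }

    /* Fast loop #1: valid while every character is < U+0100. */
    uint8_t *out1 = s->data;
    for (; i < len; i++) {
        if (src[i] >= 0x100)
            break;                     /* failure: widen to 2 bytes */
        out1[i] = (uint8_t)src[i];
    }
    if (i == len)
        return s;

    /* Widen 1 -> 2 bytes; only the already-built prefix is copied. */
    uint16_t *out2 = malloc(len * 2);
    if (out2 == NULL) { free(s->data); free(s); return NULL; }
    for (j = 0; j < i; j++)
        out2[j] = out1[j];
    free(s->data);
    s->data = out2;
    s->kind = 2;

    /* Fast loop #2: valid while every character is < U+10000. */
    for (; i < len; i++) {
        if (src[i] >= 0x10000)
            break;                     /* failure: widen to 4 bytes */
        out2[i] = (uint16_t)src[i];
    }
    if (i == len)
        return s;

    /* Widen 2 -> 4 bytes; after this no width check can ever fail. */
    uint32_t *out4 = malloc(len * 4);
    if (out4 == NULL) { free(s->data); free(s); return NULL; }
    for (j = 0; j < i; j++)
        out4[j] = out2[j];
    free(s->data);
    s->data = out4;
    s->kind = 4;

    for (; i < len; i++)               /* final loop: no checks left */
        out4[i] = src[i];
    return s;
}

Each loop carries no width branch beyond the single comparison that ejects it, so for typical inputs almost all characters are handled by the one loop that matches the string's final width.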