On 17.03.16 16:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).

The simplest way to detect the encoding of a bytes string:

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all the data into lines, the most efficient
way in Python 3.5 is:

    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5, io.BytesIO(data) has constant complexity. In older
versions, to detect the encoding without copying the data or splitting
it all into lines, you have to write a line iterator. For example:

    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]

    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or

    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
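
(For reference, the lower-level helper Guido describes would mostly amount
to applying the PEP 263 coding-cookie regex to a single line. A rough
sketch, assuming the caller handles the BOM, the default encoding and
trying up to two lines itself; the name coding_spec and the exact pattern
are only illustrative, not anything tokenize currently exposes:

    import re

    # PEP 263 coding cookie, approximately as tokenize matches it internally.
    _cookie_re = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def coding_spec(line):
        """Return the encoding declared in a single source line, or None."""
        m = _cookie_re.match(line)
        return m.group(1).decode('ascii') if m else None

The rest of detect_encoding's logic -- reading at most two lines, checking
for a UTF-8 BOM, verifying that the declared encoding is usable -- would
then live in the caller.)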