On Tue, Sep 28, 2010 at 9:29 PM, Michael Foord <fuzzyman at voidspace.org.uk> wrote:
> On 28/09/2010 12:19, Antoine Pitrou wrote:
>> On Mon, 27 Sep 2010 23:45:45 -0400
>> Steve Holden<steve at holdenweb.com> wrote:
>>> On 9/27/2010 11:27 PM, Benjamin Peterson wrote:
>>>> Tokenize only works on bytes. You can open a feature request if you
>>>> desire.
>>>>
>>> Working only on bytes does seem rather perverse.
>>
>> I agree, the morality of bytes objects could have been better :)
>>
> The reason for working with bytes is that source data can only be correctly
> decoded to text once the encoding is known. The encoding is determined by
> reading the encoding cookie.
>
> I certainly wouldn't be opposed to an API that accepts a string as well
> though.

A very quick scan of _tokenize suggests it is designed to support
detect_encoding returning None to indicate that the line iterator will
return already decoded lines. This is confirmed by the fact that the
standard library uses it that way (via generate_tokens).

An API that accepts a string, wraps a StringIO around it, then calls
_tokenize with an encoding of None would appear to be the answer here.
A feature request on the tracker is the best way to make that happen.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
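
For illustration, the string-accepting wrapper described above might look
roughly like the sketch below. The name tokenize_string is hypothetical and
not part of the tokenize module. The sketch goes through the public
generate_tokens function rather than calling the private _tokenize directly,
on the assumption that generate_tokens treats the lines it reads as already
decoded text.

```python
import io
import tokenize


def tokenize_string(source):
    """Hypothetical helper: tokenize source text that is already a str.

    Wraps the string in a StringIO and hands its readline method to
    tokenize.generate_tokens, so no encoding detection takes place.
    """
    readline = io.StringIO(source).readline
    return list(tokenize.generate_tokens(readline))


tokens = tokenize_string("x = 1\n")
for tok in tokens:
    # tok_name maps the numeric token type to a readable name.
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Because the input never passes through detect_encoding, the resulting token
stream should not begin with an ENCODING token, unlike the output of
tokenize.tokenize on a bytes stream.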