Martin v. Löwis wrote:

> Walter Dörwald wrote:
>
>> This is caused by the changes to the codecs in 2.4. Basically the codecs
>> no longer rely on C's readline() to do line splitting (which can't work
>> for UTF-16), but do it themselves (via unicode.splitlines()).
>
> That explains why you get any calls to IsLineBreak; it doesn't explain
> why you get so many of them.
>
> I investigated this a bit, and one issue seems to be that
> StreamReader.readline performs splitlines on the entire input, only to
> fetch the first line. It then joins the rest for later processing.
> In addition, it also performs splitlines on a single line, just to
> strip any trailing line breaks.

This is because unicode.splitlines() is the only API available to Python
that knows about Unicode line breaks.

> The net effect is that, for a file with N lines, IsLineBreak is invoked
> up to N*N/2 times per character (at least for the last character).
>
> So I think it would be best if Unicode characters exposed a .islinebreak
> method (or, failing that, codecs just knew what the line break
> characters are in Unicode 3.2), and then codecs would split off
> the first line of input itself.

I think a maxsplit argument (just as for unicode.split()) would help too.

> [...]

Bye,
   Walter Dörwald
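
A minimal sketch of the "split off the first line" idea from the quoted proposal, assuming the codec simply knows the set of line break characters. The helper name split_first_line and the LINE_BREAKS set are hypothetical and not part of the codecs module; the set follows the boundaries documented for str.splitlines() in current Python 3 and may differ from what Unicode 3.2 defines. The point is only that the scan stops at the first line break instead of splitting the whole buffer.

```python
# Hypothetical helper for illustration; not the actual codecs implementation.
# Assumption: this set mirrors the line boundaries documented for
# str.splitlines() in current Python 3. The Unicode 3.2 set may differ.
LINE_BREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def split_first_line(data, keepends=True):
    """Return (first_line, rest), scanning only up to the first line break."""
    for i, ch in enumerate(data):
        if ch in LINE_BREAKS:
            end = i + 1
            # Treat "\r\n" as a single line break.
            if ch == "\r" and data[end:end + 1] == "\n":
                end += 1
            line = data[:end] if keepends else data[:i]
            return line, data[end:]
    # No line break found: the whole buffer is one (possibly incomplete) line.
    return data, ""


if __name__ == "__main__":
    buffered = "first\r\nsecond\u2028third"
    line, rest = split_first_line(buffered)
    print(repr(line), repr(rest))
```

With a helper along these lines, each buffered character after the first line is left untouched until its own line is requested, so the per-character line break checks no longer grow with the number of lines remaining in the buffer.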