[Guido, writes a timing program]

[Jeff, if you weren't copied on all this stuff, you can play catch-up by
 reading the archives, at
     http://mail.python.org/pipermail/python-dev/
]

> ...
> I am including the timer program below my signature.  The test input
> was the current access_log of dinsdale.python.org, which has about 119
> Mbytes and 1M lines (as counted by the test program).

For a contrast, I cobbled together a large test file out of various chunks
of C source, .py source, HTML source, and email archives.  I was shooting
for the same size you used (~119Mb), but ended up with more than 3x as many
lines.

> I measure about a factor of 2 between readlines with a sizehint (of 1
> MB) and fileinput;

Factor of 7 here (Jeff, NeilS eventually figured out that Guido was using a
CVS version of Python that has AndrewK's glibc getline patch, a zippier
line-input routine than Python 2.0 has; but it only applies to platforms
using glibc).

> ...
> Output (the first time is realtime seconds, the second CPU seconds):
>
> total 119808333 chars and 1009350 lines
> count_chars_lines     7.944  7.890
> readlines_sizehint    5.375  5.320
> using_fileinput      15.861 15.740
> while_readline        8.648  8.570
>
> This was on a 600 MHz Pentium-III Linux box (RH 6.2).

total 117615824 chars and 3237568
count_chars_lines    14.780 14.772
readlines_sizehint    9.390  9.375
using_fileinput      66.130 66.157
while_readline       30.380 30.337

866 MHz P3 Win98SE, current CVS Python.  I have no handy explanation for why
clock() and time() differ on my box (Win98 has no notions of "user time" or
"CPU time" distinct from clock time).

> Note that count_chars_lines and readlines_sizehint use the same
> algorithm -- the difference is that readlines_sizehint uses 'pass' as
> the inner loop body, while count_chars_lines adds two counters.
>
> Given that very light per-line processing (counting lines and
> characters) already increases the time considerably, I'm not sure I
> buy the arguments that the I/O overhead is always considerable.

I disagree that this is "very light processing", although I agree it's hard
to think of lighter processing <wink>:  it's a few Python statements per
line, which I'd say is pretty *typical* processing.  Read a line, run a
string find or regexp search on it, test the result, sometimes fiddle the
line accordingly and sometimes not.  File-crunching apps generally aren't
rocket science!

For example, I changed count_chars_lines to tally the number of lines
containing the string "Guido" instead, and the runtime went up by just 0.8
seconds (BTW, it found 13808 of them <wink>):  if you're thinking in C
terms, millions of failing searches for "Guido" may seem like more work, but
the number of Python stmts executed usually counts more than what the stmts
do at the C level.

> ...
> Now what to do?  I still don't like xreadlines very much, but I do
> see that it can save some time.  But my test doesn't confirm Neel's
> times as posted by Tim:
>
>> Slowest: for line in fileinput.input('foo'):     # Time 100
>>        : while 1: line = file.readline()         # Time 75
>>        : for line in LinesOf(open('foo')):       # Time 25
>> Fastest: for line in file.readlines():           # Time 10
>>          while 1: lines = file.readlines(hint)   # Time 10
>>          for line in xreadlines(file):           # Time 10
>
> I only see a factor of 3 between fastest and slowest, and
> readline is only about 60% slower than readlines_sizehint.

I don't know what Neel used for an input file, or which platform he used
either.  And this is bound to vary a lot across platforms.  As above, I saw
a factor of 7 between fastest and slowest and a factor of 3 between readline
and readlines_sizehint.

BTW, on my platform the Perl script (using a recent ActiveState Windows
Perl)

open(FILE, "ga.txt");
while (<FILE>) {
    1;
}

ran in about 6 seconds (I never figured how to get Perl to compute usable
timings itself)-- substantially faster than even readlines_sizehint!
--and changing the body to

$nc = $nl = 0;
while (<FILE>) {
    ++$nl;
    $nc += length;
}
print "$nc $nl\n";

boosted that to about 8 seconds.  So Perl has gotten zippier too over the
years.
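
For anyone who wants to try the comparison on their own box, here is a
minimal sketch in present-day Python 3 of the loop shapes discussed in this
message: one readline() call per line, batches from readlines() with a size
hint, and the variant that tallies lines containing "Guido".  It is not
Guido's timer program, which is not reproduced in this message.  The
function names while_readline and readlines_sizehint mirror the labels in
the timing output; count_matching_lines, make_sample and timed are
hypothetical helpers, and the sample file is made-up data rather than the
logs timed here.  Timings on a modern interpreter should not be expected to
match the numbers in this message.

```python
import os
import tempfile
import time


def while_readline(path):
    """One readline() call per line; count characters and lines."""
    nc = nl = 0
    with open(path, "r", encoding="latin-1", newline="") as f:
        while True:
            line = f.readline()
            if not line:          # empty string means end of file
                break
            nl += 1
            nc += len(line)
    return nc, nl


def readlines_sizehint(path, hint=1 << 20):
    """Fetch lines in batches of roughly `hint` bytes; count chars and lines."""
    nc = nl = 0
    with open(path, "r", encoding="latin-1", newline="") as f:
        while True:
            lines = f.readlines(hint)
            if not lines:         # empty list means end of file
                break
            for line in lines:
                nl += 1
                nc += len(line)
    return nc, nl


def count_matching_lines(path, needle="Guido", hint=1 << 20):
    """Same batching, but tally the lines that contain `needle`."""
    hits = 0
    with open(path, "r", encoding="latin-1", newline="") as f:
        while True:
            lines = f.readlines(hint)
            if not lines:
                break
            for line in lines:
                if needle in line:
                    hits += 1
    return hits


def make_sample(n_lines=20000):
    """Write a small throwaway input file (assumed data, for illustration)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="latin-1", newline="") as f:
        for i in range(n_lines):
            if i % 100 == 0:
                f.write("Guido was here\n")
            else:
                f.write("some other line of text\n")
    return path


def timed(func, *args):
    """Return (result, elapsed wall-clock seconds) for one call."""
    t0 = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    path = make_sample()
    try:
        for func in (while_readline, readlines_sizehint, count_matching_lines):
            result, elapsed = timed(func, path)
            print(f"{func.__name__:22s} {elapsed:7.3f}s  {result}")
    finally:
        os.remove(path)
```

The per-line body is deliberately just a statement or two in each function,
so any difference between the loops comes from how the lines are fetched
rather than from what is done with them.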