Tim's almost as good at convincing me as he is at channeling me! The timings he showed almost convinced me that fileinput is hopeless and xreadlines should be added. But then I wrote a little timer of my own...

I am including the timer program below my signature. The test input was the current access_log of dinsdale.python.org, which has about 119 Mbytes and 1M lines (as counted by the test program).

I measure about a factor of 2 between readlines with a sizehint (of 1 MB) and fileinput; a change to fileinput that uses readlines with a sizehint and in-lines the common case in __getitem__ (as suggested by Moshe) didn't make a difference.

Output (the first time is realtime seconds, the second CPU seconds):

total 119808333 chars and 1009350 lines
count_chars_lines     7.944  7.890
readlines_sizehint    5.375  5.320
using_fileinput      15.861 15.740
while_readline        8.648  8.570

This was on a 600 MHz Pentium-III Linux box (RH 6.2).

Note that count_chars_lines and readlines_sizehint use the same algorithm -- the difference is that readlines_sizehint uses 'pass' as the inner loop body, while count_chars_lines adds two counters. Given that very light per-line processing (counting lines and characters) already increases the time considerably, I'm not sure I buy the arguments that the I/O overhead is always considerable.

The fact that my change to fileinput.py didn't make a difference suggests that its lack of speed is purely caused by the Python code.

Now what to do? I still don't like xreadlines very much, but I do see that it can save some time.
But my test doesn't confirm Neel's times as posted by Tim:

> Slowest: for line in fileinput.input('foo'):     # Time 100
>        : while 1: line = file.readline()         # Time 75
>        : for line in LinesOf(open('foo')):       # Time 25
> Fastest: for line in file.readlines():           # Time 10
>          while 1: lines = file.readlines(hint)   # Time 10
>          for line in xreadlines(file):           # Time 10

I only see a factor of 3 between fastest and slowest, and readline is only about 60% slower than readlines_sizehint.

--Guido van Rossum (home page: http://www.python.org/~guido/)

import time, fileinput, sys

def timer(func, *args):
    t0 = time.time()
    c0 = time.clock()
    func(*args)
    t1 = time.time()
    c1 = time.clock()
    print "%-20s %6.3f %6.3f" % (func.__name__, t1-t0, c1-c0)

def count_chars_lines(fn, bs=1024*1024):
    nl = 0
    nc = 0
    f = open(fn, "r")
    while 1:
        buf = f.readlines(bs)
        if not buf:
            break
        for line in buf:
            nl += 1
            nc += len(line)
    f.close()
    print "total", nc, "chars and", nl, "lines"

def readlines_sizehint(fn, bs=1024*1024):
    f = open(fn, "r")
    while 1:
        buf = f.readlines(bs)
        if not buf:
            break
        for line in buf:
            pass
    f.close()

def using_fileinput(fn):
    f = fileinput.FileInput(fn)
    for line in f:
        pass
    f.close()

def while_readline(fn):
    f = open(fn, "r")
    while 1:
        line = f.readline()
        if not line:
            break
        pass
    f.close()

fn = "/home/guido/access_log"
if sys.argv[1:]:
    fn = sys.argv[1]
timer(count_chars_lines, fn)
timer(readlines_sizehint, fn, 1024*1024)
timer(using_fileinput, fn)
timer(while_readline, fn)
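
The timer above relies on the print statement and time.clock(), neither of which is available in current Python 3 (time.clock() was removed in 3.8). What follows is only a minimal sketch of how the same kind of measurement might look today, assuming time.perf_counter() for realtime seconds and time.process_time() for CPU seconds. make_sample() and the file it writes are hypothetical stand-ins for the access_log, and the functions here return their counts instead of printing them, so any timings it produces are not comparable to the figures reported above.

```python
import os
import tempfile
import time


def timer(func, *args):
    """Run func(*args); return (result, realtime seconds, CPU seconds)."""
    t0 = time.perf_counter()   # realtime (wall-clock) seconds
    c0 = time.process_time()   # CPU seconds; assumed stand-in for time.clock()
    result = func(*args)
    wall = time.perf_counter() - t0
    cpu = time.process_time() - c0
    print("%-20s %6.3f %6.3f" % (func.__name__, wall, cpu))
    return result, wall, cpu


def count_chars_lines(fn, bs=1024 * 1024):
    """Count characters and lines, reading roughly bs bytes of lines per call."""
    nl = nc = 0
    with open(fn, "r") as f:
        while True:
            buf = f.readlines(bs)
            if not buf:
                break
            for line in buf:
                nl += 1
                nc += len(line)
    return nc, nl


def while_readline(fn):
    """Read one line per call; returns the line count (the original used 'pass')."""
    nl = 0
    with open(fn, "r") as f:
        while True:
            line = f.readline()
            if not line:
                break
            nl += 1
    return nl


def make_sample(lines=1000):
    """Write a small throwaway file (hypothetical data, not the real access_log)."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as f:
        for i in range(lines):
            f.write("line %d\n" % i)
    return path


if __name__ == "__main__":
    path = make_sample()
    try:
        timer(count_chars_lines, path)
        timer(while_readline, path)
    finally:
        os.remove(path)
```

On a file this small the timings are dominated by noise; a meaningful comparison would need an input closer in size to the one used above.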