[Tim, to MarkF]
>> You average over 255 chars/line?  [nag, nag, nag]

[Mark Favas]
> Real-life input, my boy!  It's actually a syslog from my
> mailserver, consisting mainly of sendmail log messages, and I
> have a current need to process these things (MS Exchange,
> corrupted database, clobbered backup tapes), so this thread
> came along at the right time...

Hmm.  I tuned ms_getline_hack for Guido's logfiles, which he said don't
often exceed 160 chars/line.  I guess if you're on a 64-bit platform,
though, it must take about twice as many chars per line to record a log
msg <wink>.

> ...
> Removing the buffer size arg in the call to readlines_sizehint results
> in this (using up-to-the-minute CVS):
> total 131426612 chars and 514216 lines
> count_chars_lines      4.922   4.916
> readlines_sizehint     3.881   3.850
> using_fileinput       10.371  10.366
> while_readline        10.943  10.916
> for_xreadlines         2.990   2.967
>
> and with an 8Kb sizehint:
> total 131426612 chars and 514216 lines
> count_chars_lines      5.241   5.216
> readlines_sizehint     2.917   2.900
> using_fileinput       10.351  10.333
> while_readline        10.990  10.983
> for_xreadlines         2.877   2.867

That's sure consistent across platforms, then.  I guess we'll write it
off to "cache effects" (a catch-all explanation for any timing mystery --
go ahead, just *try* to prove it's wrong <0.5 wink>).

[and Mark has HAVE_GETC_UNLOCKED on his Tru64 Unix box, yet
 using_fileinput is quicker than while_readline]

> With USE_MS_GETLINE_HACK and HAVE_GETC_UNLOCKED both #define'd
> (although defining the former makes the latter def irrelevant):
> (test_bufio also OK)
> total 131426612 chars and 514216 lines
> count_chars_lines      5.056   5.050
> readlines_sizehint     3.771   3.667
> using_fileinput       11.128  11.116
> while_readline         8.287   8.233
> for_xreadlines         3.090   3.083

So ms_getline_hack is significantly faster on your box (I'm only looking
at while_readline:  11 using getc_unlocked, 8.3 using ms_getline_hack).
There are only two reasons I can imagine for that:

1. Your vendor optimizes the inner loop in fgets (as all vendors
   should, but few do).

and/or

2. Despite the long average length of your lines, many of them are
   nevertheless shorter than 200 chars, and so all the pain
   ms_getline_hack endures to avoid a realloc pays off.

Unfortunately, there's not enough info to figure out if either, both, or
none of those are on-target.  It's such a large percentage speedup,
though, that my bet goes primarily to #1 -- unless realloc is really pig
slow on your box.  Which some things *are*:

> With USE_MS_GETLINE_HACK and HAVE_GETC_UNLOCKED both #undef'ed (just
> for completeness):
> total 131426612 chars and 514216 lines
> count_chars_lines      4.916   4.900
> readlines_sizehint     3.875   3.867
> using_fileinput       14.404  14.383
> while_readline       322.728 321.837
> for_xreadlines         7.113   7.100
>
> So, having HAVE_GETC_UNLOCKED #define'd does make a small improvement
> <grin>

Yes, that's the "platform from Mars" evidence I was seeking:  if
ms_getline_hack survives test_bufio on *your* crazy box, it's as close to
provably correct as any algorithm in all of Computer Science <wink>.

a-factor-of-39-is-almost-big-enough-to-notice!-ly y'rs  - tim
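
For anyone following along without the benchmark script at hand, here is a rough sketch of what the timed strategies look like in present-day Python. This is not the script Mark ran: the function bodies are assumptions reconstructed from the names in the tables, for_lines stands in for for_xreadlines (xreadlines no longer exists in Python 3; iterating the file object is the closest equivalent), and the 8 KB default merely mirrors the sizehint mentioned above.

```python
import fileinput


def while_readline(path):
    """One readline() call per line; the variant most sensitive to getline speed."""
    chars = lines = 0
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:          # empty bytes object means end of file
                break
            chars += len(line)
            lines += 1
    return chars, lines


def readlines_sizehint(path, sizehint=8 * 1024):
    """Read batches of whole lines totalling roughly `sizehint` bytes each."""
    chars = lines = 0
    with open(path, "rb") as f:
        while True:
            batch = f.readlines(sizehint)
            if not batch:         # empty list means end of file
                break
            for line in batch:
                chars += len(line)
            lines += len(batch)
    return chars, lines


def for_lines(path):
    """Iterate the file object directly (assumed stand-in for xreadlines)."""
    chars = lines = 0
    with open(path, "rb") as f:
        for line in f:
            chars += len(line)
            lines += 1
    return chars, lines


def using_fileinput(path):
    """Same count via the fileinput module."""
    chars = lines = 0
    with fileinput.FileInput(files=[path], mode="rb") as stream:
        for line in stream:
            chars += len(line)
            lines += 1
    return chars, lines
```

All four should agree on the (chars, lines) totals for a given file; only the time spent getting there differs, and that difference is what the tables above are measuring.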