FYI. It would be interesting to collect some feedback on these ideas for
popular combos. Who knows...

Forwarded message:
> From morton@dennisinter.com Fri Jul 28 20:48:29 2000
> From: morton@dennisinter.com (Damien Morton)
> To: <Vladimir.Marangozov@inrialpes.fr>
> Subject: further optimising the micro-optimisations for cache locality
> Date: Fri, 28 Jul 2000 14:34:29 -0400
> Message-ID: <NDBBLENPLCPAHKODFHNGOEADEBAA.morton@dennisinter.com>
>
> Folded version:
>
>     case BINARY_ADD:
>         w = POP();
>         v = POP();
>         x = PyNumber_Add(v, w);
>         goto end_binop;
>
>     case BINARY_MULTIPLY:
>         w = POP();
>         v = POP();
>         x = PyNumber_Multiply(v, w);
>         goto end_binop;
>
>     ...
>     end_binop:
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>
> Further folded version:
>
>     case BINARY_ADD:
>         f = PyNumber_Add;
>         goto do_binop;
>
>     case BINARY_MULTIPLY:
>         f = PyNumber_Multiply;
>         goto do_binop;
>
>     ...
>     do_binop:
>         w = POP();
>         v = POP();
>         x = f(v, w);
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>
> Further further folded version:
> (I haven't looked at the main loop code, but can guess its structure.)
>
> I assume 'op' is the opcode we are switching on.
>
>     // the function to call for this op
>     PyObject *(*op_func[])() = { ..., PyNumber_Add, PyNumber_Multiply, ... };
>
>     // the kind of op this func describes
>     unsigned char op_type[] = { ..., DO_BINOP1, DO_BINOP2, DO_UNOP, ...
>     };
>
> // these two tables will be cached because of the frequency they are accessed
> // I've used a table of pointers and a table of bytes to reduce the memory
> // required
> // because the tables are small, locality within and between the tables isn't
> // important
> // might be an idea to force the tables into contiguous memory somehow
> // I could also have used a table of structs, but alignment would increase
> // memory usage
>
>     unsigned char op;
>
>     switch (op_type[op]) {
>     case DO_BINOP1:
>         w = POP();
>         v = POP();
>         x = op_func[op](v, w);
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>     case DO_BINOP2:
>         w = POP();
>         v = POP();
>         x = op_func[op](v, w, Py_None);
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>     case DO_UNOP:
>         v = POP();
>         x = op_func[op](v);
>         Py_DECREF(v);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>     }
>
> ===================== original message =================================
>
> Guido van Rossum wrote:
> >
> > > [me, on micro-optimizing the main loop]
> >
> > Fair enough (for one particular CPU at least).
>
> Since we're at it, it's worth mentioning another conclusion we came across
> at the time: the cache effects in the main loop are significant -- it is
> more important to keep the main loop small enough that those effects are
> minimized.
>
> An experiment I did at the time which gave some delta-speedup:
> I folded most of the UNOP & BINOP opcodes, since they differ only by the
> function they call and they duplicate most of the opcode body. Like this:
>
> Original:
>
>     case BINARY_POWER:
>         w = POP();
>         v = POP();
>         x = PyNumber_Power(v, w, Py_None);
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>
>     case BINARY_MULTIPLY:
>         w = POP();
>         v = POP();
>         x = PyNumber_Multiply(v, w);
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>
>     ...
>
> Folded version:
>
>     case BINARY_POWER:
>         w = POP();
>         v = POP();
>         x = PyNumber_Power(v, w, Py_None);
>         goto end_binop;
>
>     case BINARY_MULTIPLY:
>         w = POP();
>         v = POP();
>         x = PyNumber_Multiply(v, w);
>         goto end_binop;
>
>     ...
>     end_binop:
>         Py_DECREF(v);
>         Py_DECREF(w);
>         PUSH(x);
>         if (x != NULL) continue;
>         break;
>
> This reduced the code size of ceval.c, which resulted in fewer cache effects
> and gave more speed, despite the additional jumps. It possibly results in
> fewer page faults too, although this is unlikely.
>
> Which makes me think that, if we want to do something about cache effects,
> it is probably not a bad idea to just "reorder" the bytecodes in the big
> switch by decreasing frequency (we have some stats about this -- I believe
> Skip and MAL have discussed the opcodes' frequency and the charts lie
> somewhere in the archives). I remember Marc-Andre had done something in
> this direction and reported some perf improvements too. Since reordering
> the opcodes doesn't really hurt, if I'm about to do something with the
> main loop, it'll be only this.
>
> > > Micro-optimization isn't worth the effort. As to the 2-byte arg limit,
> > > I think I'd be happy if the compiler raised an exception if this limit
> > > is exceeded.
> >
> > That would be a good first step indeed!
>
> I'll try to have a look, if someone else is not more reactive than me.
>
> --
>        Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
> http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

--
       Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252