Hi,

pje wrote:
> I think the idea is more that you could skip the 'Py_INCREF(Py_None)',
> which is a fairly common prelude to 'return Py_None'.  You'd set their
> refcounts to the maximum possible value, and the deallocation function
> would simply reset the refcount to maximum again.
>
> I'm not sure, however, that this would be common enough to be helpful.  It
> seems to me Py_INCREF should effectively translate to only a single machine
> instruction or two.  I mean, it's just incrementing an integer whose
> address is known at compile time.

I don't think it would matter much performance-wise. A quick scan does not
reveal a single Py_INCREF(Py_None) in any hot code section. But with more
than 800 occurrences it would reduce the code size a bit (5-10K). That
alone is not a good reason to throw it out.

Checking for Py_None in Py_DECREF is counter-productive because that would
need another branch. The existing branch is always taken with Py_None since
the refcount never goes to zero, so this is not an issue.

But let's look at the general case. Here's a common fragment of inlined x86
code for Py_DECREF:

    mov    -something(%ebp),%esi   ; odep %esi
    mov    (%esi),%eax             ; idep %esi, odep %eax
    dec    %eax                    ; idep %eax, odep %eax
    test   %eax,%eax               ; [idep %eax], odep flags
    mov    %eax,(%esi)             ; [idep %eax %esi]
    jne    .L1                     ; idep flags, predict
    sub    $0xc,%esp
    mov    0x4(%esi),%eax
    push   %esi
    call   *0x18(%eax)
    add    $0x10,%esp
.L1:

There are at least 2 to 3 data dependencies that are hard to untangle.
Tough luck.

The other problem is branch prediction. I've done a little analysis with
gcov on ceval.c and found that the situation is not that bad. The Py_DECREF
occurrences that account for 90% of the time are almost exclusively 0/100
or 100/0 cases. The remaining 10% have a few 30/70 or 50/50 cases. The
folded and weighted average is around 4/96. The branch misprediction
penalty accounts for less than 0.1% of the runtime.

But the deallocation itself is costly. On average the deallocator is called
15% of the time (I checked ceval.c only). This is a lot and accounts for up
to 5% of the runtime (this is based on an overall profile). My guess is
that allocation is a bit cheaper, but adds another 2-3%. I think tuples are
the worst offender (again only a good guess from my profiling experiments).

To summarize: I think we can optimize the GC a little more (I have not done
an in-depth analysis yet), but we really should focus on making Python less
malloc-happy. Suggestions welcome!

Bye, Mike
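
[Editorial sketch] The idea quoted at the top of the message can be made
concrete with a minimal C fragment. This is illustration only, not how
CPython actually implements None; the function names are invented, and the
"immortal" variant is only safe if None's refcount is pinned so it can never
reach zero.

    #include <Python.h>

    static PyObject *
    return_none_today(void)
    {
        Py_INCREF(Py_None);    /* the extra instruction at every call site */
        return Py_None;
    }

    static PyObject *
    return_none_if_immortal(void)
    {
        return Py_None;        /* no incref needed if None can never be freed */
    }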
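
[Editorial sketch] For readers who don't speak x86: the inlined fragment in
the message corresponds roughly to the following C, simplified from the
Py_DECREF macro of that era (reference-count debugging and bookkeeping
variants omitted). The load of ob_type and the indirect call through
tp_dealloc are the "mov 0x4(%esi)" and "call *0x18(%eax)" lines.

    #include <Python.h>

    static void
    decref_sketch(PyObject *op)
    {
        if (--op->ob_refcnt != 0)
            ;                                  /* common case: branch taken, nothing to do */
        else
            (*op->ob_type->tp_dealloc)(op);    /* the costly deallocation path */
    }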
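
[Editorial sketch] One common direction for the "less malloc-happy" point is
recycling objects of a fixed size through a free list instead of calling
malloc/free every time; CPython uses variations of this for several object
types. The following is a generic illustration of the pattern, not CPython
code, with an unbounded list kept deliberately simple.

    #include <stdlib.h>

    #define OBJ_SIZE 32   /* every object served by this pool has the same size */

    typedef union slot {
        union slot *next;          /* link used while the slot sits on the free list */
        char payload[OBJ_SIZE];    /* storage used while the slot is live */
    } slot;

    static slot *free_list = NULL;

    static void *
    obj_alloc(void)
    {
        if (free_list != NULL) {       /* reuse a recycled slot: no malloc */
            slot *s = free_list;
            free_list = s->next;
            return s;
        }
        return malloc(sizeof(slot));   /* fall back to the allocator */
    }

    static void
    obj_free(void *p)
    {
        slot *s = (slot *)p;
        s->next = free_list;           /* keep the slot for reuse instead of free() */
        free_list = s;
    }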