Compilers are notoriously bad at compiling the main loop of a programming language interpreter, and the CPython interpreter main loop is no exception: it is hard to compile it perfectly. The difficulty of compilation scales with the number of opcodes which the interpreter has, which in the case of CPython is more than 100, but we can actually get a feel for how well a compiler is doing by looking at just one opcode.
For this exercise I'll look at CPython 3.6's LOAD_FAST
opcode, which in C is:
TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);
    if (value == NULL) {
        format_exc_check_arg(PyExc_UnboundLocalError,
                             UNBOUNDLOCAL_ERROR_MSG,
                             PyTuple_GetItem(co->co_varnames, oparg));
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    FAST_DISPATCH();
}
This opcode does a very small task: it loads a single local variable and pushes it onto the Python stack. After expanding various macros, the code becomes:
TARGET(LOAD_FAST) {
    PyObject *value = fastlocals[oparg];
    if (value == NULL) {
        /* ...error path elided (see the unexpanded code above)... */
    }
    value->ob_refcnt++;
    *stack_pointer++ = value;
    if (_Py_TracingPossible) {
        /* ...fall back to the slower dispatch path, elided... */
    }
    f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr));
    uint16_t word = *next_instr;
    opcode = word & 255;
    oparg = word >> 8;
    next_instr++;
    goto *opcode_targets[opcode];
}
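The goto *opcode_targets[opcode] at the end is "computed goto" dispatch (the GNU C labels-as-values extension): each opcode ends by fetching, decoding, and jumping straight to the next opcode's code. As a minimal sketch of the technique - a hypothetical three-opcode machine, not CPython's actual code:

```c
#include <stdint.h>

/* Hypothetical three-opcode machine: each instruction is one 16-bit
   word, with the opcode in the low byte and oparg in the high byte. */
enum { OP_PUSH, OP_ADD, OP_RET };

static int interp(const uint16_t *next_instr) {
    /* Jump table of label addresses (GNU C extension, as used by
       CPython when the compiler supports it). */
    void *targets[] = { &&op_push, &&op_add, &&op_ret };
    int stack[16], *sp = stack;
    int opcode, oparg;

    /* Fetch, decode, and jump - replicated at the end of every
       opcode, just as FAST_DISPATCH() replicates it in CPython. */
#define DISPATCH() do {                 \
        uint16_t word = *next_instr++;  \
        opcode = word & 255;            \
        oparg = word >> 8;              \
        goto *targets[opcode];          \
    } while (0)

    DISPATCH();
op_push:
    *sp++ = oparg;        /* push the immediate operand */
    DISPATCH();
op_add:
    sp--;
    sp[-1] += *sp;        /* pop two values, push their sum */
    DISPATCH();
op_ret:
    return sp[-1];
#undef DISPATCH
}
```

Compiled with gcc or clang, each DISPATCH() expands to roughly the fetch/decode/indirect-jump sequence visible at the end of the assembly listings that follow.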
With this code in hand, we can start looking at what compilers do with it, starting with gcc on Linux x64:
48 8B 44 24 38 mov rax, [rsp+38h]
49 63 F5 movsxd rsi, r13d
48 8B 04 F0 mov rax, [rax+rsi*8]
48 85 C0 test rax, rax
0F 84 86 3A 00 00 jz loc_545A2B
4C 89 F2 mov rdx, r14
48 83 00 01 add qword ptr [rax], 1
49 83 C6 08 add r14, 8
48 89 02 mov [rdx], rax
8B 05 97 AA 3B 00 mov eax, cs:_Py_TracingPossible
85 C0 test eax, eax
0F 85 A3 CD FF FF jnz loc_53ED64
48 89 EA mov rdx, rbp
48 2B 14 24 sub rdx, [rsp]
48 83 C5 02 add rbp, 2
48 D1 FA sar rdx, 1
01 D2 add edx, edx
89 53 78 mov [rbx+78h], edx
0F B7 55 FE movzx edx, word ptr [rbp-2]
44 0F B6 C2 movzx r8d, dl
0F B6 F6 movzx esi, dh
49 63 D0 movsxd rdx, r8d
41 89 F5 mov r13d, esi
48 8B 14 D5 20 51 60 00 mov rdx, ds:opcode_targets_11490[rdx*8]
FF E2 jmp rdx
From this, we can infer that gcc made the following choices:
- fastlocals is stored on the stack at [rsp+38h].
- oparg is stored in the register r13d.
- stack_pointer is stored in the register r14.
- next_instr is stored in the register rbp.
- first_instr is stored on the stack at [rsp].
- f is stored in the register rbx.

We can also spot a number of sad things:
- [rsp+38h] could instead have been rbx+178h (fastlocals is an array member of f).
- movsxd rsi, r13d could be avoided if gcc knew that r13d (oparg) was never negative (oparg is stupidly declared as int rather than unsigned int, and proving that it can never be negative requires a bit of effort and language-lawyering [1]).
- mov rdx, r14 feels silly; gcc should transpose add r14, 8 and mov [rdx], rax, then this instruction wouldn't be required (as a register-register move, it doesn't consume any execution resources, but it still consumes icache space and decoding resources).
- sar rdx, 1 and add edx, edx could be avoided if gcc knew that first_instr and next_instr were always even. The preceding sub could also be avoided if next_instr was maintained as an offset from first_instr rather than as an absolute pointer.
- movsxd rdx, r8d could be avoided if gcc realized that r8d (opcode) was never negative (which it should be able to spot fairly easily: it is the result of a movzx instruction, and thus can only be in the range 0 through 255).
- mov r13d, esi could be avoided by doing movzx esi, dh directly into r13d instead of into esi (except that movzx r13d, dh cannot be expressed in machine code [2], so the compiler would have to swap things around and put oparg in say ebp and next_instr in say r13).
- mov rdx, ds:opcode_targets_11490[rdx*8] and jmp rdx could be merged into a single instruction jmp ds:opcode_targets_11490[rdx*8] (thus saving an instruction decode and a byte or two of icache).

It feels like there are quite a few things which gcc could improve upon. Despite that, it could be the case that gcc is still doing better than other compilers. To find out, we can look at what Clang on OSX x64 does:
49 63 F7 movsxd rsi, r15d
49 8B 84 F6 78 01 00 00 mov rax, [r14+rsi*8+178h]
48 85 C0 test rax, rax
0F 84 51 41 00 00 jz loc_F7B6D
48 FF 00 inc qword ptr [rax]
49 89 45 00 mov [r13+0], rax
49 83 C5 08 add r13, 8
44 8B 3D D2 D5 1A 00 mov r15d, cs:__Py_TracingPossible
45 85 FF test r15d, r15d
0F 85 B7 BB FF FF jnz loc_EF5EE
48 8B 85 48 FE FF FF mov rax, [rbp-1B8h]
48 2B 85 28 FE FF FF sub rax, [rbp-1D8h]
48 D1 F8 sar rax, 1
48 98 cdqe
48 01 C0 add rax, rax
41 89 46 78 mov [r14+78h], eax
48 8B 95 48 FE FF FF mov rdx, [rbp-1B8h]
0F B7 02 movzx eax, word ptr [rdx]
0F B6 D8 movzx ebx, al
0F B6 C4 movzx eax, ah
41 89 C7 mov r15d, eax
48 83 C2 02 add rdx, 2
48 89 95 48 FE FF FF mov [rbp+var_1B8], rdx
48 63 C3 movsxd rax, ebx
48 8D 0D 87 01 13 00 lea rcx, _opcode_targets_11343
48 8B 04 C1 mov rax, [rcx+rax*8]
FF E0 jmp rax
From this, we can infer that clang made the following choices:
- oparg is stored in the register r15d.
- stack_pointer is stored in the register r13.
- next_instr is stored on the stack at [rbp-1B8h].
- first_instr is stored on the stack at [rbp-1D8h].
- f is stored in the register r14.

Again, we can critique this assembly:
- movsxd rsi, r15d could be avoided if clang knew that oparg was never negative.
- fastlocals can be expressed as r14+178h, which is good.
- inc qword ptr [rax] versus add qword ptr [rax], 1 is debatable, though I tend to have a slight preference for add (inc might suffer a partial flags stall, but on the flip side has a slightly shorter encoding).
- r13 for stack_pointer is a shame, as it requires [r13+0] rather than just [r13]. It would have been better to put say f in r13 and stack_pointer in r14.
- Clang puts mov [r13+0], rax and add r13, 8 in the "right" order, thus avoiding the extra mov performed by gcc.
- next_instr should really have been assigned a register rather than a stack slot.
- cdqe, add rax, rax, mov [r14+78h], eax is a disappointing instruction sequence: given the 32-bit store at the end, Clang should have spotted that add rax, rax could be turned into add eax, eax, and thus that cdqe could be dropped entirely.
- mov rdx, [rbp-1B8h] could be avoided by instead doing mov rdx, rax immediately after the preceding mov rax, [rbp-1B8h].
- The mov in the pair movzx eax, ah then mov r15d, eax could be eliminated had Clang chosen a more suitable register for oparg [2].
- movsxd rax, ebx could be avoided if Clang realized that ebx (opcode) was never negative (which it should be able to spot fairly easily: it is the result of a movzx instruction, and thus can only be in the range 0 through 255).
- Clang emits lea rcx, _opcode_targets_11343 rather than fusing the address of the jump table into the next memory load.
- mov rax, [rcx+rax*8] and jmp rax could be fused into a single instruction.

Overall, Clang did some things better than gcc, made some of the same mistakes, and did some things worse than gcc.
Next up is MSVC on Windows x64. This compiler is at a slight disadvantage, as it doesn't support computed goto statements, and instead has to fall back to using a switch statement. Bearing that in mind, the assembly is:
48 63 C3 movsxd rax, ebx
49 8B 8C C7 78 01 00 00 mov rcx, [r15+rax*8+178h]
48 85 C9 test rcx, rcx
0F 84 5C 68 04 00 jz loc_1E0791BF
48 FF 01 inc qword ptr [rcx]
49 89 0C 24 mov [r12], rcx
49 83 C4 08 add r12, 8
4C 89 64 24 48 mov [rsp+48h], r12
E9 77 FF FF FF jmp loc_1E0328EF
loc_1E0328EF:
48 8B C2 mov rax, rdx
49 2B C6 sub rax, r14
48 D1 F8 sar rax, 1
03 C0 add eax, eax
83 3D 8F 2F 33 00 00 cmp cs:_Py_TracingPossible, 0
41 89 47 78 mov [r15+78h], eax
0F 85 DB 3D 04 00 jnz loc_1E0766E6
0F B7 1A movzx ebx, word ptr [rdx]
44 0F B6 EB movzx r13d, bl
C1 EB 08 shr ebx, 8
48 83 C2 02 add rdx, 2
48 89 54 24 40 mov [rsp+40h], rdx
89 5C 24 70 mov dword ptr [rsp+70h], ebx
0F 1F 40 00 nop dword ptr [rax+00h]
db 66h, 66h, 66h
66 0F 1F 84 00 00 00 00 nop word ptr [rax+rax+00000000h]
41 8D 45 FF lea eax, [r13-1]
3D 9D 00 00 00 cmp eax, 9Dh
0F 87 48 6A 04 00 ja loc_1E079387
48 63 C8 movsxd rcx, eax
41 8B 84 88 CC 67 03 00 mov eax, ds:(off_1E0367CC - 1E000000h)[r8+rcx*4]
49 03 C0 add rax, r8
FF E0 jmp rax
For MSVC, we can infer:
- oparg is stored in the register ebx, and also on the stack at [rsp+70h].
- stack_pointer is stored in the register r12, and also on the stack at [rsp+48h].
- next_instr is stored in the register rdx, and also on the stack at [rsp+40h].
- first_instr is stored in the register r14.
- f is stored in the register r15.

As per the established pattern, the critique on MSVC's code is:
- movsxd rax, ebx is, yet again, the fault of oparg being signed.
- fastlocals can be expressed as r15+178h, which is good.
- MSVC chooses inc over add, which is debatable.
- MSVC puts mov [r12], rcx and add r12, 8 in the right order.
- r12 for stack_pointer is good, as it avoids +0 in memory operands.
- An extra jmp loc_1E0328EF is suffered because MSVC lacks computed goto statements.
- f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr)) and if (_Py_TracingPossible) are interleaved slightly, which is cute.
- MSVC avoids Clang's cdqe and add rax, rax, instead opting for add eax, eax. Alas, like gcc, it still doesn't realise that neither the sar rax, 1 nor the add eax, eax are necessary.
- shr ebx, 8 is done instead of movzx ebx, bh - I'd have a slight preference for the latter, but there isn't much in it. At least the choice of registers avoids the gratuitous mov done by gcc and clang.
- The switch statement brings with it a range check: lea eax, [r13-1], cmp eax, 9Dh, ja loc_1E079387.
- movsxd rcx, eax could be avoided if MSVC realized that eax (opcode - 1) was never negative (which it should be able to spot fairly easily: by means of cmp eax, 9Dh, ja loc_1E079387 it has already verified that eax is between 0 and 157 inclusive).
- The jump table contains 32-bit offsets rather than 64-bit pointers, hence the extra add rax, r8 instruction to turn an offset into an absolute pointer.

MSVC ends up being like gcc in some regards, like clang in others, and sometimes unlike either. The lack of computed goto statements is certainly painful though, and accounts for four entries in the critique list.
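To make the switch costs concrete, here is a minimal sketch of switch-based dispatch in C (a hypothetical three-opcode machine, not CPython's real code). The default case is what obliges the compiler to range-check opcode before indexing its jump table, and all opcodes funnel back through a single dispatch point at the top of the loop - hence MSVC's extra cmp/ja and jmp instructions.

```c
#include <stdint.h>

/* Hypothetical three-opcode machine: each instruction is one 16-bit
   word, with the opcode in the low byte and oparg in the high byte. */
enum { OP_PUSH, OP_ADD, OP_RET };

static int interp_switch(const uint16_t *next_instr) {
    int stack[16], *sp = stack;
    for (;;) {
        /* Single shared dispatch point: every opcode jumps back here,
           unlike computed goto, where dispatch is replicated. */
        uint16_t word = *next_instr++;
        int opcode = word & 255;
        int oparg = word >> 8;
        switch (opcode) {
        case OP_PUSH:
            *sp++ = oparg;        /* push the immediate operand */
            break;
        case OP_ADD:
            sp--;
            sp[-1] += *sp;        /* pop two values, push their sum */
            break;
        case OP_RET:
            return sp[-1];
        default:
            /* This case is why the compiler must emit a cmp/ja range
               check before the table jump: opcode might not match any
               case label. */
            return -1;
        }
    }
}
```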
Having bashed the three major compilers for being imperfect, I'm now obliged to provide what I think is the perfect assembly code for this opcode - if I was writing the CPython interpreter main loop in assembly [3] then this is what I'd write for LOAD_FAST
:
48 8B 94 CB 00 01 00 00 mov rdx, [rbx+rcx*8+100h]
41 0F B7 04 2E movzx eax, word ptr [r14+rbp]
48 85 D2 test rdx, rdx
0F 84 F7 12 00 00 jz loc_1E00696D
49 89 14 24 mov [r12], rdx
49 83 C4 08 add r12, 8
89 2B mov [rbx], ebp
48 83 02 01 add qword ptr [rdx], 1
41 F6 47 F8 01 test byte ptr [r15-8], 1
0F 85 8F BA FF FF jnz loc_1E00111E
0F B6 CC movzx ecx, ah
0F B6 C0 movzx eax, al
83 C5 02 add ebp, 2
41 FF 24 C7 jmp qword ptr [r15+rax*8]
My assembly makes the following register assignment choices:
- oparg is stored in the register ecx (at least on Windows; on other platforms it would be in edi - in both cases this happens to coincide with the register which the normal calling convention uses for the first argument).
- stack_pointer is stored in the register r12.
- next_instr, as an offset from first_instr, is stored in the register ebp.
- first_instr is stored in the register r14.
- f+78h is stored in the register rbx (in particular, this means that accesses to f->f_lasti don't need a displacement, and some fields toward the end of PyFrameObject can be accessed with a one-byte displacement rather than a four-byte displacement).
- The address of the jump table is stored in the register r15.

The combination of storing next_instr
as an offset and keeping f
biased by offsetof(PyFrameObject, f_lasti)
means that f->f_lasti = (sizeof(uint16_t) * (int)(next_instr - first_instr))
is two bytes / one instruction, versus 19 bytes / six instructions for gcc. Keeping f
biased has no downside, and has the occasional other upside (some fields toward the end of PyFrameObject can be accessed with a one-byte displacement rather than a four-byte displacement). Storing next_instr
as an offset has the minor downside of making the *next_instr
memory operand slightly more complex ([r14+rbp]
rather than [rbp]
), but this is a very low cost, and the offset approach also makes certain jump-related opcodes slightly cleaner and avoids a REX prefix on next_instr++
. Keeping the jump table address in r15
is expensive (as POSIX x64 only has six non-volatile registers, and this burns one of those six for a runtime constant), but makes opcode dispatch cheap (which is important, given that dispatch is replicated into all 100+ opcodes), and has some upsides (e.g. rip
-relative lea
instructions can instead be r15
-relative, and thus be executed on a wider range of ports). I also change _Py_TracingPossible
from being a 32-bit variable to being a 1-bit variable, and put this variable just before the jump table (so that it can be addressed with a one-byte offset from r15
). The other notable thing to point out is pulling word = *next_instr
up towards the start of the instruction stream - I want to give the CPU as much time as possible to perform that load, as it is critical for control-flow.
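The f_lasti saving can be sketched in C (illustrative functions; the names are mine, not CPython's). Both representations produce the same value, but the offset form needs no arithmetic at all at the point of the store:

```c
#include <stdint.h>

/* Pointer representation, as CPython's C source has it: next_instr is
   an absolute pointer, so f_lasti needs a subtract, a divide by the
   element size, and a re-multiply - the sub/sar/add sequence seen in
   the gcc listing. */
static int lasti_from_pointers(const uint16_t *first_instr,
                               const uint16_t *next_instr) {
    return (int)(sizeof(uint16_t) * (int)(next_instr - first_instr));
}

/* Offset representation: next_instr is maintained as a byte offset
   from first_instr (incremented by 2 instead of by one element), so
   f->f_lasti is just a plain store of that offset. */
static int lasti_from_offset(int next_instr_offset) {
    return next_instr_offset;
}
```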
That is one opcode - LOAD_FAST
- considered in detail. Only 100+ other opcodes to go...
[1] There are two kinds of assignment to oparg
: one kind we've already seen, namely oparg = word >> 8
, which fairly obviously can't make oparg
negative. The other kind is in the EXTENDED_ARG
opcode, which does oparg |= oldoparg << 8;
: we have to appeal to language lawyering to claim that oldoparg
being non-negative implies that oldoparg << 8
is non-negative (signed overflow is undefined and all that). It is then one simple step to claim that oparg
being non-negative and oldoparg << 8
being non-negative implies oparg | (oldoparg << 8)
is non-negative.
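The two kinds of assignment can be condensed into a sketch (illustrative code, not CPython's actual NEXTOPARG macro or EXTENDED_ARG handler): each EXTENDED_ARG prefix shifts the accumulated argument up by eight bits before the next word's high byte is merged in.

```c
#include <stdint.h>

/* Sketch of CPython 3.6 wordcode oparg decoding: each instruction is
   a 16-bit word with the opcode in the low byte and an 8-bit argument
   in the high byte; EXTENDED_ARG words supply the higher bytes. */
static int decode_oparg(const uint16_t *words, int nwords) {
    int oparg = 0;
    for (int i = 0; i < nwords; i++) {
        int oldoparg = oparg;
        oparg = words[i] >> 8;    /* oparg = word >> 8 */
        oparg |= oldoparg << 8;   /* EXTENDED_ARG: oparg |= oldoparg << 8 */
    }
    return oparg;
}
```

The non-negativity argument above is about exactly this accumulation: words[i] >> 8 is always in the range 0 through 255, so, absent signed overflow, every value oparg takes on is non-negative.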
[2] The ah
/bh
/ch
/dh
registers can only be accessed if a REX prefix is not used. The r8
through r15
registers can only be accessed if a REX prefix is used. QED.
[3] If I was writing the CPython interpreter main loop in assembly. If. I mean, I'd have to be crazy to write that much assembly...