My working branch is here main...aconz2:cpython:aconz2/early-tail-call-load
I saw the recent merge of the tail call interpreter (#128718), very nice! I have played with this style of interpreter before, and one thing that comes up is when to calculate the target address. As it stands, the current interpreter does it in `DISPATCH()`, which expands to roughly:

```c
DEF_TARGET(foo) {
    // ...
    TAIL return INSTRUCTION_TABLE[opcode](ARGS);
}
```

This results in assembly like:
```asm
0000000000289580 <_TAIL_CALL_GET_LEN>:
  289580: 50                            push rax
  289581: 89 fb                         mov  ebx, edi
  289583: 4d 89 7c 24 38                mov  qword ptr [r12+0x38], r15
  289588: 49 83 c7 02                   add  r15, 0x2
  28958c: 49 8b 7d f8                   mov  rdi, qword ptr [r13-0x8]
  289590: 4d 89 6c 24 40                mov  qword ptr [r12+0x40], r13
  289595: e8 f6 fc ea ff                call 0x139290 <PyObject_Size>
  28959a: 4d 8b 6c 24 40                mov  r13, qword ptr [r12+0x40]
  28959f: 49 c7 44 24 40 00 00 00 00    mov  qword ptr [r12+0x40], 0x0
  2895a8: 48 85 c0                      test rax, rax
  2895ab: 78 2b                         js   0x2895d8 <_TAIL_CALL_GET_LEN+0x58>
  2895ad: 48 89 c7                      mov  rdi, rax
  2895b0: e8 eb 54 ec ff                call 0x14eaa0 <PyLong_FromSsize_t>
  2895b5: 48 85 c0                      test rax, rax
  2895b8: 74 1e                         je   0x2895d8 <_TAIL_CALL_GET_LEN+0x58>
  2895ba: 49 89 45 00                   mov  qword ptr [r13], rax
  2895be: 49 83 c5 08                   add  r13, 0x8
  2895c2: 41 0f b7 3f                   movzx edi, word ptr [r15]      # <-- load next_instr
  2895c6: 40 0f b6 c7                   movzx eax, dil                 # <-- grab opcode
  2895ca: c1 ef 08                      shr  edi, 0x8
  2895cd: 48 8d 0d 7c 50 1f 00          lea  rcx, [rip+0x1f507c]       # 0x47e650 <INSTRUCTION_TABLE>
  2895d4: 5a                            pop  rdx
  2895d5: ff 24 c1                      jmp  qword ptr [rcx+8*rax]     # <-- jmp with addr calculation
  2895d8: 89 df                         mov  edi, ebx
  2895da: 58                            pop  rax
  2895db: e9 30 dc ff ff                jmp  0x287210 <_TAIL_CALL_error>
```

where we jmp to a computed address that depends on the `lea` and a memory load only a few instructions prior.
Another method looks like:

```c
DEF_TARGET(foo) {
    // ...
    tail_funcptr next_f = INSTRUCTION_TABLE[next_opcode];
    // ...
    TAIL return next_f(ARGS);
}
```

where we try to get the compiler to compute the target earlier and then emit a `jmp reg`. We have to pay special attention to places where `next_instr` is modified and reload the pointer there (though hopefully the optimizer will just defer the calculation to the latest possible place).
In this early branch, I was able to get this working well enough to see what asm it generates. For `_TAIL_CALL_GET_LEN`, the sequence now looks like:

```asm
00000000002896b0 <_TAIL_CALL_GET_LEN>:
  2896b0: 55                            push rbp
  2896b1: 89 fb                         mov  ebx, edi
  2896b3: 4d 89 7c 24 38                mov  qword ptr [r12+0x38], r15
  2896b8: 41 0f b6 47 02                movzx eax, byte ptr [r15+0x2]  # <-- load next instr opcode
  2896bd: 49 83 c7 02                   add  r15, 0x2
  2896c1: 48 8d 0d 88 5f 1f 00          lea  rcx, [rip+0x1f5f88]       # 0x47f650 <INSTRUCTION_TABLE>
  2896c8: 48 8b 2c c1                   mov  rbp, qword ptr [rcx+8*rax] # <-- load next target addr
  2896cc: 49 8b 7d f8                   mov  rdi, qword ptr [r13-0x8]
  2896d0: 4d 89 6c 24 40                mov  qword ptr [r12+0x40], r13
  2896d5: e8 b6 fb ea ff                call 0x139290 <PyObject_Size>
  2896da: 4d 8b 6c 24 40                mov  r13, qword ptr [r12+0x40]
  2896df: 49 c7 44 24 40 00 00 00 00    mov  qword ptr [r12+0x40], 0x0
  2896e8: 48 85 c0                      test rax, rax
  2896eb: 78 20                         js   0x28970d <_TAIL_CALL_GET_LEN+0x5d>
  2896ed: 48 89 c7                      mov  rdi, rax
  2896f0: e8 ab 53 ec ff                call 0x14eaa0 <PyLong_FromSsize_t>
  2896f5: 48 85 c0                      test rax, rax
  2896f8: 74 13                         je   0x28970d <_TAIL_CALL_GET_LEN+0x5d>
  2896fa: 49 89 45 00                   mov  qword ptr [r13], rax
  2896fe: 49 83 c5 08                   add  r13, 0x8
  289702: 41 0f b6 7f 01                movzx edi, byte ptr [r15+0x1]
  289707: 48 89 e8                      mov  rax, rbp                  # <-- register rename
  28970a: 5d                            pop  rbp
  28970b: ff e0                         jmp  rax                       # <-- jmp to target addr
  28970d: 89 df                         mov  edi, ebx
  28970f: 5d                            pop  rbp
  289710: e9 fb da ff ff                jmp  0x287210 <_TAIL_CALL_error>
  289715: 66 66 2e 0f 1f 84 00 00 00 00 00  nop word ptr cs:[rax+rax]
```

The key lines are the early table load:

```asm
  2896c1: 48 8d 0d 88 5f 1f 00          lea  rcx, [rip+0x1f5f88]       # 0x47f650 <INSTRUCTION_TABLE>
  2896c8: 48 8b 2c c1                   mov  rbp, qword ptr [rcx+8*rax]
```

Specifically in this case, both `PyObject_Size` and `PyLong_FromSsize_t` don't touch `rbp`, so there isn't any additional register pressure. But I haven't looked extensively, so this may not be universally true.
My theory is that this could be better for the CPU because, in this example, once control returns from `PyLong_FromSsize_t` the jump target is already sitting in a register, which might let the CPU prefetch the target better. I have not benchmarked anything yet.
Looking at another example, `_TAIL_CALL_BINARY_OP_SUBSCR_GETITEM` does a `LOAD_IP()` towards the end, so we have to reload our target address there. The optimizer does seem smart enough to avoid a double load, but this just ends up with an almost identical ending:

```asm
# main
  28eba5: 41 0f b7 3f                   movzx edi, word ptr [r15]
  28eba9: 40 0f b6 cf                   movzx ecx, dil
  28ebad: c1 ef 08                      shr  edi, 0x8
  28ebb0: 48 8d 15 99 fa 1e 00          lea  rdx, [rip+0x1efa99]       # 0x47e650 <INSTRUCTION_TABLE>
  28ebb7: 49 89 c4                      mov  r12, rax
  28ebba: ff 24 ca                      jmp  qword ptr [rdx+8*rcx]

# this branch
  28f185: 41 0f b6 0f                   movzx ecx, byte ptr [r15]
  28f189: 48 8d 15 c0 04 1f 00          lea  rdx, [rip+0x1f04c0]       # 0x47f650 <INSTRUCTION_TABLE>
  28f190: 41 0f b6 7f 01                movzx edi, byte ptr [r15+0x1]
  28f195: 49 89 c4                      mov  r12, rax
  28f198: ff 24 ca                      jmp  qword ptr [rdx+8*rcx]
```

I did this a bit haphazardly through a combination of modifying macros and manual changes to everything that assigns to `next_instr`, plus a few special cases like `exit_unwind` that didn't fit the pattern. I could clean it up with some direction.
One super naive metric is:

```shell
# note: there should be a tab after jmp
llvm-objdump --x86-asm-syntax=intel -D python | grep 'jmp r' | wc -l
```

which gives 916 for this modification and 731 originally, so 185 more places where we jmp to a register instead of through a computed address.
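To sanity-check the grep pattern without the full binary, here is the same filter over a tiny inline sample standing in for the `llvm-objdump` output (llvm-objdump puts a tab between the mnemonic and its operand; the `$'...'` quoting assumes bash):

```shell
# One line each of: register-indirect jmp, table-indirect jmp, direct jmp.
sample=$'jmp\trax\njmp\tqword ptr [rcx+8*rax]\njmp\t0x287210'
printf '%s\n' "$sample" | grep -c $'jmp\tr'      # jmp through a register
printf '%s\n' "$sample" | grep -c $'jmp\tqword'  # jmp through a computed address
```

The second count could be run against both builds the same way as the first, to confirm the register-indirect jumps are replacing table-indirect ones rather than being added on top.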