(Sorry, this is unfinished, I'll continue writing another time.)
The next thing to try in the Tier 2 interpreter is "trace stitching". (I'm sure there is talk about this in the faster-cpython/ideas repo, but I can't be bothered to look for it right now.)
This is also connected to gh-108866, which proposes that executors are required to make progress.
I think there are two somewhat different use cases (possibly more). The Tier 2 interpreter can stop (falling back to the Tier 1 interpreter) for several reasons:
- An exception was raised
- An
_EXIT_TRACE instruction is executed (there are several reasons why these are inserted into a trace, e.g. lack of "on-trace" confidence, an untranslatable Tier 1 instruction, or running out of buffer space during trace generation)
- A conditional branch takes the less likely path
- A guard fails (the
DEOPT_IF() macro)
The latter two are interesting, and different (even though they currently use the exact same exit path, via the deoptimize label).
Conditional branches
During trace creation, we use an estimation of whether the true or false path is take, and continue the trace at the more likely path (whether that's the branch target or not). Let's say we have a POP_JUMP_IF_TRUE instruction and the condition is most likely false. The trace then continues with the non-branching path (the instruction following POP_JUMP_IF_TRUE). When during trace execution the condition is found to be true (the less likely condition in this example), we currently exit to the jump target in the Tier 1 interpreter.
However, if the condition is true somewhat frequently (say, 40% of the time), we should eventually detect this as a "hot" exit. To detect this, we could have a counter associated with exiting the Tier 2 interpreter at this point. (Other strategies have also been proposed.) Once we decide the exit is "hot", we create a new trace, in this example starting at the branch target. This new executor can be inserted into the Tier 1 stream using an ENTER_EXECUTOR instruction at that target. Then, when the hot exit is taken the next time, we exit to the Tier 1 interpreter, immediately encounter the newly inserted ENTER_EXECUTOR instruction, and start executing the trace in that executor.
It would be nice if instead of bouncing through the Tier 1 interpreter, the Tier 2 interpreter would be able to just pass control to the new executor directly. This can likely be arranged using function pointers.
Let's analyze an example a bit more.
def test():
for i in range(2, 1000000):
if is_prime(i):
print(i)
Click here for details
The loop translates to:
L1: FOR_ITER 32 (to L3)
STORE_FAST 0 (i)
10 LOAD_GLOBAL 3 (is_prime + NULL)
LOAD_FAST 0 (i)
CALL 1
TO_BOOL
POP_JUMP_IF_TRUE 2 (to L2)
JUMP_BACKWARD 21 (to L1)
11 L2: LOAD_GLOBAL 5 (print + NULL)
LOAD_FAST 0 (i)
CALL 1
POP_TOP
JUMP_BACKWARD 34 (to L1)
After specialization it becomes:
L1: FOR_ITER_RANGE 32 (to L3)
STORE_FAST 0 (i)
10 LOAD_GLOBAL_MODULE 3 (is_prime + NULL)
LOAD_FAST 0 (i)
CALL_PY_EXACT_ARGS 1
TO_BOOL_BOOL
POP_JUMP_IF_TRUE 2 (to L2)
JUMP_BACKWARD 21 (to L1)
11 L2: LOAD_GLOBAL_BUILTIN 5 (print + NULL)
LOAD_FAST 0 (i)
CALL_BUILTIN_FAST_WITH_KEYWORDS 1
POP_TOP
JUMP_BACKWARD 34 (to L1)
When I turn on the Tier 2 translator and interpreter, I get the following trace (leaving out what happens in is_prime(), though we trace through it):
0 _SET_IP(13, 13, 0)
1 _ITER_CHECK_RANGE(32, 13, 0)
2 _GUARD_NOT_EXHAUSTED_RANGE(32, 13, 0)
3 _ITER_NEXT_RANGE(32, 13, 0)
4 _CHECK_VALIDITY(0, 15, 0)
5 STORE_FAST(0, 15, 0)
6 _GUARD_GLOBALS_VERSION(3, 16, 40)
7 _LOAD_GLOBAL_MODULE(3, 16, 10)
8 LOAD_FAST(0, 21, 0)
9 _SET_IP(22, 22, 0)
10 _CHECK_PEP_523(1, 22, 0)
11 _CHECK_FUNCTION_EXACT_ARGS(1, 22, 712)
12 _CHECK_STACK_SPACE(1, 22, 0)
13 _INIT_CALL_PY_EXACT_ARGS(1, 22, 0)
14 _SAVE_RETURN_OFFSET(4, 22, 0)
15 _PUSH_FRAME(1, 22, 0)
...
99 _POP_FRAME(0, 101, 0)
100 _CHECK_VALIDITY(0, 26, 0)
101 TO_BOOL_BOOL(0, 26, 0)
102 _GUARD_IS_FALSE_POP(512, 30, 0)
103 _JUMP_TO_TOP(0, 0, 0)
The key thing I'd like to call out is the _GUARD_IS_FALSE_POP opcode at instruction 102. When we get there with a prime number that instruction exits to the Tier 1 interpreter where we continue at L2.
Interestingly, because the body of the if is translated into a little block also ending in JUMP_BACKWARD, a second executor is created from that starting point. Unfortunately, it is nearly the same trace as before (though the path through my fake is_prime() is a little different), and because its ENTER_EXECUTOR instruction overwrites that second JUMP_BACKWARD, it is pretty ineffective. (Honestly, it looks like the trace almost always exits from inside the fake is_prime(), which is very branchy.)
It would be nicer if we could turn this little block into a separate trace:
11 L2: LOAD_GLOBAL_BUILTIN 5 (print + NULL)
LOAD_FAST 0 (i)
CALL_BUILTIN_FAST_WITH_KEYWORDS 1
POP_TOP
JUMP_BACKWARD 34 (to L1)
Unfortunately the Tier 2 translation of LOAD_GLOBAL_BUILTIN has several guards that may fail:
macro(LOAD_GLOBAL_BUILTIN) =
unused/1 + // Skip over the counter
_GUARD_GLOBALS_VERSION +
_GUARD_BUILTINS_VERSION +
_LOAD_GLOBAL_BUILTINS;
(For pragmatic reasons, even the "action" _LOAD_GLOBAL_BUILTINS uses DEOPT_IF()!)
So we're back balking at the requirement that executors must make progress. (Which is not actually implemented, but asserted by gh-108866.)
Linked PRs
(Sorry, this is unfinished, I'll continue writing another time.)
The next thing to try in the Tier 2 interpreter is "trace stitching". (I'm sure there is talk about this in the faster-cpython/ideas repo, but I can't be bothered to look for it right now.)
This is also connected to gh-108866, which proposes that executors are required to make progress.
I think there are two somewhat different use cases (possibly more). The Tier 2 interpreter can stop (falling back to the Tier 1 interpreter) for several reasons:
_EXIT_TRACEinstruction is executed (there are several reasons why these are inserted into a trace, e.g. lack of "on-trace" confidence, an untranslatable Tier 1 instruction, or running out of buffer space during trace generation)DEOPT_IF()macro)The latter two are interesting, and different (even though they currently use the exact same exit path, via the
deoptimizelabel).Conditional branches
During trace creation, we use an estimation of whether the true or false path is take, and continue the trace at the more likely path (whether that's the branch target or not). Let's say we have a
POP_JUMP_IF_TRUEinstruction and the condition is most likely false. The trace then continues with the non-branching path (the instruction followingPOP_JUMP_IF_TRUE). When during trace execution the condition is found to be true (the less likely condition in this example), we currently exit to the jump target in the Tier 1 interpreter.However, if the condition is true somewhat frequently (say, 40% of the time), we should eventually detect this as a "hot" exit. To detect this, we could have a counter associated with exiting the Tier 2 interpreter at this point. (Other strategies have also been proposed.) Once we decide the exit is "hot", we create a new trace, in this example starting at the branch target. This new executor can be inserted into the Tier 1 stream using an
ENTER_EXECUTORinstruction at that target. Then, when the hot exit is taken the next time, we exit to the Tier 1 interpreter, immediately encounter the newly insertedENTER_EXECUTORinstruction, and start executing the trace in that executor.It would be nice if instead of bouncing through the Tier 1 interpreter, the Tier 2 interpreter would be able to just pass control to the new executor directly. This can likely be arranged using function pointers.
Let's analyze an example a bit more.
Click here for details
The loop translates to:
After specialization it becomes:
When I turn on the Tier 2 translator and interpreter, I get the following trace (leaving out what happens in
is_prime(), though we trace through it):The key thing I'd like to call out is the
_GUARD_IS_FALSE_POPopcode at instruction 102. When we get there with a prime number that instruction exits to the Tier 1 interpreter where we continue atL2.Interestingly, because the body of the
ifis translated into a little block also ending inJUMP_BACKWARD, a second executor is created from that starting point. Unfortunately, it is nearly the same trace as before (though the path through my fakeis_prime()is a little different), and because itsENTER_EXECUTORinstruction overwrites that secondJUMP_BACKWARD, it is pretty ineffective. (Honestly, it looks like the trace almost always exits from inside the fakeis_prime(), which is very branchy.)It would be nicer if we could turn this little block into a separate trace:
Unfortunately the Tier 2 translation of
LOAD_GLOBAL_BUILTINhas several guards that may fail:(For pragmatic reasons, even the "action"
_LOAD_GLOBAL_BUILTINSusesDEOPT_IF()!)So we're back balking at the requirement that executors must make progress. (Which is not actually implemented, but asserted by gh-108866.)
Linked PRs
_GUARD_IS_TRUE_POPside-exits to target the next instruction, not themselves. #114078END_FORinstruction to only pop one value. #114247