GH-135379: Top of stack caching for the JIT. #135465
Conversation
markshannon commented Jun 13, 2025 • edited by bedevere-app bot
Fidget-Spinner left a comment
This is really cool. I'll do a full review soon enough.
78489ea to 2850d72 Compare
markshannon commented Jun 20, 2025
Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac where it shows no speedup.
Fidget-Spinner commented Jun 20, 2025
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-20-16-03-59.gh-issue-135379.eDg89T.rst (outdated; resolved)
Fidget-Spinner left a comment
I need to review the cases generator later.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-13-13-32-16.gh-issue-135379.pAxZgy.rst (outdated; resolved)
Fidget-Spinner left a comment
Stylistic nits
markshannon commented Jun 26, 2025
Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.
Fidget-Spinner left a comment
I was discussing this with Brandt the other day, and I concluded that we need this to implement decref elimination with a lower maintenance burden.
For decref elimination, there are two options:
1. Manually handwrite (or have the cases generator emit) a version of the instruction that has no PyStackRef_CLOSE/DECREF_INPUTS.
2. Specialize for POP_TOP.
Option 2 is a lot more maintainable in the long run and doesn't involve any more cases generator magic like option 1 does. It's also less likely to introduce a bug, again because there's less cases generator magic involved. So I'm more inclined towards option 2.
Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases generator magic again 😉.
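For readers following along, here is a rough, self-contained C sketch of the difference between the two options. It is only a toy model: `Obj`, `pop_top`, and `pop_top_known_borrowed` are invented names, and the real mechanism lives in CPython's uops and PyStackRef machinery, which is not reproduced here.

```c
#include <stdio.h>

/* Toy refcounted object; stands in for PyObject/_PyStackRef. */
typedef struct {
    int refcount;
} Obj;

static void obj_decref(Obj *o) {
    if (--o->refcount == 0) {
        printf("freed\n");      /* a real object would be deallocated here */
    }
}

/* Generic POP_TOP: the popped reference is always released (option 1 would
 * instead emit whole instruction bodies with this call stripped out). */
static void pop_top(Obj *popped) {
    obj_decref(popped);
}

/* Specialized POP_TOP (option 2): only emitted when the optimizer has proven
 * that releasing the reference is unnecessary (e.g. the value is immortal or
 * the reference is borrowed), so the body does no refcount work at all. */
static void pop_top_known_borrowed(Obj *popped) {
    (void)popped;
}

int main(void) {
    Obj o = { .refcount = 2 };   /* one owned reference plus one borrowed use */
    pop_top(&o);                 /* releases the owned reference: 2 -> 1 */
    pop_top_known_borrowed(&o);  /* no refcount traffic at all */
    printf("refcount = %d\n", o.refcount);   /* prints 1 */
    return 0;
}
```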
brandtbucher commented Jun 26, 2025
I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without a more certain payoff. A "13%-18% faster" result on one benchmark isn't enough on its own: I understand that this should be faster, but the numbers don't show that it is, generally. At least not enough to justify the additional complexity.
I'm not sure that's the case. Yesterday, we benchmarked this approach together:
So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.
Fidget-Spinner commented Jun 26, 2025
@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all for the whole interpreter unless you let me go ham with the DSL.
brandtbucher commented Jun 26, 2025
But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).
Fidget-Spinner commented Jun 26, 2025
I can't say for sure whether the effect is the same.
Fidget-Spinner commented Jun 26, 2025 • edited
One suspicion I have from examining the nbody traces is that decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so we spill regardless of whether we decref-eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So at the moment, decref elimination isn't doing anything for the TOS caching. On the other hand, we are not actually increasing the number of spills either: the maximum number of spills should stay the same regardless of decref elimination. So the benchmark results are a little suspect to me.
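To illustrate that last point with a toy model (the constant and the `count_spills` helper below are invented for illustration and have nothing to do with the actual stack layout around _BINARY_OP_SUBSCR_LIST_INT): with a fixed number of cached top-of-stack slots, the number of forced spills depends only on how many values have to be live at once, and eliminating a later decref on one of those values doesn't change that count.

```c
#include <stdio.h>

#define TOS_CACHE_SIZE 3   /* number of cached top-of-stack slots in this toy */

/* Count how many spills a straight-line sequence of pushes would cause if at
 * most TOS_CACHE_SIZE values can stay cached. Whether one of those values is
 * later decref'd, or that decref is eliminated, never enters the calculation:
 * spills are forced purely by how many values are live at the same time. */
static int count_spills(int peak_live_values) {
    int spills = 0;
    int cached = 0;
    for (int i = 0; i < peak_live_values; i++) {
        if (cached == TOS_CACHE_SIZE) {
            spills++;       /* the deepest cached value goes back to memory */
        }
        else {
            cached++;
        }
    }
    return spills;
}

int main(void) {
    /* e.g. loop iterator + list + index + one temporary all live at once */
    printf("4 live values -> %d spill(s)\n", count_spills(4));
    printf("3 live values -> %d spill(s)\n", count_spills(3));
    return 0;
}
```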
Fidget-Spinner commented Jun 26, 2025
One more reason why I'm a little suspicious of our benchmarks: the JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates about 2% week to week (https://github.com/facebookexperimental/free-threading-benchmarking); on the MS runner, it's slightly better at about 1% (https://github.com/faster-cpython/benchmarking-public). Yet we're basing our decision on these numbers, which I can't say I fully trust.
Fidget-Spinner commented Jun 26, 2025
@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:
All of them are standard benchmarking builds, i.e. PGO, LTO, and JIT. These are the results for nbody: So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
Fidget-Spinner commented Jul 3, 2025 • edited
The Meta benchmarking results concur with my own results above:
Most importantly:
Fidget-Spinner commented Aug 22, 2025
The tail-calling CI seems to be failing because Homebrew changed where it installs clang (yet again). I'll put up a separate PR to fix that.
Fidget-Spinner commented Aug 22, 2025
OK, I fixed the macOS CI on main. Please pull the changes in.
markshannon commented Aug 22, 2025
I thought that caching through side exits would speed things up, but it looks like it slows things down a bit, if anything. So I've reverted that change. Will rerun the benchmarks to confirm...
markshannon commented Oct 19, 2025
I was hoping to get a clear across-the-board speedup before merging this.
savannahostrowski left a comment
Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.
Just one comment about the type change in pycore_optimizer.h
Fidget-Spinner commented Nov 10, 2025
Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling due to the loop iterator taking up some space.
markshannon commented Nov 12, 2025 • edited
I think we will want to vary the cache size depending on both hardware and operating system. All that is for later PR(s), though.
7d47f13 to 4fec38d Compare
markshannon commented Dec 10, 2025
Apologies for the force push. This needed manual rebasing after the changes to the JIT frontend.
Fidget-Spinner left a comment
I've read the code and mostly understood it, since I had to rebase it a few times onto my own branch.
These are the latest benchmarking results for this branch https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251210-3.15.0a2%2B-4b24c15-JIT/bm-20251210-vultr-x86_64-faster%252dcpython-tier_2_tos_caching-3.15.0a2%2B-4b24c15-vs-base.md
The 19% nbody speedup is still there, along with a modest ~7% speedup for richards.
So this bodes very well for the JIT. Considering it will unlock future optimizations in decref elimination, which would be very hard without this, we should just go ahead and merge it.
469f191 into python:main
Uses three registers to cache values at the top of the evaluation stack. This significantly reduces memory traffic for smaller, more common uops.
The stats need fixing and the generated tables could be more compact, but it works.
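As a rough illustration of what "three registers caching the top of the stack" means, here is a small, self-contained C toy. All of the names (`tos_cache`, `spill_one`, `push`, `pop`) are invented; the real implementation keeps the cached values in machine registers inside jitted code and emits dedicated spill/reload uops, rather than using a C array like this.

```c
#include <stdio.h>

#define STACK_SIZE 64
#define TOS_CACHE_SIZE 3       /* mirrors the three registers used by the PR */

static long stack_memory[STACK_SIZE];  /* the in-memory evaluation stack */
static int stack_depth = 0;            /* items currently spilled to memory */

static long tos_cache[TOS_CACHE_SIZE]; /* stands in for the cached registers */
static int cached = 0;                 /* how many values live in the cache */

/* Write the deepest cached value back to the in-memory stack (a "spill"). */
static void spill_one(void) {
    stack_memory[stack_depth++] = tos_cache[0];
    for (int i = 1; i < cached; i++) {
        tos_cache[i - 1] = tos_cache[i];
    }
    cached--;
}

/* Push a value, spilling the deepest cached value if the cache is full. */
static void push(long v) {
    if (cached == TOS_CACHE_SIZE) {
        spill_one();
    }
    tos_cache[cached++] = v;
}

/* Pop a value, reloading from memory only if nothing is cached (a "reload"). */
static long pop(void) {
    if (cached == 0) {
        return stack_memory[--stack_depth];
    }
    return tos_cache[--cached];
}

int main(void) {
    /* 2 + 3: both operands stay in the cache, so this "binary op" never
       touches stack_memory at all -- that is the memory traffic being saved. */
    push(2);
    push(3);
    long rhs = pop();
    long lhs = pop();
    push(lhs + rhs);
    printf("%ld\n", pop());   /* prints 5 */
    return 0;
}
```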