GH-135379: Top of stack caching for the JIT. #135465
Conversation
markshannon commented Jun 13, 2025 • edited by bedevere-app bot
Fidget-Spinner left a comment
This is really cool. I'll do a full review soon enough.
78489ea to 2850d72 Compare
markshannon commented Jun 20, 2025
Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac where it shows no speedup.
Fidget-Spinner commented Jun 20, 2025
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-20-16-03-59.gh-issue-135379.eDg89T.rst (outdated; resolved)
Fidget-Spinner left a comment
I need to review the cases generator later.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-13-13-32-16.gh-issue-135379.pAxZgy.rst (outdated; resolved)
Fidget-Spinner left a comment
Stylistic nits
markshannon commented Jun 26, 2025
Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.
Fidget-Spinner left a comment
I was discussing this with Brandt the other day, and I concluded that we need this to implement decref elimination with a lower maintenance burden.
For decref elimination, there are two options:
1. Manually handwrite (or have the cases generator emit) a version of the instruction that has no PyStackRef_CLOSE/DECREF_INPUTS.
2. Specialize for POP_TOP.
Option 2 is a lot more maintainable in the long run and doesn't involve any more cases generator magic like option 1 does. It's also less likely to introduce a bug, again because there's less cases generator magic involved. So I'm more inclined towards option 2.
Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases generator magic again 😉.
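For readers following along, here is a rough, self-contained C sketch of the difference between the two options. It is only a toy model: `Obj`, `pop_top`, and `pop_top_known_borrowed` are invented names, and the real mechanism lives in CPython's uops and PyStackRef machinery, which is not reproduced here.

```c
#include <stdio.h>

/* Toy refcounted object; stands in for PyObject/_PyStackRef. */
typedef struct {
    int refcount;
} Obj;

static void obj_decref(Obj *o) {
    if (--o->refcount == 0) {
        printf("freed\n");      /* a real object would be deallocated here */
    }
}

/* Generic POP_TOP: the popped reference is always released (option 1 would
 * instead emit whole instruction bodies with this call stripped out). */
static void pop_top(Obj *popped) {
    obj_decref(popped);
}

/* Specialized POP_TOP (option 2): only emitted when the optimizer has proven
 * that releasing the reference is unnecessary (e.g. the value is immortal or
 * the reference is borrowed), so the body does no refcount work at all. */
static void pop_top_known_borrowed(Obj *popped) {
    (void)popped;
}

int main(void) {
    Obj o = { .refcount = 2 };   /* one owned reference plus one borrowed use */
    pop_top(&o);                 /* releases the owned reference: 2 -> 1 */
    pop_top_known_borrowed(&o);  /* no refcount traffic at all */
    printf("refcount = %d\n", o.refcount);   /* prints 1 */
    return 0;
}
```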
brandtbucher commented Jun 26, 2025
I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without a more certain payoff. A "13%-18% faster" result on one benchmark isn't enough on its own: I understand that this should be faster, but the numbers don't show that it is, generally. At least not enough to justify the additional complexity.
I'm not sure that's the case. Yesterday, we benchmarked this approach together:
So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.
Fidget-Spinner commented Jun 26, 2025
@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all for the whole interpreter unless you let me go ham with the DSL.
brandtbucher commented Jun 26, 2025
But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).
Fidget-Spinner commented Jun 26, 2025
I can't say for sure whether the effect is the same.
Fidget-Spinner commented Jun 26, 2025 • edited
One suspicion I have from examining the nbody traces is that decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so we spill regardless of whether we decref-eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So at the moment, decref elimination isn't doing anything for the TOS caching. On the other hand, we are not actually increasing the number of spills either: the maximum number of spills should stay the same regardless of decref elimination. So the benchmark results are a little suspect to me.
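To illustrate that last point with a toy model (the constant and the `count_spills` helper below are invented for illustration and have nothing to do with the actual stack layout around _BINARY_OP_SUBSCR_LIST_INT): with a fixed number of cached top-of-stack slots, the number of forced spills depends only on how many values have to be live at once, and eliminating a later decref on one of those values doesn't change that count.

```c
#include <stdio.h>

#define TOS_CACHE_SIZE 3   /* number of cached top-of-stack slots in this toy */

/* Count how many spills a straight-line sequence of pushes would cause if at
 * most TOS_CACHE_SIZE values can stay cached. Whether one of those values is
 * later decref'd, or that decref is eliminated, never enters the calculation:
 * spills are forced purely by how many values are live at the same time. */
static int count_spills(int peak_live_values) {
    int spills = 0;
    int cached = 0;
    for (int i = 0; i < peak_live_values; i++) {
        if (cached == TOS_CACHE_SIZE) {
            spills++;       /* the deepest cached value goes back to memory */
        }
        else {
            cached++;
        }
    }
    return spills;
}

int main(void) {
    /* e.g. loop iterator + list + index + one temporary all live at once */
    printf("4 live values -> %d spill(s)\n", count_spills(4));
    printf("3 live values -> %d spill(s)\n", count_spills(3));
    return 0;
}
```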
Fidget-Spinner commented Jun 26, 2025
One more reason why I'm a little suspicious of our benchmarks: the JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates about 2% week to week (https://github.com/facebookexperimental/free-threading-benchmarking); on the MS runner, it's slightly better at about 1% (https://github.com/faster-cpython/benchmarking-public). Yet we're basing our decision on these numbers, which I can't say I fully trust.
Fidget-Spinner commented Jun 26, 2025
@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:
All of them are standard benchmarking builds, i.e. PGO, LTO, and JIT. These are the results for nbody: So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
Fidget-Spinner commented Jul 3, 2025 • edited
The Meta benchmarking results concur with my own results above:
Most importantly:
Fidget-Spinner commented Aug 22, 2025
The tail-calling CI seems to be failing because Homebrew changed where it installs clang (yet again). I'll put up a separate PR to fix that.
Fidget-Spinner commented Aug 22, 2025
OK, I fixed the macOS CI on main. Please pull the changes in.
markshannon commented Aug 22, 2025
I thought that caching through side exits would speed things up, but it looks like it slows things down a bit, if anything. So I've reverted that change. Will rerun the benchmarks to confirm...
markshannon commented Oct 19, 2025
I was hoping to get a clear across-the-board speedup before merging this.
savannahostrowski left a comment
Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.
Just one comment about the type change in pycore_optimizer.h
Fidget-Spinner commented Nov 10, 2025
Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling due to the loop iterator taking up some space.
markshannon commented Nov 12, 2025 • edited
I think we will want to vary the cache size depending on both hardware and operating system. All that is for later PR(s), though.
7d47f13 to 4fec38d Compare
markshannon commented Dec 10, 2025
Apologies for the force push. This needed manual rebasing after the changes to the JIT frontend.
Fidget-Spinner left a comment
I've read the code and mostly understood it, since I had to rebase it a few times onto my own branch.
These are the latest benchmarking results for this branch https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251210-3.15.0a2%2B-4b24c15-JIT/bm-20251210-vultr-x86_64-faster%252dcpython-tier_2_tos_caching-3.15.0a2%2B-4b24c15-vs-base.md
The 19% nbody speedup is still there, along with a modest ~7% speedup for richards.
So this bodes very well for the JIT. Considering it will unlock future optimizations in decref elimination, which would be very hard without this, we should just go ahead and merge it.
469f191 into python:main
Uses three registers to cache values at the top of the evaluation stack. This significantly reduces memory traffic for smaller, more common uops.
The stats need fixing and the generated tables could be more compact, but it works.
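As a rough illustration of what "three registers caching the top of the stack" means, here is a small, self-contained C toy. All of the names (`tos_cache`, `spill_one`, `push`, `pop`) are invented; the real implementation keeps the cached values in machine registers inside jitted code and emits dedicated spill/reload uops, rather than using a C array like this.

```c
#include <stdio.h>

#define STACK_SIZE 64
#define TOS_CACHE_SIZE 3       /* mirrors the three registers used by the PR */

static long stack_memory[STACK_SIZE];  /* the in-memory evaluation stack */
static int stack_depth = 0;            /* items currently spilled to memory */

static long tos_cache[TOS_CACHE_SIZE]; /* stands in for the cached registers */
static int cached = 0;                 /* how many values live in the cache */

/* Write the deepest cached value back to the in-memory stack (a "spill"). */
static void spill_one(void) {
    stack_memory[stack_depth++] = tos_cache[0];
    for (int i = 1; i < cached; i++) {
        tos_cache[i - 1] = tos_cache[i];
    }
    cached--;
}

/* Push a value, spilling the deepest cached value if the cache is full. */
static void push(long v) {
    if (cached == TOS_CACHE_SIZE) {
        spill_one();
    }
    tos_cache[cached++] = v;
}

/* Pop a value, reloading from memory only if nothing is cached (a "reload"). */
static long pop(void) {
    if (cached == 0) {
        return stack_memory[--stack_depth];
    }
    return tos_cache[--cached];
}

int main(void) {
    /* 2 + 3: both operands stay in the cache, so this "binary op" never
       touches stack_memory at all -- that is the memory traffic being saved. */
    push(2);
    push(3);
    long rhs = pop();
    long lhs = pop();
    push(lhs + rhs);
    printf("%ld\n", pop());   /* prints 5 */
    return 0;
}
```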