
Conversation

@pablogsal (Member, Author) commented Nov 24, 2025

@brandtbucher can you take a look? This PR tries to fix the detection of incomplete samples by checking that, when we unwind, we actually reach the top. This is because if we catch the interpreter in an inconsistent state, we can reach a frame owned by the interpreter that looks like it is the topmost one.

With this new field in the thread state, we can easily check whether we have successfully reached the topmost frame, and reject the sample if we have not. I measured pyperformance and there is no measurable impact (as expected, it is just setting a pointer), and perf predicts the branch every time.
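
For concreteness, here is a minimal sketch of the sampler-side check this enables; `remote_frame` and the `read_frame` callback are illustrative stand-ins, not the actual `_remote_debugging` module API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a frame read out of the target process. */
typedef struct remote_frame {
    uintptr_t previous;  /* remote address of frame->previous */
} remote_frame;

/* Walk the chain from current_frame and report whether we actually reached
 * the frame recorded in the thread state as the entry frame. If we never
 * see it, the sample is partial (torn read, inconsistent interpreter state)
 * and should be rejected rather than reported. */
static bool
sample_is_complete(uintptr_t current_frame, uintptr_t entry_frame,
                   bool (*read_frame)(uintptr_t addr, remote_frame *out))
{
    for (uintptr_t addr = current_frame; addr != 0; ) {
        if (addr == entry_frame) {
            return true;           /* unwound all the way to the bottom */
        }
        remote_frame f;
        if (!read_frame(addr, &f)) {
            return false;          /* failed read: treat as incomplete */
        }
        addr = f.previous;
    }
    return false;                  /* chain ended before the entry frame */
}
```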

Pyperformance run:

All benchmarks:

| Benchmark | 2025-11-24_24-37-main-d855ff54466f | 2025-11-22_21-59-main-227b9d326ec7 |
| --- | --- | --- |
| subparsers | 97.8 ms | 100 ms: 1.02x slower |
| async_generators | 664 ms | 646 ms: 1.03x faster |
| bpe_tokeniser | 7.06 sec | 6.87 sec: 1.03x faster |
| comprehensions | 26.6 us | 27.3 us: 1.03x slower |
| coverage | 131 ms | 134 ms: 1.02x slower |
| crypto_pyaes | 124 ms | 121 ms: 1.02x faster |
| deepcopy_reduce | 6.76 us | 6.94 us: 1.03x slower |
| fannkuch | 640 ms | 656 ms: 1.02x slower |
| generators | 44.4 ms | 45.4 ms: 1.02x slower |
| logging_format | 17.0 us | 17.4 us: 1.02x slower |
| logging_simple | 14.8 us | 15.2 us: 1.03x slower |
| mdp | 2.13 sec | 2.09 sec: 1.02x faster |
| nbody | 180 ms | 176 ms: 1.02x faster |
| pickle_pure_python | 630 us | 647 us: 1.03x slower |
| regex_compile | 223 ms | 227 ms: 1.02x slower |
| regex_dna | 262 ms | 255 ms: 1.03x faster |
| regex_effbot | 4.34 ms | 4.23 ms: 1.03x faster |
| scimark_monte_carlo | 110 ms | 108 ms: 1.02x faster |
| scimark_sor | 184 ms | 188 ms: 1.02x slower |
| spectral_norm | 150 ms | 154 ms: 1.03x slower |
| unpack_sequence | 71.1 ns | 69.2 ns: 1.03x faster |
| xdsl_constant_fold | 95.4 ms | 97.6 ms: 1.02x slower |
| xml_etree_generate | 144 ms | 148 ms: 1.03x slower |
| xml_etree_process | 105 ms | 107 ms: 1.02x slower |
| Geometric mean | (ref) | 1.00x slower |

Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse

Could you check if I am missing something?

@pablogsal force-pushed the gh-138122-fix branch 4 times, most recently from 42921cb to d855ff5, on November 24, 2025 at 18:23
@markshannon (Member) previously requested changes Nov 25, 2025

Rather than have a pointer to the lowest frame (stack grows up) in the stack, each thread state could have an embedded base frame to start the stack with:
tstate->current_frame = &tstate->base_frame in thread state initialization.

Then use &tstate->base_frame instead of tstate->entry_frame when needed. It adds a bit of extra work for each thread creation, but doesn't add overhead when entering the interpreter loop. The stack walk in the GC might also need updating.
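
For illustration, the shape of this suggestion might look something like the sketch below; the struct definitions are simplified stand-ins, not the real PyThreadState or _PyInterpreterFrame:

```c
/* Simplified stand-ins for the real CPython structs. */
struct frame {
    struct frame *previous;
    int owner;               /* e.g. owned by a thread vs. by the interpreter */
};

struct thread_state {
    struct frame *current_frame;  /* top of the stack (stack grows up) */
    struct frame base_frame;      /* embedded sentinel at the bottom */
};

#define OWNED_BY_INTERPRETER 1

static void
thread_state_init(struct thread_state *tstate)
{
    /* The embedded base frame terminates the chain and marks the bottom. */
    tstate->base_frame.previous = NULL;
    tstate->base_frame.owner = OWNED_BY_INTERPRETER;

    /* tstate->current_frame = &tstate->base_frame, as suggested above:
     * any stack walk that reaches this frame knows it is complete. */
    tstate->current_frame = &tstate->base_frame;
}
```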

Also could you add a note to InternalDocs/frame.md about how this works?

@bedevere-app

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal (Member, Author) commented Nov 25, 2025

Rather than have a pointer to the lowest frame (stack grows up) in the stack, each thread state could have an embedded base frame to start the stack with:

tstate->current_frame = &tstate->base_frame in thread state initialization.

Then use &tstate->base_frame instead of tstate->entry_frame when needed. It adds a bit of extra work for each thread creation, but doesn't add overhead when entering the interpreter loop. The stack walk in the GC might also need updating.

If you feel strongly I can investigate this method, but I think it is worse than the current approach. The current approach is more generic because it is not prescriptive about what you put there as the first frame, and we don't need to update every consumer of this (like the GC or the thread state). The cost, as shown, is also nonexistent: it is a single write on the first entry to and the last exit from the eval frame. You cannot measure it even if you want to. perf literally had 0 samples on this line even when sampling at 200 kHz. The code is cleaner in my opinion, as it is literally the two writes you expect.
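
To make the "two writes" concrete, here is a minimal sketch using simplified stand-in types, not the real PyThreadState or _PyEval_EvalFrameDefault:

```c
struct frame;

struct thread_state {
    struct frame *current_frame;
    struct frame *entry_frame;   /* bottommost frame of this eval-loop entry */
};

/* Hypothetical eval-loop wrapper: the only extra cost of the current
 * approach is one pointer write on entry and one on exit. */
static void
eval_frame(struct thread_state *tstate, struct frame *f)
{
    struct frame *saved_entry = tstate->entry_frame;
    tstate->entry_frame = f;            /* write #1: entering the eval loop */

    /* ... interpret the bytecode for f and any frames it pushes ... */

    tstate->entry_frame = saved_entry;  /* write #2: leaving the eval loop */
}
```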

The other advantage of this method is that it doesn't affect existing profilers, whereas the one you propose will require people to identify this entry frame as a special one and ignore it.

Edit: I tried to give your approach a go and found some extra complications: we cannot easily 'embed' the frame in the thread state, because that is a semi-public header and the interpreter frame struct is more private, so embedding it would force me to expose that header there, which we don't want to do. If we place it in a location other than the thread state, then the profiler will need to copy more memory, which will slow it down.

The other thing that makes this more annoying is that the embedded frame then cannot be copied using the same mechanism as other frames (the chunks), and we need to make a copy JUST for that frame, which is an extra system call per sample.

Also could you add a note to InternalDocs/frame.md about how this works?

Ah yes, excellent point. I will update it once we settle on the approach.

…ame unwinding

Add PyThreadState.entry_frame to track the bottommost frame in each thread. The profiler validates that unwinding reached this frame, rejecting incomplete samples caused by race conditions or errors. Also fix unwinding to skip suspended generator frames, which have NULL previous pointers.
@pablogsal (Member, Author) commented Nov 25, 2025

@markshannon I have implemented a POC of your suggestion in 111e70c just so you can see what I meant. We can go with this if you feel really strongly, but I still think that it is much more code and trickier to get right (see #141912 (comment)).

@pablogsal changed the title from "gh-138122: Don't sample partial frame chains with suspended generators" to "gh-138122: Don't sample partial frame chains" on Nov 25, 2025
@markshannon (Member)

It definitely is more code, but I still prefer it as it doesn't interfere with the normal execution of the interpreter.

Tools already need to skip entry frames, so they wouldn't need to be changed if you reused FRAME_OWNED_BY_INTERPRETER instead of adding FRAME_OWNED_BY_THREAD_STATE. The GC code should also work unmodified in that case.
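
A sketch of the tool-side logic this relies on, using simplified stand-in definitions rather than the real pycore_frame.h enum and frame struct:

```c
#include <stddef.h>

enum frame_owner {
    FRAME_OWNED_BY_THREAD,
    FRAME_OWNED_BY_GENERATOR,
    FRAME_OWNED_BY_INTERPRETER,   /* entry frames (and, if reused, the base frame) */
};

struct frame {
    struct frame *previous;
    enum frame_owner owner;
};

/* Existing tool behaviour: report only Python-level frames. Because the new
 * base frame would reuse FRAME_OWNED_BY_INTERPRETER, this loop needs no change. */
static void
walk_stack(struct frame *top, void (*report)(struct frame *))
{
    for (struct frame *f = top; f != NULL; f = f->previous) {
        if (f->owner == FRAME_OWNED_BY_INTERPRETER) {
            continue;   /* skip interpreter-owned sentinel/entry frames */
        }
        report(f);
    }
}
```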

The other thing that makes this more annoying is that the embedded frame then cannot be copied using the same mechanism as other frames (the chunks), and we need to make a copy JUST for that frame, which is an extra system call per sample.

We could embed the frame in the first chunk. The first chunk already has to waste a bit of space to make it appear to the chunk management code that it is never empty. Maybe putting a root frame there would clean up the code a bit.

Is there any reason why we can't change the implementation later?
If not, then maybe go ahead with your preferred approach and we can revisit this if/when we end up doing something like #100987 (comment).

@pablogsal
Copy link
MemberAuthor

pablogsal commented Nov 27, 2025

It definitely is more code, but I still prefer it as it doesn't interfere with the normal execution of the interpreter.

Tools already need to skip entry frames, so they wouldn't need to be changed if you reused FRAME_OWNED_BY_INTERPRETER instead of adding FRAME_OWNED_BY_THREAD_STATE. The GC code should also work unmodified in that case.

OK, let's go with your idea then. If it proves to be too annoying, we can change the implementation later. I have pushed a commit to reuse FRAME_OWNED_BY_INTERPRETER as you suggested.

@pablogsal (Member, Author)

I have made the requested changes; please review again.

@bedevere-app

Thanks for making the requested changes!

@Fidget-Spinner: please review the changes made to this pull request.

@pablogsal (Member, Author)

I plan to land this in a couple of days, unless someone wants extra time, as I have implemented the requested changes that Mark proposed.

Resolved merge conflicts in:
- Include/cpython/pystate.h: Keep both base_frame and last_profiled_frame fields
- Include/internal/pycore_debug_offsets.h: Add debug offsets for both fields
- InternalDocs/frames.md: Include documentation for both features
- Modules/_remote_debugging/_remote_debugging.h: Updated process_frame_chain signature
- Modules/_remote_debugging/frames.c: Merged base_frame validation with caching logic
- Modules/_remote_debugging/threads.c: Integrated caching support with base_frame

The base_frame feature (for stack validation) now coexists with the last_profiled_frame feature (for profiling cache optimization).
@pablogsal added the 🔨 test-with-buildbots label (Test PR w/ buildbots; report in status section) on Dec 7, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit c9ab431 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F141912%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label (Test PR w/ buildbots; report in status section) on Dec 7, 2025
@pablogsal merged commit d6d850d into python:main on Dec 7, 2025
116 of 119 checks passed
@pablogsal deleted the gh-138122-fix branch on December 7, 2025 at 15:53