
Conversation

@pablogsal (Member, Author) commented Nov 24, 2025

@brandtbucher can you take a look? This PR tries to fix the detection of incomplete samples by checking that, when we unwind, we actually reach the top. This is because if we catch the interpreter in an inconsistent state, we can reach a frame owned by the interpreter that looks like it is the topmost one.

With this new field in the thread state, we can easily check whether we have successfully reached the topmost frame, and reject the sample if we have not. I measured pyperformance and there is no measurable impact (as expected, it is just setting a pointer), and perf predicts the branch every time.
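
For concreteness, here is a minimal sketch of the sampler-side check this enables; `remote_frame` and the `read_frame` callback are illustrative stand-ins, not the actual `_remote_debugging` module API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a frame read out of the target process. */
typedef struct remote_frame {
    uintptr_t previous;  /* remote address of frame->previous */
} remote_frame;

/* Walk the chain from current_frame and report whether we actually reached
 * the frame recorded in the thread state as the entry frame. If we never
 * see it, the sample is partial (torn read, inconsistent interpreter state)
 * and should be rejected rather than reported. */
static bool
sample_is_complete(uintptr_t current_frame, uintptr_t entry_frame,
                   bool (*read_frame)(uintptr_t addr, remote_frame *out))
{
    for (uintptr_t addr = current_frame; addr != 0; ) {
        if (addr == entry_frame) {
            return true;           /* unwound all the way to the bottom */
        }
        remote_frame f;
        if (!read_frame(addr, &f)) {
            return false;          /* failed read: treat as incomplete */
        }
        addr = f.previous;
    }
    return false;                  /* chain ended before the entry frame */
}
```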

Pyperformance run:

All benchmarks:

| Benchmark | 2025-11-24_24-37-main-d855ff54466f | 2025-11-22_21-59-main-227b9d326ec7 |
| --- | --- | --- |
| subparsers | 97.8 ms | 100 ms: 1.02x slower |
| async_generators | 664 ms | 646 ms: 1.03x faster |
| bpe_tokeniser | 7.06 sec | 6.87 sec: 1.03x faster |
| comprehensions | 26.6 us | 27.3 us: 1.03x slower |
| coverage | 131 ms | 134 ms: 1.02x slower |
| crypto_pyaes | 124 ms | 121 ms: 1.02x faster |
| deepcopy_reduce | 6.76 us | 6.94 us: 1.03x slower |
| fannkuch | 640 ms | 656 ms: 1.02x slower |
| generators | 44.4 ms | 45.4 ms: 1.02x slower |
| logging_format | 17.0 us | 17.4 us: 1.02x slower |
| logging_simple | 14.8 us | 15.2 us: 1.03x slower |
| mdp | 2.13 sec | 2.09 sec: 1.02x faster |
| nbody | 180 ms | 176 ms: 1.02x faster |
| pickle_pure_python | 630 us | 647 us: 1.03x slower |
| regex_compile | 223 ms | 227 ms: 1.02x slower |
| regex_dna | 262 ms | 255 ms: 1.03x faster |
| regex_effbot | 4.34 ms | 4.23 ms: 1.03x faster |
| scimark_monte_carlo | 110 ms | 108 ms: 1.02x faster |
| scimark_sor | 184 ms | 188 ms: 1.02x slower |
| spectral_norm | 150 ms | 154 ms: 1.03x slower |
| unpack_sequence | 71.1 ns | 69.2 ns: 1.03x faster |
| xdsl_constant_fold | 95.4 ms | 97.6 ms: 1.02x slower |
| xml_etree_generate | 144 ms | 148 ms: 1.03x slower |
| xml_etree_process | 105 ms | 107 ms: 1.02x slower |
| Geometric mean | (ref) | 1.00x slower |

Benchmark hidden because not significant (85): 2to3, many_optionals, async_tree_none, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_cpu_io_mixed, async_tree_eager_cpu_io_mixed_tg, async_tree_eager_io, async_tree_eager_io_tg, async_tree_eager_memoization, async_tree_eager_memoization_tg, async_tree_eager_tg, async_tree_io, async_tree_io_tg, async_tree_memoization, async_tree_memoization_tg, async_tree_none_tg, asyncio_tcp, asyncio_tcp_ssl, asyncio_websockets, chameleon, chaos, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, deltablue, django_template, docutils, dulwich_log, float, create_gc_cycles, gc_traversal, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_silent, mako, meteor_contest, nqueens, pathlib, pickle, pickle_dict, pickle_list, pidigits, pprint_safe_repr, pprint_pformat, pyflate, python_startup, python_startup_no_site, raytrace, regex_v8, richards, richards_super, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sphinx, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_v2_normalize, sqlglot_v2_optimize, sqlglot_v2_parse, sqlglot_v2_transpile, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tomli_loads, tornado_http, typing_runtime_protocols, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse

Could you check if I am missing something?

@pablogsal force-pushed the gh-138122-fix branch 4 times, most recently from 42921cb to d855ff5, on November 24, 2025 at 18:23
@markshannon (Member) previously requested changes Nov 25, 2025

Rather than have a pointer to the lowest frame (stack grows up) in the stack, each thread state could have an embedded base frame to start the stack with:
tstate->current_frame = &tstate->base_frame in thread state initialization.

Then use &tstate->base_frame instead of tstate->entry_frame when needed. It adds a bit of extra work for each thread creation, but doesn't add overhead when entering the interpreter loop. The stack walk in the GC might also need updating.
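
For illustration, the shape of this suggestion might look something like the sketch below; the struct definitions are simplified stand-ins, not the real PyThreadState or _PyInterpreterFrame:

```c
/* Simplified stand-ins for the real CPython structs. */
struct frame {
    struct frame *previous;
    int owner;               /* e.g. owned by a thread vs. by the interpreter */
};

struct thread_state {
    struct frame *current_frame;  /* top of the stack (stack grows up) */
    struct frame base_frame;      /* embedded sentinel at the bottom */
};

#define OWNED_BY_INTERPRETER 1

static void
thread_state_init(struct thread_state *tstate)
{
    /* The embedded base frame terminates the chain and marks the bottom. */
    tstate->base_frame.previous = NULL;
    tstate->base_frame.owner = OWNED_BY_INTERPRETER;

    /* tstate->current_frame = &tstate->base_frame, as suggested above:
     * any stack walk that reaches this frame knows it is complete. */
    tstate->current_frame = &tstate->base_frame;
}
```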

Also could you add a note to InternalDocs/frame.md about how this works?

@bedevere-app

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal (Member, Author) commented Nov 25, 2025

Rather than have a pointer to the lowest frame (stack grows up) in the stack, each thread state could have an embedded base frame to start the stack with:

tstate->current_frame = &tstate->base_frame in thread state initialization.

Then use &tstate->base_frame instead of tstate->entry_frame when needed. It adds a bit of extra work for each thread creation, but doesn't add overhead when entering the interpreter loop. The stack walk in the GC might also need updating.

If you feel strongly I can investigate this method, but I think it is worse than the current approach. The current approach is more generic because it is not prescriptive about what you put there as the first frame, and we don't need to update every consumer of this (like the GC or the thread state). The cost, as shown, is also nonexistent: it is a single write on the first entry to and the last exit from the eval frame. You cannot measure it even if you want to. perf literally had 0 samples on this line even when sampling at 200 kHz. The code is cleaner in my opinion, as it is literally the two writes you expect.
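
To make the "two writes" concrete, here is a minimal sketch using simplified stand-in types, not the real PyThreadState or _PyEval_EvalFrameDefault:

```c
struct frame;

struct thread_state {
    struct frame *current_frame;
    struct frame *entry_frame;   /* bottommost frame of this eval-loop entry */
};

/* Hypothetical eval-loop wrapper: the only extra cost of the current
 * approach is one pointer write on entry and one on exit. */
static void
eval_frame(struct thread_state *tstate, struct frame *f)
{
    struct frame *saved_entry = tstate->entry_frame;
    tstate->entry_frame = f;            /* write #1: entering the eval loop */

    /* ... interpret the bytecode for f and any frames it pushes ... */

    tstate->entry_frame = saved_entry;  /* write #2: leaving the eval loop */
}
```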

The other advantage of this method is that it doesn't affect existing profilers, whereas the one you propose will require people to identify this entry frame as a special one and ignore it.

Edit: I tried to give your approach a go and found some extra complications: we cannot easily 'embed' the frame in the thread state, because that is a semi-public header and the interpreter frame struct is more private, so embedding it would force me to expose that header there, which we don't want to do. If we place it in a location other than the thread state, then the profiler will need to copy more memory, which will slow it down.

The other thing that makes this more annoying is that the embedded frame then cannot be copied using the same mechanism as other frames (the chunks), and we need to make a copy JUST for that frame, which is an extra system call per sample.

Also could you add a note to InternalDocs/frame.md about how this works?

Ah yes, excellent point. I will update it once we settle on the approach.

…ame unwinding

Add PyThreadState.entry_frame to track the bottommost frame in each thread. The profiler validates that unwinding reached this frame, rejecting incomplete samples caused by race conditions or errors. Also fix unwinding to skip suspended generator frames, which have NULL previous pointers.
@pablogsal (Member, Author) commented Nov 25, 2025

@markshannon I have implemented a POC of your suggestion in 111e70c just so you can see what I meant. We can go with this if you feel really strongly, but I still think that it is much more code and trickier to get right (see #141912 (comment)).

@pablogsal changed the title from "gh-138122: Don't sample partial frame chains with suspended generators" to "gh-138122: Don't sample partial frame chains" on Nov 25, 2025
@markshannon (Member)

It definitely is more code, but I still prefer it as it doesn't interfere with the normal execution of the interpreter.

Tools already need to skip entry frames, so they wouldn't need to be changed if you reused FRAME_OWNED_BY_INTERPRETER instead of adding FRAME_OWNED_BY_THREAD_STATE. The GC code should also work unmodified in that case.
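
A sketch of the tool-side logic this relies on, using simplified stand-in definitions rather than the real pycore_frame.h enum and frame struct:

```c
#include <stddef.h>

enum frame_owner {
    FRAME_OWNED_BY_THREAD,
    FRAME_OWNED_BY_GENERATOR,
    FRAME_OWNED_BY_INTERPRETER,   /* entry frames (and, if reused, the base frame) */
};

struct frame {
    struct frame *previous;
    enum frame_owner owner;
};

/* Existing tool behaviour: report only Python-level frames. Because the new
 * base frame would reuse FRAME_OWNED_BY_INTERPRETER, this loop needs no change. */
static void
walk_stack(struct frame *top, void (*report)(struct frame *))
{
    for (struct frame *f = top; f != NULL; f = f->previous) {
        if (f->owner == FRAME_OWNED_BY_INTERPRETER) {
            continue;   /* skip interpreter-owned sentinel/entry frames */
        }
        report(f);
    }
}
```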

The other thing that makes this more annoying is that the embedded frame then cannot be copied using the same mechanism as other frames (the chunks), and we need to make a copy JUST for that frame, which is an extra system call per sample.

We could embed the frame in the first chunk. The first chunk already has to waste a bit of space to make it appear to the chunk management code that it is never empty. Maybe putting a root frame there would clean up the code a bit.

Is there any reason why we can't change the implementation later?
If not, then maybe go ahead with your preferred approach and we can revisit this if/when we end up doing something like #100987 (comment).

@pablogsal
Copy link
MemberAuthor

pablogsal commented Nov 27, 2025

It definitely is more code, but I still prefer it as it doesn't interfere with the normal execution of the interpreter.

Tools already need to skip entry frames, so they wouldn't need to be changed if you reused FRAME_OWNED_BY_INTERPRETER instead of adding FRAME_OWNED_BY_THREAD_STATE. The GC code should also work unmodified in that case.

OK, let's go with your idea then. If it proves to be too annoying, we can change the implementation later. I have pushed a commit to reuse FRAME_OWNED_BY_INTERPRETER as you suggested.

@pablogsal (Member, Author)

I have made the requested changes; please review again.

@bedevere-app

Thanks for making the requested changes!

@Fidget-Spinner: please review the changes made to this pull request.

@pablogsal (Member, Author)

I plan to land this in a couple of days, unless someone wants extra time, as I have implemented the requested changes that Mark proposed.

Resolved merge conflicts in:
- Include/cpython/pystate.h: Keep both base_frame and last_profiled_frame fields
- Include/internal/pycore_debug_offsets.h: Add debug offsets for both fields
- InternalDocs/frames.md: Include documentation for both features
- Modules/_remote_debugging/_remote_debugging.h: Updated process_frame_chain signature
- Modules/_remote_debugging/frames.c: Merged base_frame validation with caching logic
- Modules/_remote_debugging/threads.c: Integrated caching support with base_frame

The base_frame feature (for stack validation) now coexists with the last_profiled_frame feature (for profiling cache optimization).
@pablogsal added the 🔨 test-with-buildbots label (Test PR w/ buildbots; report in status section) on Dec 7, 2025
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit c9ab431 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F141912%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label (Test PR w/ buildbots; report in status section) on Dec 7, 2025
@pablogsal merged commit d6d850d into python:main on Dec 7, 2025
116 of 119 checks passed
@pablogsal deleted the gh-138122-fix branch on December 7, 2025 at 15:53