
Conversation

colesbury (Contributor) commented Oct 25, 2024

These consist of a number of short snippets that help identify scaling bottlenecks in the free threaded interpreter. The current bottlenecks are in benchmarks that call functions (due to `LOAD_ATTR` not yet using deferred reference counting) and in benchmarks that access thread-local data.
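
For readers unfamiliar with the harness, here is a minimal sketch of the measurement pattern. It is illustrative only (the names, thread count, and iteration count are assumptions, not taken from this PR): time a tiny workload in a single thread, time it again with N threads running it concurrently, and report the throughput ratio as "Nx faster" or "Nx slower".

```python
# Minimal sketch of a free-threading scaling micro-benchmark.
# All names and constants here are illustrative, not the PR's actual code.
import threading
import time

WORK_ITERS = 200_000
NUM_THREADS = 8  # assumption: roughly one thread per core


def pyfunction_workload():
    """Tiny snippet whose scaling we want to measure: plain Python calls."""
    def f(x):
        return x + 1

    for i in range(WORK_ITERS):
        f(i)


def throughput(num_threads, workload):
    """Run `workload` once in each of `num_threads` threads; return work per second."""
    threads = [threading.Thread(target=workload) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return num_threads * WORK_ITERS / elapsed


if __name__ == "__main__":
    single = throughput(1, pyfunction_workload)
    multi = throughput(NUM_THREADS, pyfunction_workload)
    ratio = multi / single
    label = f"{ratio:.1f}x faster" if ratio >= 1.0 else f"{1 / ratio:.1f}x slower"
    print(f"pyfunction {label}")
```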
colesbury (Contributor, Author) commented Oct 25, 2024

Some results below:

| Benchmark | CPython 3.14t | CPython 3.13t | nogil fork (3.9) |
| --- | --- | --- | --- |
| object_cfunction | 1.1x faster | 9.8x faster | 10.4x faster |
| cmodule_function | 1.1x slower | 9.6x faster | 9.2x faster |
| mult_constant | 9.7x faster | 8.8x faster | 10.0x faster |
| generator | 9.5x faster | 9.5x faster | 9.0x faster |
| pymethod | 1.1x faster | 9.7x faster | 9.5x faster |
| pyfunction | 9.7x faster | 9.7x faster | 10.0x faster |
| module_function | 1.2x slower | 10.0x faster | 10.0x faster |
| load_string_const | 9.5x faster | 1.8x slower | 9.8x faster |
| load_tuple_const | 9.8x faster | 9.5x faster | 9.8x faster |
| create_pyobject | 9.5x faster | 9.5x faster | 9.6x faster |
| create_closure | 9.9x faster | 9.8x faster | 9.2x faster |
| create_dict | 9.8x faster | 8.8x faster | 9.6x faster |
| thread_local_read | 2.2x slower | 2.0x slower | 9.0x faster |
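
As a rough guide to reading the table (the thread count used for these runs is not stated here, so the value below is purely an assumption for illustration): a speedup close to the number of worker threads means near-linear scaling, while "slower" means the multi-threaded run produced less total throughput than a single thread.

```python
# Hypothetical back-of-the-envelope reading of the numbers above.
# NUM_THREADS is an assumed value for illustration, not taken from the PR.
NUM_THREADS = 10

results = {"pyfunction": 9.7, "pymethod": 1.1, "thread_local_read": 1 / 2.2}
for name, throughput_ratio in results.items():
    efficiency = throughput_ratio / NUM_THREADS
    print(f"{name}: {efficiency:.0%} parallel efficiency")
```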

As mentioned in the PR description, we have known scaling issues related to `LOAD_ATTR` not yet using deferred reference counting. We also have a scaling issue when reading thread-local data -- we should probably enable deferred reference counting on `_thread._local` objects.
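
For reference, a `thread_local_read`-style snippet looks roughly like the sketch below (illustrative, not the PR's code): every attribute read goes through the shared `threading.local` instance, so without deferred reference counting its reference count presumably becomes a contended cache line even though the per-thread values themselves are independent.

```python
# Illustrative thread_local_read-style workload (not the PR's actual code).
# The hot loop only reads an attribute of a shared threading.local object;
# the contention presumably comes from reference counting on that shared
# object, not from the per-thread values themselves.
import threading

ITERS = 100_000
local = threading.local()


def thread_local_read():
    local.value = 0  # each thread sets and sees only its own copy
    for _ in range(ITERS):
        local.value  # hot path: repeated reads of thread-local data


threads = [threading.Thread(target=thread_local_read) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```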

The 3.13 release avoids the `LOAD_ATTR` scaling issues due to immortalization. However, we apparently have a bug related to string immortalization (`load_string_const` is slow), and the thread-local bottleneck is also present.
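
A `load_string_const`-style snippet is roughly the sketch below (again illustrative, not the PR's code): the function body does little more than load a string constant, so if that constant is not immortal, every call presumably ends up contending on its reference count.

```python
# Illustrative load_string_const-style workload (not the PR's actual code).
# If the string constant below is not immortalized, each load presumably
# triggers a reference-count update on an object shared by all threads.
ITERS = 100_000


def load_string_const():
    return "this is a long string constant used only to be loaded repeatedly"


def workload():
    for _ in range(ITERS):
        load_string_const()


workload()
```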

Note that small variations (e.g. 8.8x vs. 10.4x) are not meaningful.

colesbury marked this pull request as ready for review October 25, 2024 17:20
colesbury requested review from Yhg1s and mpage October 25, 2024 17:20
colesbury merged commit 00ea179 into python:main Oct 28, 2024
colesbury deleted the gh-125985-ftscalingbench branch October 28, 2024 21:47
picnixz pushed a commit to picnixz/cpython that referenced this pull request Dec 8, 2024
ebonnal pushed a commit to ebonnal/cpython that referenced this pull request Jan 12, 2025