
Conversation

@tom-pytel (Contributor) commented Mar 29, 2025:

This PR exists because of #129069 and Tools/tsan/suppressions_free_threading.txt:

```
# List resizing happens through different paths ending in memcpy or memmove
# (for efficiency), which will probably need to rewritten as explicit loops
# of ptr-sized copies to be thread-safe.
```

It seems to hit the requirements: performance is preserved, atomicity is assured (per pointer), and TSAN shuts up. I'm fairly sure the list_ass_slice_lock_held and list_inplace_repeat_lock_held suppressions can be removed, and if not yet, it should be possible to add the atomic memcpy to list_resize where needed so that they can be.
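Assuming the standard ThreadSanitizer suppression syntax, the entries in question would look something like this (hypothetical; the actual lines in Tools/tsan/suppressions_free_threading.txt may use a different prefix):

```
race:list_ass_slice_lock_held
race:list_inplace_repeat_lock_held
```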

The "atomic memcpy" (atomic per ptr, not whole memcpy) functions are in the atomic wrappers header because they can be reused, if want otherwise though can move into listobject.c, or somewhere else. Can also make non-inline callable real functions. Or the inline funcs could go into pyatomic.h?

@colesbury, you may recognize this; I did something similar for the lock-free array module PR, but this is simpler. PyObject pointers are a bit more fragile than pure values, though, so I assume you want this here?

Here is an example of what the inner loop of FT_ATOMIC_MEMMOVE_PTR_RELAXED compiles to. It's no rep movsq or other specialized move instruction, but it's fast:

```asm
.L2971:
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
        mov     rax, QWORD PTR [rdi]    # _160, *s
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143: for (void **d = (void **)dest + n, **s = (void **)src + n, **e = (void **)dest - 1; d != e; d--, s--) {
        sub     rsi, 64                 # d,
        sub     rdi, 64                 # s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
        mov     QWORD PTR 64[rsi], rax  #,, _160
        mov     rax, QWORD PTR 56[rdi]  # _160,
        mov     QWORD PTR 56[rsi], rax  #,, _160
        mov     rax, QWORD PTR 48[rdi]  # _160,
        mov     QWORD PTR 48[rsi], rax  #,, _160
        mov     rax, QWORD PTR 40[rdi]  # _160,
        mov     QWORD PTR 40[rsi], rax  #,, _160
        mov     rax, QWORD PTR 32[rdi]  # _160,
        mov     QWORD PTR 32[rsi], rax  #,, _160
        mov     rax, QWORD PTR 24[rdi]  # _160,
        mov     QWORD PTR 24[rsi], rax  #,, _160
        mov     rax, QWORD PTR 16[rdi]  # _160,
        mov     QWORD PTR 16[rsi], rax  #,, _160
        mov     rax, QWORD PTR 8[rdi]   # _160,
        mov     QWORD PTR 8[rsi], rax   #,, _160
        cmp     QWORD PTR 32[rsp], rsi  # %sfp, d
        jne     .L2971                  #,
```

Simple benchmark. Note that the decrementing-address copy case in FT_ATOMIC_MEMMOVE_PTR_RELAXED seems to hurt a bit cache-wise in the 'tins' simple bench, but it doesn't seem to make a difference in the large move or overall in pyperformance. I've had this one jump around from parity with current main to the numbers below (which are the worst case, so I put them here). The difference disappears if the real memmove is used in this case, but that seems to come from the real memmove using specialized instructions (which are not atomic), a la https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. The other two simple benches (tdel and tins big) are solid where they are. Test script:

```python
from timeit import timeit
from time import time

tdel = timeit('del dst[:1024]', 'dst = [None] * 4096', number=10000000)
print(f't del: {tdel}')

tins = timeit('dst[:0] = src', 'dst = [None] * 4096; dst.pop(); src = [None]',
              number=100000)  # pop() to make sure no realloc in timing
print(f't ins: {tins}')

dst = [None] * 0x10000000
src = [None]
dst.pop()
t0 = time()
dst[:0] = src
t1 = time()
print(f't ins big: {t1 - t0}')
```

Times, average of 10 runs each:

```
current main:
-------------
t del:     0.2191115931245804
t ins:     0.3050231862498549
t ins big: 0.0799584686756134

fix-129069:
-----------
t del:     0.21582267370067712
t ins:     1.072859044800134
t ins big: 0.08154952526092529
```

pyperformance benchmark. The difference in average performance is essentially nonexistent, though individual tests can vary a bit (AMD 7950X running in VirtualBox):

```
                          current main   fix-129069   norm diff
async_generators          250            254          -0.0157
asyncio_tcp               248            247           0.0040
asyncio_tcp_ssl           1.08           1.06          0.0189
asyncio_websockets        344            342           0.0058
chaos                     42.3           42.4         -0.0024
comprehensions            11.1           11            0.0091
bench_mp_pool             17.5           17.4          0.0057
bench_thread_pool         1.16           1.09          0.0642
coroutines                12.8           12.8          0.0000
coverage                  57             54.5          0.0459
crypto_pyaes              51.9           51.6          0.0058
deepcopy                  182            180           0.0111
deepcopy_reduce           1.91           1.94         -0.0155
deepcopy_memo             20.2           21.7         -0.0691
deltablue                 2.3            2.3           0.0000
django_template           25.2           25            0.0080
docutils                  1.61           1.58          0.0190
dulwich_log               26.3           25.8          0.0194
fannkuch                  260            257           0.0117
float                     42.4           43.5         -0.0253
create_gc_cycles          494            512          -0.0352
gc_traversal              1.17           1.27         -0.0787
generators                20.9           21.4         -0.0234
genshi_text               16.6           16.7         -0.0060
genshi_xml                34.3           34.9         -0.0172
go                        80.3           80.8         -0.0062
hexiom                    4.17           4.22         -0.0118
html5lib                  33.2           34           -0.0235
json_dumps                7.43           7.31          0.0164
json_loads                14.2           13.9          0.0216
logging_format            4.58           4.61         -0.0065
logging_silent            64.8           63.7          0.0173
logging_simple            4.2            4.22         -0.0047
mako                      7.95           7.97         -0.0025
mdp                       1.78           1.98         -0.1010
meteor_contest            68.2           69.7         -0.0215
nbody                     82.6           78.8          0.0482
nqueens                   58.6           61.1         -0.0409
pathlib                   13.8           13.9         -0.0072
pickle                    6.12           6.13         -0.0016
pickle_dict               15.3           15.2          0.0066
pickle_list               2.3            2.18          0.0550
pickle_pure_python        204            203           0.0049
pidigits                  116            117          -0.0085
pprint_safe_repr          499            513          -0.0273
pprint_pformat            1.02           1.05         -0.0286
pyflate                   296            304          -0.0263
python_startup            9.29           9.29          0.0000
python_startup_no_site    7.78           7.62          0.0210
raytrace                  192            193          -0.0052
regex_compile             74.4           74.1          0.0040
regex_dna                 111            110           0.0091
regex_effbot              1.69           1.7          -0.0059
regex_v8                  13.8           13.7          0.0073
richards                  32.1           32.4         -0.0093
richards_super            36.8           37.3         -0.0134
scimark_fft               232            231           0.0043
scimark_lu                81             80.7          0.0037
scimark_monte_carlo       45.9           47.9         -0.0418
scimark_sor               77.9           76.3          0.0210
scimark_sparse_mat_mult   3.83           3.78          0.0132
spectral_norm             68.5           67            0.0224
sqlglot_normalize         197            196           0.0051
sqlglot_optimize          36.8           36.9         -0.0027
sqlglot_parse             887            888          -0.0011
sqlglot_transpile         1.1            1.1           0.0000
sqlite_synth              1.43           1.44         -0.0069
sympy_expand              277            276           0.0036
sympy_integrate           12.8           12.7          0.0079
sympy_sum                 88.1           85.3          0.0328
sympy_str                 164            162           0.0123
telco                     5.47           5.39          0.0148
tomli_loads               1.37           1.37          0.0000
typing_runtime_protocols  112            112           0.0000
unpack_sequence           28.7           28.6          0.0035
unpickle                  8              8.18         -0.0220
unpickle_list             2.87           2.75          0.0436
unpickle_pure_python      144            140           0.0286
xml_etree_parse           74             72.6          0.0193
xml_etree_iterparse       49.6           47.8          0.0377
xml_etree_generate        56.2           56.9         -0.0123
xml_etree_process         40.4           40.3          0.0025
AVERAGE                                               -0.0001
```

@tom-pytel (Contributor, Author) commented:

Ping @colesbury, @Yhg1s. Windows doesn't like size_t as a parameter; I will fix that, but I would like a thumbs up or down on this PR in general.

@tom-pytel changed the title from "gh-129069: make list ass_slice and memory_repeat safe" to "gh-129069: make list ass_slice and memory_repeat safe in free-threading" on Mar 29, 2025.
@python-cla-bot commented:

All commit authors signed the Contributor License Agreement.

@tom-pytel (Contributor, Author) commented:

Thank you @mhsmith for the analysis; it led to the solution.

Specifically:

```c
#include "pyatomic.h"
```

needs to be accommodated on the ARM Android build (but apparently nowhere else, and I swear it compiled fine a couple of weeks ago).

@kumaraditya303 I had to revert #131882 (comment) because _Py_IS_ALIGNED is not available in the above case. In fact, I had to add #include <assert.h> to pyatomic.h in order to accommodate even direct alignment validation. Which leads me to a question: should we even bother with those assertions here (which necessitate that include)?
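For reference, the direct validation amounts to something like this (a sketch; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Spelled-out equivalent of assert(_Py_IS_ALIGNED(p, sizeof(void *))),
 * usable where pymacro.h (which defines _Py_IS_ALIGNED) is not visible. */
static inline void
assert_ptr_aligned(const void *p)
{
    assert(((uintptr_t)p & (sizeof(void *) - 1)) == 0);
}
```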

@mhsmith (Member) commented:

As I explained in my previous comment, this is a macOS problem, not an Android one. If the PR had been merged in its previous state, it would have failed on the macOS buildbots, unless they use the configure cache in the same unsafe way as GitHub Actions does.

@kumaraditya303 added the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label on Sep 6, 2025.
@bedevere-bot commented:
🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit 9c4debd 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F131882%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label on Sep 6, 2025.
@tom-pytel (Contributor, Author) commented:

WHY???

```
extra_compile_args: [..., '-Werror=declaration-after-statement', '-DMODULE_NAME=_test_c99_cext', '-std=c99']
/Users/buildbot/buildarea/pull_request.pablogsal-macos-m1.macos-with-brew/build/Include/cpython/pyatomic.h:590:16: error: mixing declarations and code is incompatible with standards before C99 [-Werror,-Wdeclaration-after-statement]
  590 |     void **dest_ = (void **)dest + n;
      |            ^
1 error generated.
```
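The test extension is compiled with -Werror=declaration-after-statement, so header code can't introduce a declaration after a statement; the usual fix is to hoist the declaration to the top of the block. A minimal sketch (the function name and the assert are hypothetical stand-ins for the surrounding code at pyatomic.h:590):

```c
#include <assert.h>
#include <stddef.h>

static inline void
fix_sketch(void *dest, size_t n)
{
    void **dest_;                /* declaration hoisted to the top of the block */
    assert(dest != NULL);        /* stand-in for the statements that came first */
    dest_ = (void **)dest + n;   /* the flagged line, now a plain assignment */
    (void)dest_;                 /* silence unused-variable warnings in the sketch */
}
```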

@kumaraditya303 added the 🔨 test-with-buildbots label again on Sep 7, 2025.
@bedevere-bot commented:
🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit 32f9604 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F131882%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label on Sep 7, 2025.
@colesbury (Contributor) commented:

Hi @tom-pytel, thanks for your work on this and sorry for letting the PR languish. I've taken a slightly different approach in #142957 for the same issue.

@tom-pytel (Contributor, Author) commented:

> Hi @tom-pytel, thanks for your work on this and sorry for letting the PR languish. I've taken a slightly different approach in #142957 for the same issue.

No worries, I'll close this out. But what should I do with #130771: close it, or is it still viable?
