Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34k
[3.11] gh-106883 Fix deadlock in threaded application#117332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
diegorusso commented Mar 28, 2024 • edited by bedevere-app bot
Loading Uh oh!
There was an error while loading. Please reload this page.
edited by bedevere-app bot
Uh oh!
There was an error while loading. Please reload this page.
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. It has been suggested to disable GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
pablogsal commented Mar 28, 2024
This will break users that have disabled the gc, we need to do the dance only if it has not been disabled |
pablogsal commented Mar 28, 2024
Also there are return paths for errors where this leaves the gc disabled |
diegorusso commented Mar 28, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
@pablogsal thanks for your comments! Of course this was more an "attempt" to check if the approach was in the right direction and to check if the C API I used are the correct ones. Next week I will produce a more complete patch. |
colesbury left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this is the right general approach. I left some inline comments to expand on what @pablogsal wrote.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
When using threaded applications, there is a high risk of a deadlock in the intepreter. It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL. By disabling the GC during the _PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls fixes the issue.
diegorusso commented Apr 5, 2024
@pablogsal@colesbury I've updated the PR addressing your comments and adding a test to verify the patch. Please let me know what you think. |
diegorusso commented Apr 5, 2024
It is currently failing on the Address Sanitiser build, I'm investigating what the problem is. |
diegorusso commented Apr 9, 2024
I have pushed a new test for it. It is slower compare to the existent tests: it takes about 10 seconds on my machine (Parallels on MBP) when it succeeds. If it fails (it shouldn't) then it's the timeout of 40 seconds. Thoughts? |
pablogsal commented Apr 9, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
We have a thing to consider which is that 3.11.9 was the last bug fix release of 3.11 and that is already released. I could make an extra bug fix release to include a fix for this PR but technically this is too late. I will check with the other release managers to see what they think |
diegorusso commented Apr 9, 2024
Sure, no problem. Whatever the outcome is, it has been entertaining debugging it :) |
diegorusso commented May 27, 2024
I just want to give a gentle ping on this PR. What will be its fate? Merge or close? :) Joking apart, it would be nice to have a closure. |
pablogsal commented May 28, 2024 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Unfortunately unless you want to argue that this is a security fix, we cannot backport bug fixes anymore to 3.11. If this issue is something that's affecting a considerable number of users I am willing to reconsider, though. But in any case the release will be folded with the next security release of 3.11. |
diegorusso commented May 28, 2024
Hello, I don't think I can argue that this is a security fix (it isn't :) ) but also I left a comment on dask/distributed#8616 issue as they bumped into the same problem. I requested them to comment directly here: if they have found a way to workaround the issue, we can just close the PR otherwise we need to understand what the blockers are and what's the magnitude of the issue. |
pablogsal commented May 28, 2024
tvalentyn commented Feb 12, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
I work on a python library, and in the last ~2 months at least 3 of our customers have reached out for help debugging a deadlock situation. after much of very involved debugging, we attributed the issue to: #106883 . |
Yhg1s commented Mar 10, 2025
3.11 is yours, Pablo, so it's really up to you. I recall working around this issue at Google, and while I think fixing it is worth while, as RM I'm not sure fixing it in 3.11, at this point is worth the risk... That said, the risk is very small, and it is the case that we're pushing for more use of threads (and testing with threads) with free-threaded Python on the horizon, and it would definitely be nice if threads were less flaky in 3.11... But that's thinking more with my free-threaded contributor hat than my RM hat :P |
pablogsal commented Mar 11, 2025
I just wanted to double check with you and other RMs. But I am convinced to backport this |
6b37486 into python:3.11Uh oh!
There was an error while loading. Please reload this page.
pablogsal commented Mar 11, 2025
merged, this will go in the next security release |
bedevere-bot commented Mar 11, 2025
|
bedevere-bot commented Mar 11, 2025
|
pablogsal commented Mar 11, 2025
Ugh, this is not good. @diegorusso do you mind checking? Otherwise we may need to revert :( |
bedevere-bot commented Mar 12, 2025
|
bedevere-bot commented Mar 12, 2025
|
bedevere-bot commented Mar 12, 2025
|
diegorusso commented Mar 12, 2025
I'm looking into it. |
pablogsal commented Mar 12, 2025
Technically by our policy we will need to revert if this is not fixed in 24h but given that we don't have a release soon, let's wait until Friday EOD max |
diegorusso commented Mar 13, 2025
I plan to make a new PR today. |
diegorusso commented Mar 13, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Upon an initial investigation, the culprit is the low timeout set in the test: increasing the timeout makes the test passing. If we look at the builds related to this commit https://buildbot.python.org/#/changes/40264 we notice that all the failing ones related to this test have been built with On GitHub tests are passing because we build CPython without debug. As a reference on my AArch64 Linux VM the execution time difference of
Another data point. On a software emulated environment:
There is about a 4x difference between the build with debug and without. I suggest to increase the timeout to a level that it is enough for all the platforms, even the older ones. |
diegorusso commented Mar 13, 2025
PR has been raised: #131182 |
When using threaded applications, there is a high risk of a deadlock in the intepreter.
It's a lock ordering deadlock with HEAD_LOCK(&_PyRuntime); and the GIL.
It has been suggested to disable GC during the
_PyThread_CurrentFrames() and _PyThread_CurrentExceptions() calls.
sys._current_frames#106883