gh-107219: Fix concurrent.futures terminate_broken() #109244

vstinner · 2023-09-10T23:28:10Z

Fix a race condition in concurrent.futures. When a process in the process pool was terminated abruptly (while the future was running or pending), close the connection write end. If the call queue is blocked on sending bytes to a worker process, closing the connection write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts WaitForMultipleObjects() in _send_bytes() by cancelling the overlapped operation.

Issue: test_concurrent_futures.test_deadlock: test_crash_big_data() hangs randomly on Windows #107219

vstinner · 2023-09-10T23:37:01Z

@serhiy-storchaka @methane @ambv @gpshead @pitrou: Would you mind having a look?

I would like to merge this fix as soon as possible, since bug #107219 is badly affecting the Python workflow. The CI failure rate is very high because of this test_concurrent_futures.test_deadlock hang.

For now, I prefer to use WSA_OPERATION_ABORTED = 995 in Lib/multiprocessing/connection.py to ease backports. Later, I will try to add this constant somewhere :-) My first attempt to add it to the errno module didn't work (I didn't insist, I was working on the fix).

vstinner · 2023-09-10T23:41:50Z

With this change, I can no longer reproduce the bug.

On my Windows VM which has 2 CPUs, I can easily reproduce the hang in around 30 seconds on the Python main branch:

Terminal 1: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
Terminal 2: python -m test -j2 -r

I stressed the test with:

Terminal 1, terminal 2 and terminal 3 (3 processes):
- python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=10
Terminal 4: python -m test test_concurrent_futures.test_deadlock -m test.test_concurrent_futures.test_deadlock.ProcessPoolSpawnExecutorDeadlockTest.test_crash_big_data --forever -v --timeout=30 -j2
Terminal 5: python -m test -j1 -r -u all

With this change, I could no longer reproduce the bug in 8 minutes of stress testing.

Bonus: I can no longer hang the test by interrupting it with CTRL+C.

vstinner · 2023-09-10T23:53:39Z

Windows (x64) (pull_request) Successful

Oh! For the first time in like 2 weeks, test_concurrent_futures.test_deadlock did not hang in the GHA Windows x64 job!

Note: There are only these two unrelated failures:

2 re-run tests: test.test_asyncio.test_windows_events test.test_concurrent_futures.test_as_completed

These 2 tests passed when re-run in verbose mode (Result: FAILURE then SUCCESS).

vstinner · 2023-09-11T01:28:21Z

Lib/multiprocessing/connection.py

+ov = self._send_ov
+if ov is not None:
+    # Interrupt WaitForMultipleObjects() in _send_bytes()
+    ov.cancel()


asyncio uses similar code in ProactorEventLoop:
cpython/Lib/asyncio/windows_events.py
Lines 67 to 81 in 1ec4537

    def _cancel_overlapped(self):
        if self._ov is None:
            return
        try:
            self._ov.cancel()
        except OSError as exc:
            context = {
                'message': 'Cancelling an overlapped future failed',
                'exception': exc,
                'future': self,
            }
            if self._source_traceback:
                context['source_traceback'] = self._source_traceback
            self._loop.call_exception_handler(context)
        self._ov = None

asyncio wraps this in more advanced code to handle more cases. For example, in asyncio, cancel() is part of the public API.
Here, cancellation is a standard action in the Windows overlapped API. The cancellation is synchronous, so it's easy!
Fortunately, we are not in the very complicated RegisterWaitWithQueue() case! That case requires an asynchronous cancellation, which is really complicated to handle: the completion of the cancellation has to be awaited. See this horror story: https://vstinner.github.io/asyncio-proactor-cancellation-from-hell.html
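
To make the close()/send interaction concrete, here is a minimal, platform-independent simulation. It does not use _winapi: FakeOverlapped and FakePipeConnection are hypothetical stand-ins, threading.Event plays the role of the overlapped event, and the lock is an addition for the sketch that the real code does not have.

```python
import errno
import threading


class FakeOverlapped:
    """Hypothetical stand-in for a Windows overlapped operation."""

    def __init__(self):
        self.event = threading.Event()  # set when the operation finishes
        self.cancelled = False

    def cancel(self):
        # Cancelling completes the operation with an "aborted" status.
        self.cancelled = True
        self.event.set()


class FakePipeConnection:
    """Models only the close()/send interaction, nothing else."""

    def __init__(self):
        self._lock = threading.Lock()  # assumption: not in the real code
        self._closed = False
        self._send_ov = None

    def send_bytes(self, buf):
        ov = FakeOverlapped()
        with self._lock:
            if self._closed:
                raise OSError(errno.EPIPE, "handle is closed")
            self._send_ov = ov
        try:
            # Stands in for WaitForMultipleObjects(): the peer never reads,
            # so only cancel() can wake this wait up.
            ov.event.wait()
        finally:
            self._send_ov = None
        if ov.cancelled:
            raise OSError(errno.EPIPE, "handle is closed")
        return len(buf)

    def close(self):
        with self._lock:
            self._closed = True
            ov = self._send_ov
        if ov is not None:
            # Interrupt the sender blocked in send_bytes().
            ov.cancel()


def demo():
    conn = FakePipeConnection()
    errors = []

    def sender():
        try:
            conn.send_bytes(b"x" * 1024)
        except OSError as exc:
            errors.append(exc)

    thread = threading.Thread(target=sender)
    thread.start()
    conn.close()  # called from another thread, as in the fix
    thread.join(timeout=5)
    return thread.is_alive(), errors
```

Calling demo() should return with the sender thread finished and a single BrokenPipeError collected, whichever thread wins the race to the lock.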

vstinner · 2023-09-11T01:33:10Z

Lib/multiprocessing/connection.py

+# close() was called by another thread while
+# WaitForMultipleObjects() was waiting for the overlapped
+# operation.
+raise OSError(errno.EPIPE, "handle is closed")


I chose to raise a BrokenPipeError exception here, since Queue._feed() has a special code path that silently ignores EPIPE errors:
cpython/Lib/multiprocessing/queues.py
Lines 255 to 257 in 1ec4537

    except Exception as e:
        if ignore_epipe and getattr(e, 'errno', 0) == errno.EPIPE:
            return

And concurrent.futures uses this code path for its "call queue", which is what is causing trouble here:
cpython/Lib/concurrent/futures/process.py
Lines 724 to 732 in 1ec4537

    self._call_queue = _SafeQueue(
        max_size=queue_size, ctx=self._mp_context,
        pending_work_items=self._pending_work_items,
        shutdown_lock=self._shutdown_lock,
        thread_wakeup=self._executor_manager_thread_wakeup)
    # Killed worker processes can produce spurious "broken pipe"
    # tracebacks in the queue's own worker thread. But we detect killed
    # processes anyway, so silence the tracebacks.
    self._call_queue._ignore_epipe = True
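
The two pieces fit together because Python maps OSError(errno.EPIPE, ...) to the BrokenPipeError subclass, and the errno attribute is what the _feed() check inspects. A small sketch of that check in isolation; feed_should_swallow is a hypothetical helper name, not something in multiprocessing:

```python
import errno


def feed_should_swallow(exc, ignore_epipe):
    """Same condition as the quoted check: True if the error is dropped."""
    return bool(ignore_epipe and getattr(exc, 'errno', 0) == errno.EPIPE)


exc = OSError(errno.EPIPE, "handle is closed")

print(type(exc).__name__)                          # BrokenPipeError
print(feed_should_swallow(exc, True))              # True
print(feed_should_swallow(exc, False))             # False
print(feed_should_swallow(ValueError("x"), True))  # False
```

Exceptions without an errno attribute fall through to the normal error handling, since getattr() defaults to 0.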

Sounds like we got lucky that callers were already handling one thing we could raise! :)

Initially, I added a new exception, but then I chose to reuse the existing code path instead. IMO BrokenPipeError makes perfect sense for a PipeConnection.

serhiy-storchaka

LGTM.

But I have one suggestion and one question/suggestion.

serhiy-storchaka · 2023-09-11T07:04:58Z

Lib/multiprocessing/connection.py

 finally:
+    self._send_ov = None
     nwritten, err = ov.GetOverlappedResult(True)
+if err == WSA_OPERATION_ABORTED:


What other value can it be? There is assert err == 0 below, so I guess that any error was unexpected.
Could we simply check that err is not zero here?

I chose to write a minimal change: change as little code as possible. I introduced one new error and added a check for that error, and that's all. I don't know the code well enough to answer your question. I'm not a multiprocessing or Windows API expert at all :-(

serhiy-storchaka · 2023-09-11T08:11:04Z

Lib/multiprocessing/connection.py

 BUFSIZE = 8192
 # A very generous timeout when it comes to local connections...
 CONNECTION_TIMEOUT = 20.
+WSA_OPERATION_ABORTED = 995


It is the same as _winapi.ERROR_OPERATION_ABORTED.

Now I'm confused. I don't recall which doc I was looking at. WriteFile() is documented to return ERROR_OPERATION_ABORTED when it is canceled: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
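
Since both names refer to the same numeric code, one possible way to avoid a second hard-coded constant is to prefer the _winapi value and fall back to the literal only where it is missing. This is a sketch, not what the PR merged; the fallback logic is an assumption made so the snippet also runs off Windows:

```python
try:
    import _winapi  # private, Windows-only module
except ImportError:
    _winapi = None  # non-Windows platform: use the literal below

# Prefer the constant exposed by _winapi; 995 is the documented Win32 value.
ERROR_OPERATION_ABORTED = getattr(_winapi, "ERROR_OPERATION_ABORTED", 995)

# The name used in the first version of the patch is just an alias.
WSA_OPERATION_ABORTED = ERROR_OPERATION_ABORTED
```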

miss-islington · 2023-09-11T08:11:35Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

bedevere-bot · 2023-09-11T08:11:36Z

There's a new commit after the PR has been approved.

@serhiy-storchaka: please review the changes made to this pull request.

…109244)

Fix a race condition in concurrent.futures. When a process in the process pool was terminated abruptly (while the future was running or pending), close the connection write end. If the call queue is blocked on sending bytes to a worker process, closing the connection write end interrupts the send, so the queue can be closed.

Changes:

* _ExecutorManagerThread.terminate_broken() now closes call_queue._writer.
* multiprocessing PipeConnection.close() now interrupts WaitForMultipleObjects() in _send_bytes() by cancelling the overlapped operation.

(cherry picked from commit a9b1f84)

Co-authored-by: Victor Stinner <vstinner@python.org>

bedevere-bot · 2023-09-11T08:11:47Z

GH-109254 is a backport of this pull request to the 3.12 branch.

bedevere-bot · 2023-09-11T08:11:56Z

GH-109255 is a backport of this pull request to the 3.11 branch.

vstinner · 2023-09-11T08:13:50Z

PR merged, thanks for the review @serhiy-storchaka.

I wanted to merge this fix ASAP since it was blocking other PRs from being merged.

serhiy-storchaka

According to the sources of GetOverlappedResult() in _winapi.c, the only possible values of err are ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED, and ERROR_IO_INCOMPLETE.
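
As an illustration of how those four values could be told apart after the wait, here is a sketch with the Win32 codes hard-coded so it runs anywhere. check_send_result is a hypothetical helper, and raising RuntimeError for the unexpected codes is an assumption of this sketch (the code under review uses assert err == 0 instead):

```python
import errno

# Win32 error codes, hard-coded so the sketch does not depend on _winapi.
ERROR_SUCCESS = 0
ERROR_MORE_DATA = 234
ERROR_OPERATION_ABORTED = 995
ERROR_IO_INCOMPLETE = 996


def check_send_result(err, nwritten, expected):
    """Turn the (nwritten, err) pair of a finished write into a result."""
    if err == ERROR_OPERATION_ABORTED:
        # The write was cancelled, e.g. by close() in another thread.
        raise OSError(errno.EPIPE, "handle is closed")
    if err != ERROR_SUCCESS:
        # ERROR_MORE_DATA and ERROR_IO_INCOMPLETE are not expected for a
        # completed write; this sketch reports them explicitly.
        raise RuntimeError(f"unexpected overlapped result: {err}")
    if nwritten != expected:
        raise RuntimeError(f"short write: {nwritten} of {expected} bytes")
    return nwritten
```

With this shape, only the aborted case is translated into the EPIPE error that the queue feeder already knows how to ignore.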

serhiy-storchaka · 2023-09-11T08:16:35Z

Great work, @vstinner!

vstinner · 2023-09-11T21:05:30Z

> According to the sources of GetOverlappedResult() in _winapi.c, the only possible values of err are ERROR_SUCCESS (0), ERROR_MORE_DATA, ERROR_OPERATION_ABORTED, and ERROR_IO_INCOMPLETE.

Well, if you're confident, you can modify the assert err == 0 in the code.

By the way, having nwritten, err = ov.GetOverlappedResult(True) in the finally: block sounds wrong to me. What if _winapi.WaitForMultipleObjects() raises an exception? Why is it important to call ov.GetOverlappedResult(True) in this case? But well, since I don't know the code, I prefer to not touch it!
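
The control flow in question can be observed with plain Python stand-ins. In this sketch, wait() and get_result() are hypothetical placeholders for WaitForMultipleObjects() and ov.GetOverlappedResult(True); it only shows the ordering, not why the real code wants the result in that case:

```python
calls = []


def wait():
    # Placeholder for _winapi.WaitForMultipleObjects() raising.
    calls.append("wait")
    raise RuntimeError("interrupted")


def get_result():
    # Placeholder for ov.GetOverlappedResult(True).
    calls.append("get_result")


def send():
    try:
        wait()
    finally:
        # Runs even though wait() raised; the exception propagates afterwards.
        get_result()


try:
    send()
except RuntimeError as exc:
    calls.append(f"propagated: {exc}")

print(calls)  # ['wait', 'get_result', 'propagated: interrupted']
```

So the finally block does run on the error path, and the original exception still reaches the caller as long as the call inside finally does not raise itself.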

> Great work, @vstinner!

Thanks.

vstinner added the needs backport to 3.11 and needs backport to 3.12 labels Sep 10, 2023

bedevere-bot added the awaiting core review label Sep 10, 2023

bedevere-bot mentioned this pull request Sep 10, 2023
test_concurrent_futures.test_deadlock: test_crash_big_data() hangs randomly on Windows #107219
Closed

vstinner commented Sep 11, 2023

serhiy-storchaka self-requested a review September 11, 2023 06:47

serhiy-storchaka approved these changes Sep 11, 2023

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 11, 2023

vstinner added 2 commits September 11, 2023 09:47

Remove PipeConnection.__init__()
069fbfa
Address Serhiy's review.

vstinner force-pushed the cf_termine_broken branch from 9987dc7 to 069fbfa September 11, 2023 07:47

vstinner enabled auto-merge (squash) September 11, 2023 07:48

serhiy-storchaka reviewed Sep 11, 2023

serhiy-storchaka reviewed Sep 11, 2023

vstinner merged commit a9b1f84 into python:main Sep 11, 2023

vstinner deleted the cf_termine_broken branch September 11, 2023 08:11

bedevere-bot removed the awaiting merge label Sep 11, 2023

bedevere-bot added the awaiting core review label Sep 11, 2023

bedevere-bot requested a review from serhiy-storchaka September 11, 2023 08:11

bedevere-bot removed the needs backport to 3.12 label Sep 11, 2023

bedevere-bot removed the needs backport to 3.11 label Sep 11, 2023

serhiy-storchaka reviewed Sep 11, 2023

vstinner mentioned this pull request Sep 11, 2023
gh-109162: libregrtest: move code around #109253
Merged

vstinner mentioned this pull request Sep 22, 2023
test_concurrent_futures.test_shutdown: test_interpreter_shutdown() fails randomly (race condition) #109047
Closed

colesbury mentioned this pull request Mar 24, 2025
test_multiprocessing_spawn.test_processes flaky tests on Windows #130733
Open
