Skip to content

Conversation

@ambrop72
Copy link

When destruction starts or the pool size is reduced, only the last worker thread could in fact stop, while the other workers that should stop would enter a busy wait state, since the condition they are waiting for via a condition variable would be satisfied. The busy waiting threads could starve the last thread from CPU, delaying downsizing or shutdown. This is particularly noticable when running via Valgrind; shutdown delays in the order of tens of seconds have been observed.

Fix it by making each thread stop as soon as it can. Doing this required adjusting the management of the workers vector:

  • When a worker is stopping, it may not yet be possible to downsize the vector (if it is not the last worker). In that case an invalid thread entry is left in the vector, and the vector will be downsized when possible as other threads stop.
  • When upsizing, some threads above the current limit may still be running (if there was a recent downsizing and they haven't stopped yet). Such threads just keep running.

When destruction starts or the pool size is reduced, only the last worker thread could in fact stop, while the other workers that should stop would enter a busy wait state, since the condition they are waiting for via a condition variable would be satisfied. The busy waiting threads could starve the last thread from CPU, delaying downsizing or shutdown. This is particularly noticable when running via Valgrind; shutdown delays in the order of tens of seconds have been observed. Fix it by making each thread stop as soon as it can. Doing this required adjusting the management of the workers vector: - When a worker is stopping, it may not yet be possible to downsize the vector (if it is not the last worker). In that case an invalid thread entry is left in the vector, and the vector will be downsized when possible as other threads stop. - When upsizing, some threads above the current limit may still be running (if there was a recent downsizing and they haven't stopped yet). Such threads just keep running.
@wilxwilx self-assigned this May 22, 2020
@wilxwilx requested review from wilx and removed request for wilxMay 29, 2020 17:57
@wilxwilx added the bug label May 29, 2020
@wilx
Copy link

wilx commented May 29, 2020

Do you have a test case that shows how it fails and how your patch fixes it? (Preferably independent of log4cplus.) I think I see it but it would be nice to have a test case.

@ambrop72
Copy link
Author

ambrop72 commented May 29, 2020

#include "ThreadPool.h" int main(){progschj::ThreadPool pool} 
g++ example.cpp -o example -lpthread 
$ time valgrind ./example ==31188== Memcheck, a memory error detector ==31188== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==31188== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==31188== Command: ./example ==31188== (gets stuck eating CPU, may terminate if given enough time) (CTRL+C to see a back stack trace) ==31188== Process terminating with default action of signal 2 (SIGINT) ==31188== at 0x4027D7: std::deque<std::function<void ()>, std::allocator<std::function<void ()> > >::empty() const (in /home/ambro/ThreadPool/example) ==31188== by 0x402187: std::queue<std::function<void ()>, std::deque<std::function<void ()>, std::allocator<std::function<void ()> > > >::empty() const (in /home/ambro/ThreadPool/example) ==31188== by 0x401A9F: progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}::operator()() const (in /home/ambro/ThreadPool/example) ==31188== by 0x40486B: void std::__invoke_impl<void, progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}>(std::__invoke_other, progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}&&) (in /home/ambro/ThreadPool/example) ==31188== by 0x404820: std::__invoke_result<progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}>::type std::__invoke<progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}>(std::__invoke_result&&, (progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}&&)...) (in /home/ambro/ThreadPool/example) ==31188== by 0x4047CD: void std::thread::_Invoker<std::tuple<progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (in /home/ambro/ThreadPool/example) ==31188== by 0x4047A3: std::thread::_Invoker<std::tuple<progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}> >::operator()() (in /home/ambro/ThreadPool/example) ==31188== by 0x404787: std::thread::_State_impl<std::thread::_Invoker<std::tuple<progschj::ThreadPool::emplace_back_worker(unsigned long)::{lambda()#1}> > >::_M_run() (in /home/ambro/ThreadPool/example) ==31188== by 0x493BB3F: ??? (in /nix/store/w7alid4mirzwx3ck4hj18q7rnr4yslfh-gcc-9.2.0-lib/lib/libstdc++.so.6.0.27) ==31188== by 0x484BEDC: start_thread (in /nix/store/9hy6c2hv8lcwc6clnc1p2jf09cs5q9dp-glibc-2.30/lib/libpthread-2.30.so) ==31188== by 0x4C98A4E: clone (in /nix/store/9hy6c2hv8lcwc6clnc1p2jf09cs5q9dp-glibc-2.30/lib/libc-2.30.so) 

With the fix, terminates normally in about a second.

@wilxwilx merged commit fc20cee into log4cplus:masterMay 29, 2020
wilx added a commit to log4cplus/log4cplus that referenced this pull request May 29, 2020
wilx added a commit to log4cplus/log4cplus that referenced this pull request May 29, 2020
fwsGonzo pushed a commit to fwsGonzo/ThreadPool that referenced this pull request Jan 10, 2023
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

@ambrop72@wilx