worker: refactor thread life cycle management #26099
Conversation
gireeshpunathil commented Feb 14, 2019 • edited
nodejs-github-bot commented Feb 14, 2019
gireeshpunathil commented Feb 15, 2019
@addaleax - One thing I noticed is that the …
addaleax commented Feb 15, 2019
Yes, it’s working as designed.
It’s a question of correctness – the …
addaleax commented Feb 15, 2019
gireeshpunathil commented Feb 15, 2019
gireeshpunathil commented Feb 15, 2019
addaleax commented Feb 15, 2019
@gireeshpunathil Let me know if you need any help investigating those failures :)
gireeshpunathil commented Feb 15, 2019
```
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
67  ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x3fffa1d54e60 (LWP 32165))]
(gdb) where
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
#1  0x000000001077a3d0 in ?? () at ../deps/uv/src/unix/thread.c:288
#2  0x0000000010669de4 in node::worker::Worker::StopThread(v8::FunctionCallbackInfo<v8::Value> const&) ()
#3  0x00000000108dfc4c in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#4  0x00000000108e09e4 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#5  0x0000000011b5bc68 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
```

```cpp
void Worker::Exit(int code) {
  Mutex::ScopedLock lock(mutex_);
  Debug(this, "Worker %llu called Exit(%d)", thread_id_, code);
  if (!thread_stopper_->IsStopped()) {  // --------------------------> here
    exit_code_ = code;
    Debug(this, "Received StopEventLoop request");
    thread_stopper_->Stop();
    if (isolate_ != nullptr) isolate_->TerminateExecution();
  }
}
```

Looking at this line (where it crashed), I believe that the main thread is trying to terminate the worker even before it came to life - which means `thread_stopper_` had not been constructed yet. Prior to this PR it was not an issue, as we were using primitive fields under `Worker`. Right now I am constructing `thread_stopper_` in the worker thread. I guess we could create it in the parent itself, and the async handle can be late-attached in the worker thread, of course. With that change I ran ppc linux 1000 times and see no crashes. @addaleax - what do you think?
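A minimal sketch of the shape discussed here, assuming the `AsyncRequest`/`Stop()`/`IsStopped()` names from the snippets above and substituting `std::mutex` for Node's internal `Mutex` (an illustration, not the actual node_worker.h code): the object is constructed in the parent before the thread starts, so the terminating path can always take the lock, and only the `uv_async_t` is attached late, from the worker thread.

```cpp
#include <uv.h>
#include <mutex>

class AsyncRequest {
 public:
  // Called from the worker thread once its event loop exists.
  void Install(uv_loop_t* loop, void* data, uv_async_cb target) {
    std::lock_guard<std::mutex> lock(mutex_);
    async_ = new uv_async_t;
    async_->data = data;
    uv_async_init(loop, async_, target);
  }

  // Safe for the parent to call at any time: the object itself already
  // exists, so `this` is never null; if the handle has not been attached
  // yet, only the flag is recorded.
  void Stop() {
    std::lock_guard<std::mutex> lock(mutex_);
    stopped_ = true;
    if (async_ != nullptr) uv_async_send(async_);
  }

  bool IsStopped() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return stopped_;
  }

 private:
  uv_async_t* async_ = nullptr;  // late-attached; lives on the worker's loop
  mutable std::mutex mutex_;
  bool stopped_ = false;
};
```

Handle teardown (`uv_close()`) is omitted in this sketch; that part is exactly what the Windows discussion below runs into.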
gireeshpunathil commented Feb 15, 2019
btw AIX also failed the same test, and I was anticipating it!
gireeshpunathil commented Feb 15, 2019
new CI with the changes: https://ci.nodejs.org/job/node-test-pull-request/20794/
addaleax commented Feb 15, 2019
@gireeshpunathil Yes, that makes a lot of sense. Do these …
gireeshpunathil commented Feb 15, 2019
you mean to unwrap the …
gireeshpunathil commented Feb 15, 2019
btw wrote a new test that recreated the crash in xlinux with the old code and confirmed the theory. When the main thread went to terminate, the worker was still being cloned.

```
(gdb) where
#0  0x00007f1d0de23c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000000001019239 in uv_mutex_lock (mutex=0x18) at ../deps/uv/src/unix/thread.c:287
#2  0x0000000000d9b0a0 in node::LibuvMutexTraits::mutex_lock (mutex=0x18) at ../src/node_mutex.h:108
#3  0x0000000000d9ce86 in node::MutexBase<node::LibuvMutexTraits>::ScopedLock::ScopedLock (this=0x7ffe42fea980, mutex=...) at ../src/node_mutex.h:164
#4  0x0000000000efd41a in node::worker::AsyncRequest::IsStopped (this=0x0) at ../src/node_worker.cc:87
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1) at ../src/node_worker.cc:549
#6  0x0000000000efff41 in node::worker::Worker::StopThread (args=...) at ../src/node_worker.cc:526
#7  0x0000000001185b4d in v8::internal::FunctionCallbackArguments::Call (this=this@entry=0x7ffe42feab90, handler=handler@entry=0x18d96d7f5391) at ../deps/v8/src/api-arguments-inl.h:140
#8  0x0000000001186802 in v8::internal::(anonymous namespace)::HandleApiCallHelper<false> (isolate=isolate@entry=0x4f6da90, function=..., function@entry=..., new_target=..., new_target@entry=..., fun_data=..., receiver=..., receiver@entry=..., args=...) at ../deps/v8/src/builtins/builtins-api.cc:109
#9  0x000000000118ac2b in v8::internal::Builtin_Impl_HandleApiCall (args=..., isolate=isolate@entry=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:139
#10 0x000000000118b641 in v8::internal::Builtin_HandleApiCall (args_length=5, args_object=0x7ffe42feadb8, isolate=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:127
#11 0x000000000249ea95 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#12 0x00000840f988cb8e in ?? ()
#13 0x000004c1108025a1 in ?? ()
#14 0x000018d96d7f5621 in ?? ()
#15 0x0000000500000000 in ?? ()
#16 0x000004c110802681 in ?? ()
#17 0x00003a17f9dd3be9 in ?? ()
(gdb) f 5
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1) at ../src/node_worker.cc:549
549       if (!thread_stopper_->IsStopped()) {
(gdb) thr 8
[Switching to thread 8 (Thread 0x7f1d06ffd700 (LWP 42711))]
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
(gdb) where
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
#1  0x00007f1d0de21d10 in ?? () from /lib64/libpthread.so.0
#2  0x00007f1d06ffd700 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)
```
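A side note on reading these traces: frame 4 shows `IsStopped (this=0x0)` while frame 0 shows `mutex=0x18`, the classic signature of a member access through a null pointer: the "address" of the mutex is just the member's offset within the object. A self-contained illustration with a made-up layout (the real node_worker.h layout may differ):

```cpp
#include <cstddef>
#include <cstdio>

// Stand-in types; only the pointer arithmetic matters here.
struct FakeMutex { unsigned char storage[40]; };
struct FakeAsyncRequest {
  void* async_;      // offset 0x00
  void* pad1_;       // offset 0x08 (hypothetical intervening fields)
  void* pad2_;       // offset 0x10
  FakeMutex mutex_;  // offset 0x18 in this made-up layout
};

int main() {
  // Prints 0x18: locking `mutex_` on a null FakeAsyncRequest* would pass
  // (void*)0x18 to pthread_mutex_lock, matching "mutex=0x18" above while
  // frame 4 shows "this=0x0".
  std::printf("offset of mutex_: 0x%zx\n", offsetof(FakeAsyncRequest, mutex_));
}
```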
addaleax commented Feb 15, 2019
Sorry for being unclear – what I meant was to use …
gireeshpunathil commented Feb 16, 2019
thanks @addaleax - I followed that suggestion. Several tests were crashing in windows. Here is what I found:

```
00007FF6A9505BE0  push  rbx
00007FF6A9505BE2  sub   rsp,20h
00007FF6A9505BE6  mov   rbx,qword ptr [rcx]
00007FF6A9505BE9  mov   edx,0E0h
00007FF6A9505BEE  mov   rax,qword ptr [rbx]
>> 00007FF6A9505BF1  dec   dword ptr [rax+7D0h]
00007FF6A9505BF7  mov   rax,qword ptr [rbx+10h]
00007FF6A9505BFB  mov   qword ptr [rcx],rax
00007FF6A9505BFE  call  operator delete (07FF6A93C1B4Ah)
00007FF6A9505C03  mov   edx,18h
00007FF6A9505C08  mov   rcx,rbx
00007FF6A9505C0B  add   rsp,20h
00007FF6A9505C0F  pop   rbx
```

this points to

```cpp
uv_close(reinterpret_cast<uv_handle_t*>(handle), [](uv_handle_t* handle) {
  std::unique_ptr<CloseData> data { static_cast<CloseData*>(handle->data) };
  data->env->handle_cleanup_waiting_--;
  handle->data = data->original_data;
  data->callback(reinterpret_cast<T*>(handle));
});
```

It is possible that when we issue the … If I nullify the … Looking for suggestions at this point!
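The snippet above is the generic close-handle helper pattern; the important property for this bug is that `uv_close()` completes asynchronously. A minimal sketch of the lifetime rule, under the assumption (suggested by the disassembly) that the crash is the close callback touching already-freed state:

```cpp
#include <uv.h>

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_async_t* handle = new uv_async_t;
  uv_async_init(&loop, handle, [](uv_async_t*) {});

  // uv_close() only *requests* the close; the callback runs on a later
  // loop iteration, so the handle and anything reachable from
  // handle->data must stay alive until then.
  uv_close(reinterpret_cast<uv_handle_t*>(handle), [](uv_handle_t* h) {
    // Only at this point may the handle's memory be released.
    delete reinterpret_cast<uv_async_t*>(h);
  });
  // Freeing `handle` here instead - or tearing down the environment the
  // callback dereferences - is a use-after-free once the loop runs.

  uv_run(&loop, UV_RUN_DEFAULT);  // the close callback fires in here
  uv_loop_close(&loop);
}
```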
gireeshpunathil commented Feb 16, 2019
ok, I guess I found the issue - I was issuing …
addaleax commented Feb 16, 2019
@gireeshpunathil Yeah, I think that makes sense. :)
0035d2c to f53a1e4
src/node_worker.h Outdated
addaleax
We’re not tracking the uv_async_t anymore, right? Maybe we should add something like tracker->TrackInlineField() that allows us to keep track of MemoryRetainers that are direct members of the class…
gireeshpunathil
you mean - the one represented by thread_exit_async_? that is replaced with AsyncRequest objects that create uv_async_t objects and track them through the interface method. the async_ field in AsyncRequest is still a pointer, so the uv_async_t is a direct member of neither AsyncRequest nor Worker.
addaleax
Neither *thread_stopper_.async_ nor *on_thread_finished_.async_ are tracked, yes, because we don’t inform the tracker about the existence of the AsyncRequest fields.
Also, side note: I’m just noticing that we have the IsolateData and Environment fields listed here as well, which I’m not sure makes sense given that they are no longer directly allocated by this object…
gireeshpunathil Feb 17, 2019 • edited
sorry, I don't understand. *thread_stopper_.async_ and *on_thread_finished_.async_ are not tracked through the tracker instance of Worker, but they are tracked through the tracker instance of the AsyncRequest object (line 98):

```cpp
void AsyncRequest::MemoryInfo(MemoryTracker* tracker) const {
  Mutex::ScopedLock lock(mutex_);
  if (async_ != nullptr) tracker->TrackField("async_request", *async_);
}
```

Isn't it enough? I hope we don't need multiple trackers for the same allocation?
For the IsolateData and Environment: I just removed those from being actively tracked by the worker and pushed that change under this PR itself.
addaleax
@gireeshpunathil The problem is that the memory tracker doesn’t know that it should call AsyncRequest::MemoryInfo. Currently, the way to inform it would be adding tracker->TrackField("thread_stopper_", &thread_stopper_);, but then we would end up tracking the memory for the AsyncRequest itself twice.
addaleax
@gireeshpunathil Should we change this PR to use TrackInlineField now?
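To make the double-counting concern concrete, here is a self-contained toy model (not the Node.js MemoryTracker API; `TrackField`/`TrackInlineField` are stand-ins with the semantics described above):

```cpp
#include <cstddef>
#include <cstdio>

struct Tracker {
  std::size_t total = 0;
  // TrackField-style: creates a separate tracked node, adding its size.
  void TrackField(std::size_t self_size) { total += self_size; }
  // TrackInlineField-style: records only the edge; the bytes were already
  // counted as part of the enclosing object's self size.
  void TrackInlineField(std::size_t /*self_size*/) {}
};

struct AsyncRequest { char state[64]; };
struct Worker {
  AsyncRequest thread_stopper_;  // inline member: sizeof(Worker) covers it
};

int main() {
  Tracker t;
  t.TrackField(sizeof(Worker));        // the Worker's own self size
  t.TrackField(sizeof(AsyncRequest));  // naive: counts the member AGAIN
  std::printf("overcounted by %zu bytes\n", sizeof(AsyncRequest));
  // An inline-field API would still report the thread_stopper_ edge (so
  // AsyncRequest::MemoryInfo runs and tracks *async_) without adding the
  // member's own size a second time.
}
```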
gireeshpunathil commented Feb 28, 2019
addaleax commented Feb 28, 2019
gireeshpunathil commented Mar 1, 2019
addaleax commented Mar 1, 2019
Landed in d14cba4 :)
The current mechanism uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.

Also there are a few flags that represent the other thread's state /
request state, because certain code paths are shared by multiple
control flows, and there are certain code paths where the async
handles may not have come to life.

Refactor into an AsyncRequest abstraction that exposes routines to
install a handle as well as to save a state.

PR-URL: #26099
Refs: #21283
Reviewed-By: Anna Henningsen <[email protected]>
The current mechanism uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.
Also there are a few flags that represent the other thread's state /
request state, because certain code paths are shared by multiple
control flows, and there are certain code paths where the async
handles may not have come to life.
Refactor into a LoopStopper abstraction that exposes routines to
install a handle as well as to save a state.
Refs: #21283
The approach can be re-used for stopping the main Node application thread in-flight.
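A hedged fragment of how the two handles pair up per this description, reusing the `AsyncRequest` sketch from the review discussion above (the member names `worker_loop_`, `thread_stopper_`, and `on_thread_finished_` follow the thread, but this is an illustration, not the literal node_worker.cc code):

```cpp
void Worker::Run() {
  // Parent -> worker: installed on the *worker's* loop, so the parent's
  // thread_stopper_.Stop() can interrupt a running worker.
  thread_stopper_.Install(&worker_loop_, this, [](uv_async_t* a) {
    // Runs on the worker thread: begin an orderly shutdown.
    uv_stop(a->loop);
  });

  // ... run the worker's event loop to completion ...

  // Worker -> parent: installed on the *parent's* loop at construction
  // time; fired here on natural termination so the parent can join the
  // thread and emit 'exit'.
  on_thread_finished_.Stop();
}
```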
cc @addaleax @nodejs/workers
Checklist
make -j4 test (UNIX) or vcbuild test (Windows) passes