Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34.3k
cluster: fix broken/hanging tests#1934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
Olegas commented Jun 10, 2015
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wfm on linux, what system is this hanging on? why is this timeout needed? putting random timeouts in to solve race conditions is fragile, and in this case, I don't see what is being waited for... the worker is listening, the tcp connection is established... what's the pause for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, see this comment: #1400 (comment)
ADD: This is because of internal implementation details of SCHED_RR scheduling policy.
Again, this is not imaginary problem. See: #1896 (comment)
And I can reproduce this at least in my Mac OS from current master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Olegas I didn't say it was imaginary! I said that inserting arbitrary timeouts to solve undescribed race conditions is poor practice. You didn't answer my question: what is the race condition?
Can you describe what ordering of scheduling events is causing this to break on OS X? Usually, rather than timeouts, its possible to explicitly wait for precise events to avoid race conditions and timeouts. If node lacks such events, then timeouts are our last resort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed my previous comment. I found an error in my explanation.
jbergstroem commented Jun 10, 2015
This test has been timing out on multiple buildbots on several occasions as of late. |
sam-github commented Jun 11, 2015
@jbergstroem Any idea how to convince ci to build a PR? I tried putting the pr number in the paramaterized build, and I tried removing master, and putting the PR number in, and it always seems to be the master that is tested. |
sam-github commented Jun 11, 2015
https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/802/, for example. Anyhow, I'm fine with merging this, but since I didn't run it through CI last time, maybe I should this time. Thoughts, @Fishrock123@jbergstroem ? |
jbergstroem commented Jun 11, 2015
@sam-github I always go for user/branch when setting up the parameterised build -- in this case "Olegas" as user and "fix-cluster-test" for branch. Edit: cancel the current run and get a new one going with above suggestion. Lets see how that one pans out! |
sam-github commented Jun 11, 2015
OK, figured it out, test running here: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/806/ |
jbergstroem commented Jun 11, 2015
I got a lint issue: Edit: some of the bots seems to still time out (freebsd101-32, osx1010, freebsd101-64, centos5-32, smartos14-32, smartos14-64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what's the difference between using setTimeout vs passing a callback to .disconnect()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically have it how it was originally except put the call into the .disconnect(). if that callback is never called then there must be an issue with the patch itself on some platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the issue with my approach. still working through it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trevnorris the disconnect callback will never be called, not while there is an open connection to the worker. the test calls disconnect, and waits one second to confirm that disconnect did NOT in fact, cause disconnection... only when the socket is closed, triggered by the write('.'), does the disconnect proceed.
sam-github commented Jun 11, 2015
@jbergstroem I think we should revert 9c0a1b8 on master, and re-merge this PR when CI is good. Is that the right thing here? |
sam-github commented Jun 11, 2015
@Olegas I cleaned up the lint when I merged the first version, you should |
trevnorris commented Jun 11, 2015
I have an alternative test that gets around the need for the |
jbergstroem commented Jun 11, 2015
@trevnorris sounds good. |
trevnorris commented Jun 11, 2015
@sam-github that is normally the correct thing to do. unless it's critical (e.g. a patch of mine recently caused a segfault when using one module) it's not abnormal that we give the author of the patch warning and a day or so to figure it out. not uncommon that a small thing was missed and can be easily fixed. |
trevnorris commented Jun 11, 2015
CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/808/ Feel free to review my changes to the test: Change: trevnorris@768e6b9 |
trevnorris commented Jun 11, 2015
crap. scrap that one. missed a '.' in a path. canceling last one, and might have brought down jenkins in the process... |
jbergstroem commented Jun 11, 2015
@trevnorris investigating jenkins - stalling over here as well. Unfortunately lack access to the host machine though (for now). |
trevnorris commented Jun 11, 2015
Thanks. Working again. New CI: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/809/ |
sam-github commented Jun 11, 2015
@trevnorris its late, I must be misreading your version: no writes occur except in on data handlers... which is a chicken or the egg scenario, noone will ever write. The essential test script is this:
So, some kind of timeout is necessary... unless you wait a bit, how do you know something has not/will not happen? |
jbergstroem commented Jun 11, 2015
Still seems to be failing [by timeout], unfortunately. Output: |
trevnorris commented Jun 11, 2015
yup. didn't commit a line I changed. this is what I get for coding so late. have a fix. one min. @sam-github you are correct. |
trevnorris commented Jun 11, 2015
Alright, one last time: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/810/ |
Olegas commented Jun 11, 2015
@sam-github@trevnorris I don't understand the difference ( And it seems we have another problem here... I've picked @trevnorris 's test and run it - evething is OK, it passed. Then I just comment out ALL I've started investigation and now I can leave all |
trevnorris commented Jun 11, 2015
@Olegas what platform? |
kkoopa commented Jun 11, 2015
Is there any native code involved here? That kind of problem usually shows up when doing |
trevnorris commented Jun 11, 2015
@kkoopa no native code. here's some interesting output showing the test pass on freebsd when run directly, but then fails running it through I'm running another slight variation of the test now to get a better idea of where it's failing. |
Olegas commented Jun 11, 2015
@trevnorris Mac OS X |
trevnorris commented Jun 11, 2015
@Olegas Thanks. Here's the latest run from my test: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/812/nodes=centos5-32/tapTestReport/test.tap-105/ So it seems there's an issue with the server accepting the client's connection. I'm not sure where to go from there in terms of troubleshooting. /cc @jbergstroem |
sam-github commented Jun 11, 2015
I'm testing @Olegas 's theory that the connection is accepted by the cluster master, but the sockfd is not sent to the worker yet: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/813/ |
sam-github commented Jun 11, 2015
My branch passed on osx 1010! That looks promising, but I have to go catch the bus to nodeconf now. |
trevnorris commented Jun 11, 2015
Will the "onconnection" callback for the server not fire in that case? |
Olegas commented Jun 11, 2015
I've added Next, I can't see the
|
Olegas commented Jun 11, 2015
So, I think there are two problems:
|
trevnorris commented Jun 11, 2015
My test shows the server never actually accepts the connection. Otherwise messages would have been passed between them. I don't see how that relates to (2) and unsure if it could relate to (1). |
sam-github commented Jun 11, 2015
It sounds like the bug is hitting problems with robustness of master/worker protocol by doing a disconnect during a connection. |
jbergstroem commented Jun 11, 2015
Can confirm that the commit in |
sam-github commented Jun 12, 2015
replaced by #1953 |
Fixes: #1933