Skip to content

Conversation

@refack
Copy link
Contributor

@refackrefack commented Apr 27, 2017

As discussed in most recently on #10286test/parallel/test-child-process-fork-regr-gh-2847.js is a little flaky
Last sighting was

c:\workspace\node-test-binary-windows\RUN_SUBSET\1\VS_VERSION\vcbt2015\label\win10\test\parallel\test-child-process-fork-regr-gh-2847.js:61 throw err; ^ Error: write EMFILE at exports._errnoException (util.js:1026:11) at ChildProcess.target._send (internal/child_process.js:663:20) at ChildProcess.target.send (internal/child_process.js:547:19) at Worker.send (internal/cluster/worker.js:54:28) at Socket.<anonymous> (c:\workspace\node-test-binary-windows\RUN_SUBSET\1\VS_VERSION\vcbt2015\label\win10\test\parallel\test-child-process-fork-regr-gh-2847.js:33:14) at Object.onceWrapper (events.js:312:19) at emitNone (events.js:105:13) at Socket.emit (events.js:207:7) at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1102:10) 

Possible explanation #3635 (comment)
So as a follow up to #5422 I suggest we also ignore EMFILE raised from the connection.

Ref: #3635
Ref: #5178
Ref: #5179
Ref: #4005
Ref: #5121
Ref: #5422
Ref: #12621 (comment)
Fixes: #10286

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

@nodejs-github-botnodejs-github-bot added the test Issues and PRs related to the tests. label Apr 27, 2017
@refack
Copy link
ContributorAuthor

@Trott
Copy link
Member

@nodejs/testing

Trott
Trott previously requested changes Apr 27, 2017
Copy link
Member

@TrottTrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a reasonable explanation as to why EMFILE is being raised, I'm not thrilled with the idea of adding it just to make the test pass. The other errors listed make sense to me based on the comment above, but EMFILE seems surprising. This could just be my own ignorance....

@refack
Copy link
ContributorAuthor

refack commented Apr 27, 2017

Sorry, I missed adding that reference: #3635 (comment)
P.S. The test passes most of the time without this change, so IMHO it's not forcing it to pass, but making is less flaky.

Copy link
Member

@gibfahngibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc/ @nodejs/platform-windows

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@refack could you put a brief two line explanation of why this error occurs in a comment above? When reading through tests mysteriously ignored errors are always a bit of code smell.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack.
PTAL

@mscdexmscdex added the child_process Issues and PRs related to the child_process subsystem. label Apr 27, 2017
@refack
Copy link
ContributorAuthor

@Trott@gibfahn
I tried to document both the original purpose of this test, and the possible causes for the "EMFILE".
test/parallel/test-child-process-fork-regr-gh-2847.js has a very long history of being flaky.
As far as I can tell it's original purpose is to generate a very specific condition that pre #2847 used to cause a segfault, because this is based on asynchronous tasks, sometimes we just hit spurious connection errors.
It specific error was last seen on Windows, but I don't know if this is a Windows only scenario.

@refack
Copy link
ContributorAuthor

This exclusion would be a continuation to #5422
/cc @santigimeno

@Trott
Copy link
Member

Let's see if a stress test can reproduce the error (or other errors) and if load on the machine matters at all. If the answer here is "move to sequential", I'd rather do that than have a bunch of exceptions in the code. Maybe we can even remove an exception or two if we go that route. Or not. Let's see...

Stress test on win10 with just one test running at a time:
https://ci.nodejs.org/job/node-stress-single-test/1180/nodes=win10/console

Stress test on win10 with 100 tests running at a time:
https://ci.nodejs.org/job/node-stress-single-test/1181/nodes=win10/console

@Trott
Copy link
Member

Trott commented Apr 28, 2017

Stress tests seem to show that EMFILE happens under load. Looks like moving the test to sequential may be the way to go therefore? Although I'd be interested in understanding why load causes EMFILE here. Maybe it's obvious once I look at the test, but I gotta log off right now....

@refack
Copy link
ContributorAuthor

refack commented Apr 28, 2017

Maybe windows is actually running out of file handles?
More probably high load causes the scheduler to be more aggressive, hence races.
Is there a way to test is sequentially but under external synthetic load?

IMHO if we want to just keep the original intention of the test (protect from regressing #2847) we could ignore all errors, except a SEGFAULT. If you think this test is useful for covering other cases, then it's worth the extra time...

@Trott
Copy link
Member

Hope I'm not being presumptuous or anything, but on my own fork/branch: I removed all the guard code except the one that I"m sure is needed, moved to sequential, opened #12713 but labeled it in progress and started a CI stress test using the ALL label at https://ci.nodejs.org/job/node-stress-single-test/1182/ to see if any other guard code needs to be restored.

I'm in favor of having this be specific to that regression. I like my tests narrowly focused, especially ones designed for oddball regressions. :-D

@refack
Copy link
ContributorAuthor

Why not have them both, one sequential and unfiltered, and one parallel with the filters? All well documented of course.

@gibfahn
Copy link
Member

Why not have them both, one sequential and unfiltered, and one parallel with the filters? All well documented of course.

Why would we? Moving a test to sequential just means it can't be run in parallel, which is a minor annoyance as it makes our test runs slightly slower. Duplicating the test seems unnecessary.

Is there a way to test is sequentially but under external synthetic load?

I don't think we have a built in way to do that, I normally just build Node in the background, seems to do the trick (not the Node I'm testing obviously 😁 ). Maybe running benchmarks at the same time would provide enough load?

@refack
Copy link
ContributorAuthor

@gibfahn sorry if I'm not RTFM but how do I trigger a stress test (like on the CI) locally?

@gibfahn
Copy link
Member

gibfahn commented Apr 28, 2017

@refacktools/test.py --repeat 1000 parallel will run the parallel test suite 1000 times.

Probably worth doing something like:

tools/test.py --repeat 1000 -p tap parallel/test-child-process-fork-regr-gh-2847 | tee out.tap

and then grepping for not ok.

You could use Powershell's Tee-Object on Windows.

@refack
Copy link
ContributorAuthor

Why would we? Moving a test to sequential just means it can't be run in parallel, which is a minor annoyance as it makes our test runs slightly slower. Duplicating the test seems unnecessary.

@gibfahn: @Trott's premise is that in sequential mode the exception exclusions are unnecessary (#12713). Although they are obviously needed in parallel mode. So having them tested in both modes tests this scenario under two different conditions. "Can't have too much testing"?
Obviously these reasoning should be well documented within the test.

P.S. @Trott you had a typo (sequential}/test-child-process-fork-regr-gh-2847) so I'm re-runing: https://ci.nodejs.org/job/node-stress-single-test/1186/

@gibfahn
Copy link
Member

"Can't have too much testing"?

I disagree. Every test should be testing something useful, otherwise you end up with a huge backlog of tests, and it's not sure what a test is actually for or why it failed.

So having them tested in both modes tests this scenario under two different conditions.

But EMFILE is only thrown very occasionally, and by ignoring this error as well we're not actually testing the EMFILE exception. If we think we should see this EMFILE error under some conditions, then we should try to get a test that reliably reproduces it, and the test for that probably doesn't need to also test this regression.

@Trott
Copy link
Member

Trott commented Apr 28, 2017

@refack tools/test.py --repeat 1000 parallel will run the parallel test suite 1000 times.

To stress test a parallel test locally, I also use -j to specify how many tests to run simultaneously.

$ tools/test.py --repeat=92 -j 92 parallel/test-foo-bar

@TrottTrott dismissed their stale reviewApril 28, 2017 14:53

Just got EMFILE in a sequential run, and reading the explanation linked above, it seems to be an unexpected if infrequent thing to happen in this test. Clearing my objection.

@Trott
Copy link
Member

Argh, can't edit the above (I don't think) but I meant to say expected if infrequent rather than unexpected if infrequent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sporius -> spurious

Copy link
Member

@gibfahngibfahnApr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:
- ECONNREFUSED or ECONNRESET errors can happen if this connection

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- If a previous request...

Basically can it be really clear which explanation is for which errors?

Copy link
Member

@gibfahngibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM modulo nits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2847 -> https://github.com/nodejs/node/pull/2847

a the -> a

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to connect to ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need the Description:

@refack
Copy link
ContributorAuthor

@gibfahn, commands addressed PTAL.

Copy link
Member

@santigimenosantigimeno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one nit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can happen if this connection is repeated twice

Copy link
Member

@jasnelljasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM with a couple of nits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment block should use the typical // style.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary whitespace change

@refack
Copy link
ContributorAuthor

Comments fixed, PTAL.

[joke]
You are all a bunch of nit-pickers
[/joke]
JK. Love you all 💌

Copy link
Member

@gibfahngibfahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are all a bunch of nit-pickers

Speaking of which...

(I see it as a game, can you minimize the number of nits you'll get per PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's -> its

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should be an issue, so:

https://github.com/nodejs/node/pull/2847->https://github.com/nodejs/node/issues/2847

Also can you order this as in the guide? Should look like:

'use strict';constcommon=require('../common');// Before https://github.com/nodejs/node/issues/2847 a child process trying// (asynchronously) to use the closed channel to it's creator caused a segfault.constassert=require('assert');

@refackrefackforce-pushed the more-for-10286 branch 3 times, most recently from ab5f12c to b372ff6CompareApril 28, 2017 23:48
According to the explanation in nodejs#3635#issuecomment-157714683 And as a continuation to nodejs#5422 we also ignore EMFILE "No more file descriptors are available,so no more files can be opened" PR-URL: nodejs#12698Fixes: nodejs#10286 Refs: nodejs#3635 (comment) Refs: nodejs#5178 Refs: nodejs#5179 Refs: nodejs#4005 Refs: nodejs#5121 Refs: nodejs#5422 Refs: nodejs#12621 (comment) Reviewed-By: Gibson Fahnestock <[email protected]> Reviewed-By: Santiago Gimeno <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
@refackrefack merged commit 6f21671 into nodejs:masterMay 19, 2017
@refack
Copy link
ContributorAuthor

Landed in 6f21671

@refack
Copy link
ContributorAuthor

@refackrefack deleted the more-for-10286 branch May 19, 2017 19:33
@jasnelljasnell mentioned this pull request May 28, 2017
@gibfahngibfahn mentioned this pull request Jun 15, 2017
3 tasks
MylesBorins pushed a commit that referenced this pull request Jul 17, 2017
According to the explanation in #3635#issuecomment-157714683 And as a continuation to #5422 we also ignore EMFILE "No more file descriptors are available,so no more files can be opened" PR-URL: #12698Fixes: #10286 Refs: #3635 (comment) Refs: #5178 Refs: #5179 Refs: #4005 Refs: #5121 Refs: #5422 Refs: #12621 (comment) Reviewed-By: Gibson Fahnestock <[email protected]> Reviewed-By: Santiago Gimeno <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Colin Ihrig <[email protected]>
@MylesBorins
Copy link
Contributor

I've landed this on v6.x... please lmk if this was a mistake

Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

child_processIssues and PRs related to the child_process subsystem.testIssues and PRs related to the tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent parallel/test-http-regr-gh-2821 failure on V6.9.2 (Windows)

9 participants

@refack@Trott@gibfahn@MylesBorins@jasnell@santigimeno@cjihrig@mscdex@nodejs-github-bot