Skip to content

Conversation

@gibfahn
Copy link
Member

@gibfahngibfahn commented Sep 15, 2016

Checklist
  • make -j4 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test

Description of change

Closes the server as soon as the child receives a message

I'm not sure that this is the correct fix, but I've tried this with some debugging information (see this gist) and both parent and child do seem to be receiving a message on AIX.

Fixes: #8271

Refs:
Previous PR (to fix on SmartOS): nodejs/node-v0.x-archive#8045
Previous Issue (ditto): nodejs/node-v0.x-archive#8046
Original Commit: bdf7ac2
Original PR: nodejs/node-v0.x-archive#4864

@nodejs-github-botnodejs-github-bot added the test Issues and PRs related to the tests. label Sep 15, 2016
@gibfahn
Copy link
MemberAuthor

Failure on AIX (before)

 /home/gib/node $ tools/test.py --repeat=10 --timeout=5 parallel/test-dgram-debug === release test-dgram-debug === Path: parallel/test-dgram-debug ||| Child server Parent Listening child msg 1 = server parentGotMessage: 0 childGotMessage: 0 === Sending msg 1 ### Parent server:{address: '127.0.0.1', family: 'IPv4', port: 42024 } Child msg 2 = 1 parent: gotMessage from child parentGotMessage: 0 childGotMessage: 1 === Sending msg 2 Child msg 3 = 2 parentGotMessage: 0 childGotMessage: 1 === Sending msg 3 Child msg 4 = 3 parentGotMessage: 0 childGotMessage: 1 === Sending msg 4 Child msg 5 = 4 parentGotMessage: 0 childGotMessage: 1 === Sending msg 5 Child msg 6 = 5 parentGotMessage: 0 childGotMessage: 1 === Sending msg 6 Child msg 7 = 6 parentGotMessage: 0 childGotMessage: 1 === Sending msg 7 Child msg 8 = 7 parentGotMessage: 0 childGotMessage: 1 === Sending msg 8 Child msg 9 = 8 parentGotMessage: 0 childGotMessage: 1 === Sending msg 9 Child msg 10 = 9 parentGotMessage: 0 childGotMessage: 1 === Sending msg 10 Child msg 1000 = 999 parentGotMessage: 0 childGotMessage: 1 === Sending msg 1000 Child msg 2000 = 1999 parentGotMessage: 0 childGotMessage: 1 === Sending msg 2000 Child msg 3000 = 2999 parentGotMessage: 0 childGotMessage: 1 === Sending msg 3000 Child msg 4000 = 3999 parentGotMessage: 0 childGotMessage: 1 === Sending msg 4000 Command: out/Release/node /home/gib/node/test/parallel/test-dgram-debug.js --- TIMEOUT --- 

Passing on AIX (after)

=== release test-dgram-debug [negative] === Path: parallel/test-dgram-debug ||| Child server Parent Listening parentGotMessage: 0 childGotMessage: 0 === Sending msg 1 ### Parent server: child msg 1 = server{address: '127.0.0.1', family: 'IPv4', port: 42142 } Child msg 2 = 1 parent: gotMessage from child parentGotMessage: 0 childGotMessage: 1 === Sending msg 2 parent msg 1 = 2 parentGotMessage: 1 childGotMessage: 1 === Sending msg 3 Message received by child && parent Shutting down child msg 3 = stop Child: stop message received Command: out/Release/node /home/gib/node/test/parallel/test-dgram-debug.js 

@mscdexmscdex added the dgram Issues and PRs related to the dgram subsystem / UDP. label Sep 15, 2016
@santigimeno
Copy link
Member

Not too sure. If the intention of the test was that messages reach both the parent and the child while both are listening, then probably the fix is not correct. If not, probably it is.

@santigimenosantigimeno added the child_process Issues and PRs related to the child_process subsystem. label Sep 15, 2016
@gibfahn
Copy link
MemberAuthor

@santigimeno Right, I agree. The strange thing is that I haven't managed to reproduce the issue I'm seeing on any other machines (except the community AIX ones). The test runs fine on all our AIX boxes.

Looking at nodejs/node-v0.x-archive#8045 (comment), it seems that this change shouldn't make any difference.

cc. @AndreasMadsen@misterdjules

@AndreasMadsen
Copy link
Member

Not too sure. If the intention of the test was that messages reach both the parent and the child while both are listening, then probably the fix is not correct. If not, probably it is.

Agreed.

@misterdjules
Copy link

Testing that UDP messages are distributed among all processes from a given set of processes whose members do not change over time and that share a UDP socket seems inherently brittle. We might be able to decrease the likeliness of such failures by tweaking various settings (number of messages, number of child processes, etc.) but it seems the results will always depend on the underlying platform.

Changing the state (preventing them from being able to receive messages) of the processes as soon as they receive one message like this PR does changes the nature of this test, so it would not be acceptable if we wanted the test to keep its original purpose.

I see two possible ways forward:

  1. Mark this test as flaky on AIX, and watch for flaky tests that fail all the time to identify actual problems.
  2. Change the nature of this test to just test that we can share a UDP socket among two related processes, and not test that the load is distributed. We could thus just check that UDP messages sent to a single endpoint are eventually received by both processes and make it reliable by making the first process that received a message stop receiving messages.

I think I have a preference for 2), and it seems in line with the current caveats of using the cluster module without using the round robin scheduler, where one cannot rely on the load being distributed evenly (or in any way) among worker processes.

@gibfahn
Copy link
MemberAuthor

What I'm not sure about is whether we can expect the load to be at all distributed between the two processes.

I'd suggest that there are two possibilities (mirroring those in #8549 (comment)):

  1. It's possible that there is a subtle problem on AIX (possibly due to a race condition) which means that the parent isn't properly connecting sometimes, which is why all the messages (~60,000) go to the child not the parent. If this is true then the test is useful as is (and the AIX failures are a symptom of a real condition).
  2. It's also possible that we can't expect the load to be distributed, so it's fine for one process to receive all the messages. In this case the test needs to be changed to reflect the underlying behaviour (as suggested by @misterdjulesabove).

@AndreasMadsen I wasn't sure from #8549 (comment) whether we can expect messages to be distributed to both child and parent. Could you clarify? I noticed that you said in nodejs/node-v0.x-archive#8045 (comment) that:

The purpose of this test is to ensure at that there is some level of distributed behavior.

I also had a look at what server.close() does, and (assuming I'm looking in the right place), server.close() will close the socket directly. If this is true then I'm not sure how the modified version of this test in the PR actually works (the parent hasn't received any messages, the child closes the server, then the parent receives a message).

I also remember that calling server.close() will close the entire server across all workers.

@AndreasMadsen
Copy link
Member

whether we can expect messages to be distributed to both child and parent

We can definitely expect that, that is what is being tested. But I guess it could be platform dependent.

If this is true then I'm not sure how the modified version of this test in the PR actually works (the parent hasn't received any messages, the child closes the server, then the parent receives a message).

I wouldn't trust the documentation for how cluster UDP acts, since it's a rarely used feature.


To be honest I don't remember much, I worked with this stuff 4 years ago, but haven't since. Sorry :/

@mhdawson
Copy link
Member

Somehow I missed this PR so I created #8271. Looking at the discussion here I think it does option 2) as suggested by @misterdjules which would also be my preference.

It also fixes the issue of the test leaving processes hanging around.

@mhdawson
Copy link
Member

@gibfahn I don't think "which means that the parent isn't properly connecting sometimes, " as which my version we do see both getting at least one message.

@mhdawson
Copy link
Member

In terms of: "server.close() does, and (assuming I'm looking in the right place), server.close() will close the socket directly."

I think that after a fork each process has its own handle to the share socket and can close their handle independently which would explain the behavior seen.

@gibfahn
Copy link
MemberAuthor

Closing in favour of #8697 which does the same thing but closes the parent on received message as well (and stops processes being left around if the test fails).

@gibfahngibfahn closed this Sep 21, 2016
@gibfahngibfahn deleted the test-dgram-pr branch December 16, 2016 11:15
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

child_processIssues and PRs related to the child_process subsystem.dgramIssues and PRs related to the dgram subsystem / UDP.testIssues and PRs related to the tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test-child-process-fork-dgram Intermittent timeout on AIX

7 participants

@gibfahn@santigimeno@AndreasMadsen@misterdjules@mhdawson@mscdex@nodejs-github-bot