Uh oh!
There was an error while loading. Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork 34.2k
Revert "net: enable autoSelectFamily by default"#48403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uh oh!
There was an error while loading. Please reload this page.
Conversation
This reverts commit 8b51c1a. Signed-off-by: Matteo Collina <[email protected]>
nodejs-github-bot commented Jun 9, 2023
Review requested:
|
mcollina commented Jun 9, 2023
cc @nodejs/releasers @nodejs/tsc This is technically semver-major but it's useful only if it is backported to Node v20. |
mcollina commented Jun 9, 2023
We might want to land this only to v20.x. |
nodejs-github-bot commented Jun 9, 2023
tniessen commented Jun 9, 2023
For context, there are related discussions in #47644, #48028, and #48145. @silverwind suggested reverting in #48145, and other collaborators have mentioned that they'd have preferred not to enable Happy Eyeballs in the first place. However, according to @ShogunPanda, all known bugs (aside from deviating from the RFC) are fixed in 20.3.0. Do the fastify issues persist with 20.3.0 @mcollina? |
ShogunPanda commented Jun 9, 2023
Yes, it persist with 20.3.0. |
ShogunPanda commented Jun 9, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
About having this enabled by default: I'm a bit conflicted on this. Regardless of the implementation, I wonder if will cause more harm than good to users. To clarify, it is good that we had lot of bugs reports and we were able to fix them. But I start thinking a feature like this has too many network implications that probably we should have it disable by default and always have user explicitly opt-in when they need it. |
silverwind commented Jun 9, 2023
My preference would be to revert only for v20 and working out the remaining issues on v21+ (#48145 would be a must-have for me). |
ShogunPanda commented Jun 9, 2023
I don't understand the parallel connection request. According to Happy Eyeballs RFC 8305, section 5, connections attempts SHOULD NOT be made in parallel. |
silverwind commented Jun 9, 2023
I think you are misinterpreting the RFC. It just says they should not start in parallel (meaning true concurrency, e.g. multiple CPUs), but they can and should be pending at the same time. Essentially the connections are raced against each other. At least that is the whole point of Happy Eyeballs to my understanding and Wikipedia seems to agree. |
ShogunPanda commented Jun 9, 2023
I'm not quite sure about it. The RFC complete sentence is: So it's not referring to just resources on the originating machine. If we implement like that, each new connection will result in at least two connection attempts. This will affect source machine, routers, ISP and so forth. |
tniessen commented Jun 9, 2023
I think the RFC authors mean that multiple connection attempts should not be started simultaneously. A good implementation should base the delay between attempts on the SYN retransmission, so at most two SYN packets would be sent at a time in a proper parallel implementation. @silverwind is right. The important sentence from the RFC is this:
|
richardlau commented Jun 9, 2023
The discussion around parallel connections should happen in #48145. As far as this PR goes, there are test failures, so presumably either more commits need to be reverted or additional commits to fix the tests. |
ShogunPanda commented Jun 10, 2023
@richardlau Sure, we will move the discussion there. @mcollina I checked your fastify failure and I made several tests using main branch. The problems seems not to be in family autoselection (as I manually disabled in the code) but in the presence of |
ShogunPanda commented Jun 10, 2023
I confirmed in private with @mcollina. The fastify failures are unrelated to family auto selection and it seems to be a weird combination between the tap ecosystem, Node and NODE_OPTIONS. Do we want to close this PR or do we still want to proceed? |
tniessen commented Jun 10, 2023
I don't have a strong opinion. The current implementation is flawed, apparently both due to technical reasons and because of disagreement w.r.t. how the RFC should be interpreted, and most importantly, it does introduce new network problems. On the other hand, it also solves some historic (and somehow still relevant) network problems. Myself and others would have preferred |
tniessen commented Jun 13, 2023
According to KararTY/dank-twitch-irc#13 (comment), the
Does this imply that you are in favor of disabling the feature @ShogunPanda? |
ShogunPanda commented Jun 13, 2023
@tniessen To be honest no. I checked KararTY/dank-twitch-irc#13 (comment) and I could not reproduce the issue. All the other bug reports you mentioned are lacking the version so it's impossible to see whether they were in 20.3.0 or not. Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this? |
tniessen commented Jun 13, 2023
Because each "annoying duplicate bug report", as you put it, represents at least one but potentially even a large number of negatively affected users, whereas we have no reliable metrics for the benefits of the feature. (To be fair, I am no expert in this matter so I tend to rely on others' expertise, and I have yet to meet someone more knowledgeable than Ben, for example, who was against enabling this by default.) |
ShogunPanda commented Jun 13, 2023
I see. |
joyeecheung commented Jun 13, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0. But in retrospect I think we should've reverted this a month ago. When we notice any large-scale regression, it's usually never really worth waiting for a non-trivial fix. Reverting and do a patch release ASAP and then wait for a fix to reland is almost always better. The more we wait the more users get frustrated and will have to add workarounds to their code/environment which are meant to go away. |
tniessen commented Jun 14, 2023
I couldn't agree more @joyeecheung. We should have reverted much earlier and postponed to Node.js 21, which also won't become LTS, or perhaps indefinitely.
It has been confirmed, see KararTY/dank-twitch-irc#13 (comment). But I guess the intention is to fix yet another bug and hope that that finally resolves the issue. |
ShogunPanda commented Jun 14, 2023
@tniessen I already fixed that bug. AFAIK there is only npm/cli#6409 (which I'm working on at the moment) and then I have nothing else reported. Am I missing any bug? |
tniessen commented Jun 14, 2023
I'm talking about the bug discussed in KararTY/dank-twitch-irc#13, which was supposed to be fixed in 20.3.0 :) |
ShogunPanda commented Jun 14, 2023
Lol, yes. I know. Coincidentally, two bugs failed in the same assertion. I fixed the first one and now I'm taking care of the second one. |
joyeecheung commented Jun 14, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
If we still have bugs that are pending a fix, I think the best thing to do is have a release ASAP that turns it off by default again. The more we wait the more users are going to be bothered by regressions. Improving the implementation of a new feature is not a good justification for regressions, IMO. (Do we really need a full revert like what this PR does though? It seems to me we just need to set the CLI option to false by default again) |
ShogunPanda commented Jun 14, 2023
@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately. Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore. |
RafaelGSS commented Jun 14, 2023
I don't have a strong opinion either, but It's unlikely to have this PR ready for this week and have it released due to the upcoming security release. Considering the main purpose of this issue was to prevent connectivity issues within IPV6, I feel this is something we'll need to have at some moment. As stated earlier in this thread, if we had to revert, we should have rolled back weeks ago. Reverting it now into a breaking change isn't great IMO. |
silverwind commented Jun 14, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
How about only disabling in on the v20 branch? I guess v21 can accept a bit more instability as you have seen that v19 barely got any real testing, and that will give plenty of time to fix the issues before v22. |
ShogunPanda commented Jun 14, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
That might be an acceptable solution, but I'm still pretty skeptical about the approach. I think we might be able to solve all the (major) issues in few weeks. I totally agree that we should have reverted this long ago probably (and I apologize about this), but now the pro and cons of a breaking change lean towards the cons. |
ShogunPanda commented Jun 15, 2023
FYI: I fixed all issue I'm aware of in #48464. |
joyeecheung commented Jun 15, 2023 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Why do we need a full revert just because it's renamed? We could also keep the name but just turn it off by default again?
Why would you think that would be the case? We could just turn it off and quickly release a patch, then land the fix and reenable it. If bug reports still come in, we do that again. Turning it on and off isn't ideal, but probably better than having a large-scale regression frustrating users for weeks. This happens a lot in Chromium/V8, for example, though they have a different release model (mostly Canary + Stable, but not too different from our Current + LTS I would say). That said now that the security release is in progress, it's probably too late to do this cycle now. But we should probably try something similar for large-scale regressions in the future. |
ShogunPanda commented Jun 19, 2023
It's not just the rename of the options. We also have to revert some tests that changed error behavior.
Well, all flipping is technically a breaking change. If we don't care too much then probably is not an issue. |
joyeecheung commented Jun 20, 2023
I think we could just enable the flag in those tests?
The point about not breaking is so that we don't have large-scale regressions. As this feature is very new I doubt anyone is currently relying on it being the default, meanwhile judging from the bug reports, there are definitely more people relying on it not regressing.
I don't see that as a problem when we are already sure about the errors. Again, we could just revert it while we work on known and certain bugs, instead of leaving users frustrated while we work on the bugs that we already know. We can just flip it back when the fix lands to see if the error still reproduces, Chromium/V8 has been doing this for many years and they have a much larger user base than ours. |
tniessen commented Jun 20, 2023
And yet, these problems have not been tackled quickly. Node.js 20 was released more than two months ago and #47644 was opened just a day later. Still, every 20.x release since then has contained bugs related to this feature that affect a large number of users. |
ShogunPanda commented Jun 20, 2023
I'm not sure where you are going with this. As soon as bugs started popping I promptly tried to fix all of them at my maximum speed. Unfortunately, unless we went for patch release (and, TBH, I don't even know what is the procedure for this), we had to wait for new release to be cut for people to benefit for it. |
ShogunPanda commented Jun 20, 2023
Good point. Yes, we can.
We can definitely do that. At the moment I've fixed all the bugs I had a report for. |
tniessen commented Jun 20, 2023
I think we all agree that we should have reverted many weeks ago. That's all I wanted to say :) |
lpinca commented Jul 23, 2023
It seems things are better now. I'm closing this. We can reopen it if needed. |
This reverts commit 8b51c1a.
I think we were too eager to turn this on. In the Fastify community, we have been trying to get our tests to pass on Node v20 for a month and a half with minimal luck. The only solution was to use
NODE_OPTIONS=no-network-family-autoselection.Note that tests spuriously fail even with v20.3.0, so the fixes are not complete. I was not able to reproduce the problem individually. My 2 cents is that there is a race condition somewhere triggered by many open sockets in a short period of time.