Fix regression for timed-out stream cleanups #102489
Conversation
If a stream has already timed out, then either the data or error stream may be `nil`. This would cause a segmentation fault, which is now covered by this patch. Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
/sig node
/retest
/assign @deads2k
/triage accepted
// It may be possible that the provided stream is nil if timed out.
if stream != nil {
	delete(c.streams, stream.Identifier())
}
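For readers without the surrounding file open, here is a minimal sketch of how this guard might sit inside a RemoveStreams implementation. The three quoted lines come from the diff; the package name, connection struct, lock, and map layout are assumptions for illustration, not the repository's exact code.

```go
package spdy

import (
	"sync"

	"k8s.io/apimachinery/pkg/util/httpstream"
)

// connection is an assumed stand-in for the real SPDY connection type;
// only the guarded delete below is taken from the diff.
type connection struct {
	streamLock sync.Mutex
	streams    map[uint32]httpstream.Stream
}

// RemoveStreams drops the given streams from the tracking map. A nil
// stream (for example, one whose creation already timed out) is skipped,
// since calling Identifier() on a nil interface value would panic.
func (c *connection) RemoveStreams(streams ...httpstream.Stream) {
	c.streamLock.Lock()
	defer c.streamLock.Unlock()
	for _, stream := range streams {
		// It may be possible that the provided stream is nil if timed out.
		if stream != nil {
			delete(c.streams, stream.Identifier())
		}
	}
}
```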
I feel that adding a check on the caller's side is also a good choice:
h.conn.RemoveStreams(pair.dataStream, pair.errorStream)
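A hedged sketch of that caller-side variant, built only from the call shape in the quoted line; the package name, the streamRemover interface, and the helper are hypothetical illustration, not the actual port-forward code.

```go
package portforward

import "k8s.io/apimachinery/pkg/util/httpstream"

// streamRemover is a hypothetical interface standing in for whatever
// concrete type h.conn has; only the RemoveStreams call shape from the
// quoted line is assumed.
type streamRemover interface {
	RemoveStreams(streams ...httpstream.Stream)
}

// removeNonNilStreams sketches the caller-side check: drop nil streams
// (for example, ones that already timed out) before handing the rest to
// RemoveStreams, so the connection never receives a nil entry even
// without its own guard.
func removeNonNilStreams(conn streamRemover, candidates ...httpstream.Stream) {
	streams := make([]httpstream.Stream, 0, len(candidates))
	for _, s := range candidates {
		if s != nil {
			streams = append(streams, s)
		}
	}
	conn.RemoveStreams(streams...)
}
```

A call site would then look like removeNonNilStreams(h.conn, pair.dataStream, pair.errorStream).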
Hey @wzshiming, thank you for the review! Do you mean that we should move the nil check over to the `RemoveStreams()` invocation instead of checking here?
it's fine to check in both places, but I want this check to stay here, since it prevents all mistakes.
Question: will the timed-out stream not remain hanging around in the `c.streams` map, since its identity is no longer known when calling `RemoveStreams`?
I guess it gets cleaned up during `Close`, but since the unit test below does an explicit `len(c.streams)` check, I'm just wondering if something else might use a similar check to determine whether the connection can be closed?
Yeah, that's an excellent question, and a good reason to do reverts rather than another round of cherry-picks.
Tested it locally with a modified version of CRI-O (using this vendored code), and it looks like the connections get cleaned up when the timeout is reached.
/priority critical-urgent
/lgtm
this is why I said we should be careful about that backport 😄
/assign @liggitt
it looks like the original got backported to 1.18 in #100954, so the final 1.18 release is broken?
/unassign
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: lavalamp, saschagrunert. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I found the analysis here to be convincing.
I'm on the fence between backporting this and just reverting the cherry-picks; I was on the fence about including them in the first place. If we didn't catch this with the test, there could be other problems.
@@ -323,6 +323,9 @@ func TestConnectionRemoveStreams(t *testing.T) {
	// remove all existing
	c.RemoveStreams(stream0, stream1)

	// remove nil stream should not crash
	c.RemoveStreams(nil)
This is fine, but I'd be more confident with a test that reproduces the actual crash; that way we can be confident in the diagnosis and not just the fix.
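A hedged sketch of what such a crash-reproducing test could look like, mixing a nil stream with a registered one the way the timed-out caller path does; the fakeStream type and the connection wiring reuse the assumed types from the sketch above and are not the repository's actual test helpers.

```go
package spdy

import (
	"testing"

	"k8s.io/apimachinery/pkg/util/httpstream"
)

// fakeStream is an assumed stand-in whose only implemented method is
// Identifier(), which is all RemoveStreams needs in the sketch above.
type fakeStream struct {
	httpstream.Stream
	id uint32
}

func (f *fakeStream) Identifier() uint32 { return f.id }

func TestConnectionRemoveStreamsTimedOut(t *testing.T) {
	// Assumed wiring: one tracked stream, registered directly in the map;
	// the real test would build this through the SPDY connection itself.
	s := &fakeStream{id: 1}
	c := &connection{streams: map[uint32]httpstream.Stream{s.id: s}}

	// One half of the stream pair timed out and is nil, mirroring the
	// h.conn.RemoveStreams(pair.dataStream, pair.errorStream) call when a
	// stream never arrived. Without the nil guard this panics.
	c.RemoveStreams(nil, s)

	if len(c.streams) != 0 {
		t.Fatalf("expected 0 tracked streams, got %d", len(c.streams))
	}
}
```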
If the initial backports weren't fixing a regression, I'd vote for straight reverts. I prefer status quo from prior releases over unsoaked fixes for non-severe issues. I'm not sure what to do about the broken final 1.18 release.
/retest
Proposed the reverts in #102586, #102587 and #102588. Two options for 1.18:
cc @kubernetes/sig-release-leads
Let's propose a revert for 1.18 and then let the release team decide what to do.
Started a slack discussion, closing the loop: https://kubernetes.slack.com/archives/C2C40FMNF/p1623051662213300
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
If a stream has already timed out, then either the data or error stream may be `nil`. This would cause a segmentation fault, which is now covered by this patch.
Which issue(s) this PR fixes:
Fixes #102480
Special notes for your reviewer:
Has to be backported since the original PR got backported, too. :-/
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: