Fix regression for timed-out stream cleanups #102489

Merged (1 commit) on Jun 4, 2021

Conversation

saschagrunert
Member

@saschagrunert saschagrunert commented Jun 1, 2021

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

If a stream is already timed-out, then either the data or error stream
may be nil. This would cause a segmentation fault, which is now
covered with this patch.

Which issue(s) this PR fixes:

Fixes #102480

Special notes for your reviewer:

Has to be backported since the original PR got backported, too. :-/

Does this PR introduce a user-facing change?

Fixed a regression that can make kubelet runtime panic for timed-out portforward streams.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 1, 2021
@k8s-ci-robot k8s-ci-robot requested review from ncdc and thockin June 1, 2021 15:07
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 1, 2021
@saschagrunert
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jun 1, 2021
@saschagrunert
Member Author

/retest

@saschagrunert
Member Author

/assign @deads2k
please take a look

@fedebongio
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 1, 2021
Comment on lines +126 to +129
	// It may be possible that the provided stream is nil if timed out.
	if stream != nil {
		delete(c.streams, stream.Identifier())
	}
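
The guard in this hunk can be tried in isolation. The following is a minimal sketch, not the actual kubelet code: the `Stream` interface, `fakeStream`, `connection`, and `newConnection` are hypothetical stand-ins introduced for illustration, and the mutex is an assumption. It shows only that a nil entry is skipped while a real stream is still removed.

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a hypothetical stand-in for the stream type handled in the PR.
type Stream interface {
	Identifier() uint32
}

// fakeStream is an illustrative implementation that only carries an ID.
type fakeStream struct{ id uint32 }

func (s *fakeStream) Identifier() uint32 { return s.id }

// connection is a simplified, assumed model of a connection that tracks
// its streams by identifier.
type connection struct {
	mu      sync.Mutex
	streams map[uint32]Stream
}

func newConnection(streams ...Stream) *connection {
	c := &connection{streams: make(map[uint32]Stream)}
	for _, s := range streams {
		c.streams[s.Identifier()] = s
	}
	return c
}

// RemoveStreams deletes the given streams from the map and skips nil entries.
func (c *connection) RemoveStreams(streams ...Stream) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, stream := range streams {
		// A timed-out pair may carry a nil data or error stream.
		if stream != nil {
			delete(c.streams, stream.Identifier())
		}
	}
}

func main() {
	s0, s1 := &fakeStream{id: 0}, &fakeStream{id: 1}
	c := newConnection(s0, s1)
	c.RemoveStreams(s0, nil)    // the nil entry is skipped instead of panicking
	fmt.Println(len(c.streams)) // prints 1: only s1 is left
}
```

Without the `stream != nil` check, the call with a nil argument would invoke `Identifier()` on a nil interface value and panic.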
Member

I think adding a nil check at the call site would also be a good choice:

h.conn.RemoveStreams(pair.dataStream, pair.errorStream)

Member Author

Hey @wzshiming, thank you for the review! Do you mean that we should move the nil check over to the RemoveStreams() invocation in favor of checking here?

Member

it's fine to check in both places, but I want this check to stay here, since it prevents all mistakes.
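
For comparison, the caller-side variant discussed in this thread might look roughly like the sketch below. It is an illustration under assumptions, not the code in the PR: `streamPair` is modelled only by the two fields named in the review comment, and `nonNilStreams` is a hypothetical helper name.

```go
package main

import "fmt"

// Stream is a hypothetical stand-in for the stream type used by the handler.
type Stream interface {
	Identifier() uint32
}

type fakeStream struct{ id uint32 }

func (s *fakeStream) Identifier() uint32 { return s.id }

// streamPair is an assumed, simplified model holding the two streams
// mentioned in the review (dataStream and errorStream).
type streamPair struct {
	dataStream  Stream
	errorStream Stream
}

// nonNilStreams (hypothetical helper) returns only the streams that were
// actually set, so the caller never passes nil on.
func (p *streamPair) nonNilStreams() []Stream {
	var out []Stream
	for _, s := range []Stream{p.dataStream, p.errorStream} {
		if s != nil {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// A timed-out pair where only the data stream was ever created.
	pair := &streamPair{dataStream: &fakeStream{id: 3}}
	streams := pair.nonNilStreams()
	fmt.Println(len(streams)) // prints 1
	// A caller could then forward them, e.g. conn.RemoveStreams(streams...).
}
```

Filtering at the call site keeps nil values out of the call entirely, while the check inside `RemoveStreams` protects every caller; as noted above, the two are not mutually exclusive.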

Question: will the timed out stream not remain hanging around in the c.streams map, since the identity is no longer known when calling RemoveStreams?

I guess it gets cleaned up during Close, but since the unit test below does an explicit len(c.streams) check, I'm just wondering whether something else might use a similar check to determine whether the connection can be closed?

Member

Yeah that's an excellent question, good reason to do reverts rather than another round of cherry-picks.

Member Author

I tested it locally with a modified version of CRI-O (using this vendored code), and it looks like the connections are cleaned up once the timeout is reached.

Member

@ehashman ehashman left a comment

/priority critical-urgent
/lgtm

this is why I said we should be careful about that backport 😄

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 3, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2021
@dims
Member

dims commented Jun 3, 2021

/assign @liggitt

@liggitt
Member

liggitt commented Jun 3, 2021

> Has to be backported since the original PR got backported, too. :-/

It looks like the original got backported to 1.18 in #100954, so the final 1.18 release is broken?

@liggitt
Member

liggitt commented Jun 3, 2021

/unassign
/assign @lavalamp

@k8s-ci-robot k8s-ci-robot assigned lavalamp and unassigned liggitt Jun 3, 2021
@lavalamp
Member

lavalamp commented Jun 3, 2021

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2021
@lavalamp
Member

lavalamp commented Jun 3, 2021

I found the analysis here to be convincing.

@lavalamp
Member

lavalamp commented Jun 3, 2021

I'm on the fence between backporting this and just reverting the cherry-picks, I was on the fence about including them in the first place. If we didn't catch this with the test, there could be other problems.

@@ -323,6 +323,9 @@ func TestConnectionRemoveStreams(t *testing.T) {
	// remove all existing
	c.RemoveStreams(stream0, stream1)

	// remove nil stream should not crash
	c.RemoveStreams(nil)
Member

This is fine, but I'd be more confident with a test that reproduces the actual crash, that way we can be confident in the diagnosis and not just the fix.
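
As a rough illustration of what a crash-reproducing check could look like, the sketch below calls an unguarded removal with a nil stream and recovers the resulting panic. It is an assumption-laden model, not the kubelet code path: `removeUnguarded` only mimics the presumed pre-fix behavior (calling `Identifier()` without a nil check), and `panics` is a hypothetical helper.

```go
package main

import "fmt"

// Stream is a hypothetical stand-in for the stream type in the PR.
type Stream interface {
	Identifier() uint32
}

// removeUnguarded models the assumed pre-fix behavior: it calls
// Identifier() without checking whether the stream is nil.
func removeUnguarded(streams map[uint32]Stream, stream Stream) {
	delete(streams, stream.Identifier())
}

// panics (hypothetical helper) reports whether fn panicked.
func panics(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	streams := map[uint32]Stream{}
	// Calling a method on a nil interface value triggers a runtime panic.
	fmt.Println(panics(func() { removeUnguarded(streams, nil) })) // prints true
}
```

This only confirms that a nil stream crashes an unguarded removal; reproducing the real crash would additionally require driving a port-forward request into the timeout path, which this sketch does not attempt.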

@liggitt
Member

liggitt commented Jun 3, 2021

> I'm on the fence between backporting this and just reverting the cherry-picks, I was on the fence about including them in the first place. If we didn't catch this with the test, there could be other problems.

If the initial backports weren't fixing a regression, I'd vote for straight reverts. I prefer status quo from prior releases over unsoaked fixes for non-severe issues. I'm not sure what to do about the broken final 1.18 release.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@saschagrunert
Member Author

> > I'm on the fence between backporting this and just reverting the cherry-picks, I was on the fence about including them in the first place. If we didn't catch this with the test, there could be other problems.
>
> If the initial backports weren't fixing a regression, I'd vote for straight reverts. I prefer status quo from prior releases over unsoaked fixes for non-severe issues. I'm not sure what to do about the broken final 1.18 release.

Proposed the reverts in #102586, #102587 and #102588.

Two options for 1.18:

  • Leave it as-is, because it is end of life
  • Revert and create another patch

cc @kubernetes/sig-release-leads

@lavalamp
Member

lavalamp commented Jun 4, 2021

Let's propose a revert for 1.18 and then let the release team decide what to do.

@saschagrunert
Member Author

Started a slack discussion, closing the loop: https://kubernetes.slack.com/archives/C2C40FMNF/p1623051662213300

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

kubelet: Panic on portforward streams