zds: fix retrying a bad netns #1284

Merged
merged 1 commit into from
Aug 28, 2024

Conversation

howardjohn
Member

@howardjohn howardjohn commented Aug 28, 2024

Fixes istio/istio#52858

The issue here is this sequence of events:

1. Pod update: no IP assigned
2. Get CNI plugin event, Netns=UID1
   - Partial add error: no ztunnel connection
3. CNI event, Netns=UID2 (changed!)
   - Partial add error: no ztunnel connection
   - Sending pod to ztunnel as part of snapshot
   - ACK error: cannot assign requested address
4. CNI event, Netns=UID3 (changed!)
   - No other log from the CNI plugin, which I think implies it did not fail

Basically, we successfully start the proxy, but still have the same UID in the pending queue. When the pending entry retries, it kills the working one and then fails to start (since it has its own netns, which is bogus).

The fix is to remove a proxy from the pending queue when it is added.
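
The sequence above can be replayed against a toy model. This is only a sketch under stated assumptions, not ztunnel's actual code: `ManagerState`, `Netns`, `add_workload`, `retry_pending`, and the `drop_pending_on_success` switch are hypothetical names, and the model assumes that adding a workload first tears down any proxy already running for that UID. Only the `pending_workloads.remove(workload_uid)` call mirrors the change in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a network namespace handed over by the CNI plugin.
#[derive(Debug)]
struct Netns {
    id: String,
    /// Assumption: a stale netns can no longer host a proxy.
    usable: bool,
}

/// Toy model of the manager state; both maps are keyed by workload UID.
struct ManagerState {
    running: HashMap<String, Netns>,
    pending_workloads: HashMap<String, Netns>,
    /// Hypothetical switch so the old and new behaviour can be compared.
    drop_pending_on_success: bool,
}

impl ManagerState {
    fn new(drop_pending_on_success: bool) -> Self {
        ManagerState {
            running: HashMap::new(),
            pending_workloads: HashMap::new(),
            drop_pending_on_success,
        }
    }

    fn add_workload(&mut self, workload_uid: &str, netns: Netns) -> Result<(), String> {
        // Assumption: any proxy already running for this UID is stopped first.
        self.running.remove(workload_uid);
        if netns.usable {
            self.running.insert(workload_uid.to_string(), netns);
            if self.drop_pending_on_success {
                // The fix: a successful add clears any stale pending entry.
                self.pending_workloads.remove(workload_uid);
            }
            Ok(())
        } else {
            let err = format!("cannot start proxy in {}", netns.id);
            // A failed add is parked for a later retry with the same netns.
            self.pending_workloads.insert(workload_uid.to_string(), netns);
            Err(err)
        }
    }

    fn retry_pending(&mut self) {
        let retries: Vec<(String, Netns)> = self.pending_workloads.drain().collect();
        for (uid, netns) in retries {
            let _ = self.add_workload(&uid, netns);
        }
    }
}

/// Replays: failed add with a stale netns, successful add with a fresh one,
/// then a retry of the pending queue. Returns whether a proxy is still running.
fn proxy_survives_retry(drop_pending_on_success: bool) -> bool {
    let mut state = ManagerState::new(drop_pending_on_success);
    let stale = Netns { id: "netns-2".to_string(), usable: false };
    let fresh = Netns { id: "netns-3".to_string(), usable: true };
    let _ = state.add_workload("pod-a", stale);
    let _ = state.add_workload("pod-a", fresh);
    state.retry_pending();
    state.running.contains_key("pod-a")
}
```

In this model, `proxy_survives_retry(false)` loses the working proxy because the stale pending entry is retried after the successful add, while `proxy_survives_retry(true)` keeps it.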

@howardjohn howardjohn added the release-notes-none Indicates a PR that does not require release notes. label Aug 28, 2024
@istio-testing
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 28, 2024
@howardjohn
Member Author

/test all

@howardjohn howardjohn marked this pull request as ready for review August 28, 2024 17:51
@howardjohn howardjohn requested a review from a team as a code owner August 28, 2024 17:51
@howardjohn howardjohn force-pushed the zds/fix-retry-bad-netns branch from 2f4c915 to 5e90298 Compare August 28, 2024 17:51
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Aug 28, 2024
@@ -231,6 +231,8 @@ impl WorkloadProxyManagerState {
             .await
         {
             Ok(()) => {
+                // If the workload is already pending, make sure we drop it, so we don't retry.
+                self.pending_workloads.remove(workload_uid);
Contributor

I'm beginning to think we should probably key all of the workload stores with UID + NetNSID to avoid future goofiness like this.

Member Author

One nice part about keying by UID is that we will never have multiple Proxy instances running for a given pod. I would worry that if we had the netns ID as part of the key, we could somehow end up with leaked Proxy instances.

Might be possible, but needs care
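
A toy comparison of the two keying schemes, as a sketch only: the function names and the event tuples are hypothetical, and a real composite-key store could evict old entries explicitly, which is the "care" mentioned above.

```rust
use std::collections::{HashMap, HashSet};

/// Store keyed by workload UID: a new netns for the same pod replaces the old entry.
fn entries_keyed_by_uid(events: &[(&str, &str)]) -> usize {
    let mut store: HashMap<String, String> = HashMap::new();
    for &(uid, netns) in events {
        store.insert(uid.to_string(), netns.to_string());
    }
    store.len()
}

/// Store keyed by (UID, netns): every netns change adds an entry unless
/// something removes the old one explicitly.
fn entries_keyed_by_uid_and_netns(events: &[(&str, &str)]) -> usize {
    let mut store: HashSet<(String, String)> = HashSet::new();
    for &(uid, netns) in events {
        store.insert((uid.to_string(), netns.to_string()));
    }
    store.len()
}
```

For three CNI events on one pod with three different netns values, the UID-keyed store ends with one entry and the composite-keyed store ends with three, which is the leak being worried about here.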

Contributor

@bleggett bleggett left a comment

LGTM, nice find, ty!

@bleggett bleggett added the cherrypick/release-1.23 Set this label on a PR to auto-merge it to the release-1.23 branch label Aug 28, 2024
@istio-testing istio-testing merged commit 0cb7516 into istio:master Aug 28, 2024
3 checks passed
@istio-testing
Contributor

In response to a cherrypick label: new pull request created: #1285

Labels
cherrypick/release-1.23 Set this label on a PR to auto-merge it to the release-1.23 branch release-notes-none Indicates a PR that does not require release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Ztunnel fails with 'failed to bind to address [::1]:15053: Cannot assign requested address'
3 participants