Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dropped NodePort traffic to hostNetwork backends with Geneve+DSR #36978

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tommasopozzetti
Copy link

@tommasopozzetti tommasopozzetti commented Jan 14, 2025

This PR fixes DSR dispatch with Geneve to hostNetworked pods by ensuring tunnel_endpoint is always populated in ipcache, even for directly reachable endpoints. Unneeded encapsulation is avoided by passing the skiptunnel flag for such endpoints.

Fixes: #36901

@tommasopozzetti tommasopozzetti requested review from a team as code owners January 14, 2025 21:13
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 14, 2025
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli kind/community-contribution This was a contribution made by a community member. labels Jan 14, 2025
@julianwiedmann julianwiedmann self-requested a review January 15, 2025 07:43
@julianwiedmann julianwiedmann added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. labels Jan 15, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 15, 2025
@julianwiedmann julianwiedmann added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 15, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 15, 2025
@julianwiedmann
Copy link
Member

Thank you! I'll have a closer look later.

Also queued up #36982. Once that's merged, we can rebase this PR so that we get immediate feedback on the connectivity test.

Copy link
Contributor

@Artyop Artyop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@julianwiedmann
Copy link
Member

Also queued up #36982. Once that's merged, we can rebase this PR so that we get immediate feedback on the connectivity test.

Merged, I'll auto-rebase the branch...

@julianwiedmann julianwiedmann force-pushed the pr/fix-nodeport-to-hostnetns-dsr-geneve branch from ce585d8 to aa75d28 Compare January 15, 2025 14:40
@julianwiedmann
Copy link
Member

/ci-e2e-upgrade

@tommasopozzetti
Copy link
Author

Thanks for the quick look!
I see some connectivity test cases failed waiting for the new pod that gets introduced though not 100% sure how to go check more details about those setups. Any pointer would be appreciated, first time going through the cilium CI process! 😄

@tommasopozzetti
Copy link
Author

@julianwiedmann fixed the conflict. Any tips on next steps to get this in?

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments inline, before I forgot about the CLI compatibility aspect again ...

bpf/lib/nodeport.h Outdated Show resolved Hide resolved
bpf/lib/nodeport.h Outdated Show resolved Hide resolved
@julianwiedmann
Copy link
Member

@julianwiedmann fixed the conflict. Any tips on next steps to get this in?

In general - please rebase on-top of main, instead of merging main into your branch.

I see some connectivity test cases failed waiting for the new pod that gets introduced though not 100% sure how to go check more details about those setups.

IIRC the CLI tests for cluster-internal connectivity to all the nodeport services on startup, before running the connectivity tests. Is that what's failing? Let's re-run the workflow and find out.

@tommasopozzetti
Copy link
Author

@dylandreimerink I refactored the PR to a potential implementation of the above solution. Curious to hear any feedback.

Given the larger scope of this PR with respect to before, I have also split the original PR into two, splitting off the cilium-cli tests into #37235 so the two can be reviewed and merged separately

Copy link
Member

@jschwinger233 jschwinger233 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encryption parts LGTM.

(my personal feeling is, maybe we can split it into two commits, one for "always set tunnel_endpoint in ipcache", the other for "propagate endpointFlags")

@jschwinger233
Copy link
Member

/ci-e2e-upgrade

@tommasopozzetti
Copy link
Author

Happy to split up into two commits if we think that is preferred. I'll wait for some more feedback on the preference and execute if so!
I see 3 e2e test scenarios seem to have failed while all others succeeded. Any pointer on where to go look for more info on why that is and/or the configurations of such tests so I can reproduce them?
Thanks!

@dylandreimerink
Copy link
Member

I see 3 e2e test scenarios seem to have failed while all others succeeded. Any pointer on where to go look for more info on why that is and/or the configurations of such tests so I can reproduce them?

The e2e tests can be flaky, I see they made it on the 2nd try. Passing the ci-e2e-upgrade tests is a good indicator.

Happy to split up into two commits if we think that is preferred. I'll wait for some more feedback on the preference and execute if so!

Splitting the commits would be the nicest approach. The first adding the endpoint flags everywhere, and the second using that capability to implement the DSR fix.

@tommasopozzetti tommasopozzetti force-pushed the pr/fix-nodeport-to-hostnetns-dsr-geneve branch from bf84421 to 5121948 Compare January 28, 2025 05:57
@tommasopozzetti
Copy link
Author

Splitting the commits would be the nicest approach. The first adding the endpoint flags everywhere, and the second using that capability to implement the DSR fix.

Done! And thanks for the clarification on the flaky e2e tests!

@tommasopozzetti tommasopozzetti force-pushed the pr/fix-nodeport-to-hostnetns-dsr-geneve branch from 5121948 to fbcf9c5 Compare January 28, 2025 15:30
@tommasopozzetti
Copy link
Author

Fixed up the unit test

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tommasopozzetti Nice work!

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agent changes LGTM 👍

@tommasopozzetti
Copy link
Author

Hi @dylandreimerink, is there anything else that I can do to help push this through?

@dylandreimerink
Copy link
Member

We need a review from @doniacld and @sayboras, or someone else from the @cilium/envoy and @cilium/ipcache code owners.

@dylandreimerink
Copy link
Member

/test

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the changes look good for envoy related file (e.g. pkg/envoy/resources.go)

@dylandreimerink
Copy link
Member

:sigh: CI is failing, seems because this branch is now starting to lag behind main. Could you rebase onto main? That should resolve this.

This commit makes endpoint flags available for plumbing in the new
ipcache API, allowing to set them from higher in the stack

Signed-off-by: Tommaso Pozzetti <tommypozzetti@hotmail.it>
Always set tunnel_endpoint in ipcache, even for directly reachable
endpoints. This ensures DSR info is available for DSR with geneve
dispatch.
Use endpointFlags to set the skiptunnel flag for directly reachable
endpoints such that, even if overlay is enabled, we do not do
an unneeded encap.

Fixes: cilium#36901

Signed-off-by: Tommaso Pozzetti <tommypozzetti@hotmail.it>
@tommasopozzetti tommasopozzetti force-pushed the pr/fix-nodeport-to-hostnetns-dsr-geneve branch from fbcf9c5 to e0c8502 Compare February 5, 2025 15:18
@tommasopozzetti
Copy link
Author

@dylandreimerink Thanks! Done!
Hopefully now we can get it in before it lags too much behind again 😅

@dylandreimerink
Copy link
Member

/test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/loadbalancing Impacts load-balancing and Kubernetes service implementations cilium-cli This PR contains changes related with cilium-cli feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. kind/community-contribution This was a contribution made by a community member. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch release-note/bug This PR fixes an issue in a previous release of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium with geneve and DSR drops external traffic to host network backends
8 participants