
Implement a kube-proxy conntrack reconciler #126130

Closed
aojea opened this issue Jul 16, 2024 · 37 comments · Fixed by #127318
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@aojea
Member

aojea commented Jul 16, 2024

What would you like to be added?

We'd like to have a conntrack reconciler for Services in kube-proxy: on each Service or Endpoint change it should reconcile the Service/Endpoint state against the existing conntrack table and remove stale UDP entries (we have concluded in other issues that UDP is the only protocol that requires this).

Why is this needed?

Despite all the work and testing added to solve the problem of stale UDP entries that can blackhole traffic, we still get reports of issues related to stale UDP entries:

#125467

Also, the conntrack logic is event-based and, as expected, if something fails it never reconciles:

#112604

The reconciler will only delete the entries that are known to be stale, other entries will time out.
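
For illustration only, a minimal Go sketch of the shape of that check, under the assumption that the reconciler works from a dumped list of flows; flow, staleUDPFlows, serviceIPs and servingEndpoints are hypothetical names, not the actual kube-proxy types:

package reconciler

import "net"

// flow is a simplified view of a conntrack entry.
type flow struct {
	origDst  net.IP // Service IP the client originally targeted
	replySrc net.IP // backend the packet was DNATed to (equals origDst when no DNAT happened)
}

// staleUDPFlows returns the entries the reconciler should delete: flows headed
// to one of our Service IPs whose DNAT target is no longer a serving endpoint.
// Everything else is left alone and simply times out.
func staleUDPFlows(flows []flow, serviceIPs, servingEndpoints map[string]bool) []flow {
	var stale []flow
	for _, f := range flows {
		if !serviceIPs[f.origDst.String()] {
			continue // not one of our Service IPs: never touch it
		}
		if !servingEndpoints[f.replySrc.String()] {
			stale = append(stale, f) // points at a dead endpoint, would blackhole traffic
		}
	}
	return stale
}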

References

Some WIP I had https://github.com/kubernetes/kubernetes/compare/master...aojea:kubernetes:conntrack_done_right?expand=1

Related issues that will be fixed

@aojea aojea added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 16, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 16, 2024
@aojea
Member Author

aojea commented Jul 16, 2024

/sig network
/assign @aroradaman
/cc @danwinship @thockin

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 16, 2024
@aojea
Member Author

aojea commented Jul 17, 2024

It is important to understand the performance implications, as listing all the conntrack entries is expensive. I remember talking with Florian Westphal (kernel developer) about some way of doing server-side filtering in the kernel, to avoid having to dump the entire conntrack table and filter flow by flow, but I don't know what the state of that is.

@terinjokes

The netlink interface to conntrack has the ability to get specific flows and dump all flows that match a given mark and mask.
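
For reference, a minimal sketch of a dump-and-filter from Go with github.com/vishvananda/netlink, assuming its ConntrackTableList call and the Mark field on ConntrackFlow; the 0x80/0x80 mark/mask is an arbitrary example, and the matching here happens in userspace after a full dump rather than server-side in the kernel:

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Dump the IPv4 conntrack table over netlink.
	flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, unix.AF_INET)
	if err != nil {
		panic(err)
	}
	// Keep only the flows whose ct mark matches an example mark/mask.
	const mark, mask = uint32(0x80), uint32(0x80)
	for _, f := range flows {
		if f.Mark&mask == mark {
			fmt.Println(f)
		}
	}
}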

@thockin
Member

thockin commented Jul 18, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 18, 2024
@aojea aojea changed the title kube-proxy conntrack reconciler Implement a kube-proxy conntrack reconciler Jul 18, 2024
@aojea
Member Author

aojea commented Jul 31, 2024

/kind bug

Although I added the feature label, this is basically rewriting the existing logic to fix the limitations of the current implementation.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 31, 2024
@shaneutt
Member

shaneutt commented Aug 1, 2024

Hi @aroradaman! Just checking in, as we talked about this one on the SIG network sync today: Have any questions or need any support with this one?

@aroradaman
Member

Hi @aroradaman! Just checking in, as we talked about this one on the SIG network sync today: Have any questions or need any support with this one?

I was waiting for vishvananda/netlink#989 to be merged; I will resume the work on this now.

@adrianmoisey
Member

Prior to Kubernetes 1.28 we were bitten by a conntrack issue on UDP services with iTP set to Local.
When the destination Pod was recreated the conntrack entry persisted.

We fixed this by switching iTP to Cluster.
Since Kubernetes 1.29 (on GKE) we've started getting this problem again, but now when the Service's iTP is set to Cluster.
But only when a node-upgrade happens (I can't reproduce it any other way).

So I appreciate you working on the reconciler. I'm looking forward to this problem going away.

@adrianmoisey
Member

adrianmoisey commented Aug 15, 2024

I've been trying to get my head around this problem some more. In AWS (with the AWS CNI) on Kubernetes v1.29.6 (installed by kOps) I seem to be seeing a behaviour where a conntrack entry is being left behind by the CNI.

Our situation is that we have multiple pods: many clients and a single server (the server is a DaemonSet).
The clients are streaming UDP over 8125 to the server.
The clients are using a ClusterIP service, with the iTP set to Local.

When everything is working, I see a conntrack entry:

udp      17 30 src=172.24.37.89 dst=100.69.132.196 sport=39931 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=39931 mark=0 use=1

When the server pod is deleted, I see these kube-proxy logs:

I0815 11:40:56.724267      11 cleanup.go:63] "Deleting conntrack stale entries for services" IPs=[]
I0815 11:40:56.724290      11 cleanup.go:69] "Deleting conntrack stale entries for services" nodePorts=[]
I0815 11:40:56.724374      11 conntrack.go:66] "Clearing conntrack entries" parameters=["-D","--orig-dst","100.69.132.196","--dst-nat","172.24.57.127","-p","udp"]
I0815 11:40:56.781317      11 conntrack.go:71] "Conntrack entries deleted" output=<
	conntrack v1.4.7 (conntrack-tools): 7 flow entries have been deleted.
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=59909 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=59909 mark=0 use=1
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=60287 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=60287 mark=0 use=1
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=55964 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=55964 mark=0 use=1
	udp      17 30 src=172.24.37.89 dst=100.69.132.196 sport=41937 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=41937 mark=0 use=52
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=50068 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=50068 mark=0 use=1
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=39931 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=39931 mark=0 use=1
	udp      17 28 src=172.24.37.89 dst=100.69.132.196 sport=42561 dport=8125 [UNREPLIED] src=172.24.57.127 dst=172.24.37.89 sport=8125 dport=42561 mark=0 use=1
 >

After this time, I have the following conntrack entry:

udp      17 30 src=172.24.37.89 dst=100.69.132.196 sport=41937 dport=8125 [UNREPLIED] src=100.69.132.196 dst=172.24.33.81 sport=8125 dport=1267 mark=128 use=3

(notice the mark=128)

Later on, when the new server Pod has started, I see the following:

I0815 11:41:26.590933      11 cleanup.go:63] "Deleting conntrack stale entries for services" IPs=[]
I0815 11:41:26.590965      11 cleanup.go:69] "Deleting conntrack stale entries for services" nodePorts=[]

I'm not too familiar with this entire process, but it seems like the AWS-CNI is adding a conntrack entry, which isn't being cleaned up by kube-proxy.

I'm not sure if the new reconciler will take this situation into account, so I thought I'd mention it.

@adrianmoisey
Member

Same setup as before, but in GCP with GKE v1.29.6-gke.1326000:

The clients are streaming UDP over 8125 to the server.
The clients are using a ClusterIP service, with the iTP set to Local.

When everything is working, I see a conntrack entry:

udp      17 30 src=10.11.4.5 dst=10.12.201.0 sport=48931 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=48931 mark=0 use=1

When the server pod is deleted, I see these kube-proxy logs:

I0815 12:15:14.624545       1 cleanup.go:63] "Deleting conntrack stale entries for services" IPs=[]
I0815 12:15:14.624580       1 cleanup.go:69] "Deleting conntrack stale entries for services" nodePorts=[]
I0815 12:15:14.624646       1 conntrack.go:66] "Clearing conntrack entries" parameters=["-D","--orig-dst","10.12.201.0","--dst-nat","10.11.4.22","-p","udp"]
I0815 12:15:14.637712       1 conntrack.go:71] "Conntrack entries deleted" output=<
	udp      17 23 src=10.11.4.30 dst=10.12.201.0 sport=34152 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.30 sport=8125 dport=34152 mark=0 use=1
	udp      17 26 src=10.11.4.38 dst=10.12.201.0 sport=50751 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.38 sport=8125 dport=50751 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=48931 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=48931 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=42608 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=42608 mark=0 use=1
	udp      17 21 src=10.11.4.56 dst=10.12.201.0 sport=50648 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.56 sport=8125 dport=50648 mark=0 use=1
	udp      17 21 src=10.11.4.64 dst=10.12.201.0 sport=60746 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.64 sport=8125 dport=60746 mark=0 use=1
	udp      17 18 src=10.11.4.20 dst=10.12.201.0 sport=40877 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.20 sport=8125 dport=40877 mark=0 use=1
	udp      17 20 src=10.11.4.55 dst=10.12.201.0 sport=56370 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.55 sport=8125 dport=56370 mark=0 use=1
	udp      17 12 src=10.11.4.64 dst=10.12.201.0 sport=37004 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.64 sport=8125 dport=37004 mark=0 use=1
	udp      17 28 src=10.11.4.50 dst=10.12.201.0 sport=57899 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.50 sport=8125 dport=57899 mark=0 use=1
	udp      17 29 src=10.11.4.113 dst=10.12.201.0 sport=58278 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.113 sport=8125 dport=58278 mark=0 use=1
	udp      17 26 src=10.11.4.8 dst=10.12.201.0 sport=40499 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.8 sport=8125 dport=40499 mark=0 use=1
	udp      17 20 src=10.11.4.67 dst=10.12.201.0 sport=48386 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.67 sport=8125 dport=48386 mark=0 use=1
	udp      17 29 src=10.11.4.50 dst=10.12.201.0 sport=55565 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.50 sport=8125 dport=55565 mark=0 use=1
	udp      17 29 src=10.11.4.61 dst=10.12.201.0 sport=36647 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.61 sport=8125 dport=36647 mark=0 use=1
	udp      17 29 src=10.11.4.59 dst=10.12.201.0 sport=52344 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.59 sport=8125 dport=52344 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=35652 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=35652 mark=0 use=1
	udp      17 22 src=10.11.4.145 dst=10.12.201.0 sport=46692 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.145 sport=8125 dport=46692 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=43098 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=43098 mark=0 use=1
	udp      17 29 src=10.11.4.59 dst=10.12.201.0 sport=51314 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.59 sport=8125 dport=51314 mark=0 use=1
	udp      17 29 src=10.11.4.70 dst=10.12.201.0 sport=48682 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.70 sport=8125 dport=48682 mark=0 use=1
	udp      17 20 src=10.11.4.65 dst=10.12.201.0 sport=50380 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.65 sport=8125 dport=50380 mark=0 use=1
	udp      17 16 src=10.11.4.217 dst=10.12.201.0 sport=44527 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.217 sport=8125 dport=44527 mark=0 use=1
	udp      17 21 src=10.11.4.66 dst=10.12.201.0 sport=37202 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.66 sport=8125 dport=37202 mark=0 use=1
	udp      17 24 src=10.11.4.31 dst=10.12.201.0 sport=38039 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.31 sport=8125 dport=38039 mark=0 use=1
	udp      17 22 src=10.11.4.69 dst=10.12.201.0 sport=39822 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.69 sport=8125 dport=39822 mark=0 use=1
	udp      17 29 src=10.11.4.143 dst=10.12.201.0 sport=52499 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.143 sport=8125 dport=52499 mark=0 use=1
	udp      17 21 src=10.11.4.64 dst=10.12.201.0 sport=46370 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.64 sport=8125 dport=46370 mark=0 use=1
	conntrack v1.4.7 (conntrack-tools): 37 flow entries have been deleted.
	udp      17 1 src=10.11.4.27 dst=10.12.201.0 sport=41082 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.27 sport=8125 dport=41082 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=40601 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=40601 mark=0 use=1
	udp      17 24 src=10.11.4.64 dst=10.12.201.0 sport=57893 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.64 sport=8125 dport=57893 mark=0 use=1
	udp      17 27 src=10.11.4.2 dst=10.12.201.0 sport=32932 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.2 sport=8125 dport=32932 mark=0 use=1
	udp      17 16 src=10.11.4.64 dst=10.12.201.0 sport=47022 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.64 sport=8125 dport=47022 mark=0 use=1
	udp      17 28 src=10.11.4.5 dst=10.12.201.0 sport=40394 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=40394 mark=0 use=1
	udp      17 29 src=10.11.4.61 dst=10.12.201.0 sport=47876 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.61 sport=8125 dport=47876 mark=0 use=1
	udp      17 29 src=10.11.4.24 dst=10.12.201.0 sport=40378 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.24 sport=8125 dport=40378 mark=0 use=1
	udp      17 30 src=10.11.4.5 dst=10.12.201.0 sport=46634 dport=8125 [UNREPLIED] src=10.11.4.22 dst=10.11.4.5 sport=8125 dport=46634 mark=0 use=5
 >

Once the Pod starts up again, I have the incorrect conntrack entry:

udp      17 30 src=10.11.4.5 dst=10.12.201.0 sport=46634 dport=8125 [UNREPLIED] src=10.12.201.0 dst=10.11.4.5 sport=8125 dport=46634 mark=0 use=5

(notice the src=10.12.201.0 on the right)

If I delete it, a new one appears:

udp      17 30 src=10.11.4.5 dst=10.12.201.0 sport=46634 dport=8125 [UNREPLIED] src=10.11.4.23 dst=10.11.4.5 sport=8125 dport=46634 mark=0 use=1

I'm unsure what is causing that incorrect conntrack entry; I assume this is the blackhole route problem that has been plaguing kube-proxy.

@aojea
Member Author

aojea commented Aug 21, 2024

In 1.28 there was a refactor of the conntrack logic (#116104), but I think it is correct, and I couldn't find anything suspicious.

(notice the src=10.12.201.0 on the right)

No NAT is performed, so it keeps the original IP.

If I delete it, a new one appears:

That is the client 10.11.4.5 connecting to the service IP 10.12.201.0 on port 8125, DNATed to 10.11.4.23 port 8125.

We have a test that exercises this scenario; at the moment the new pod is created (in your case the one with 10.11.4.23), kube-proxy should delete that entry.

I'm unsure what is causing that incorrect conntrack entry; I assume this is the blackhole route problem that has been plaguing kube-proxy.

@adrianmoisey is this a reproducer or a hypothesis? if you have a repro in gke let me know so I can take a look

@adrianmoisey
Member

In 1.28 there was a refactor of the conntrack logic (#116104), but I think it is correct, and I couldn't find anything suspicious.

We noticed a behaviour change between 1.28 and 1.29, which I suspect may be related to #119394

@adrianmoisey is this a reproducer or a hypothesis? if you have a repro in gke let me know so I can take a look

I can reproduce in GKE v1.29.6-gke.1326000

@aroradaman
Member

Linking the PR to remove the dependency on the conntrack binary here: #126847

@aroradaman
Member

aroradaman commented Aug 27, 2024

With vishvananda/netlink#989 we can delete all the conntrack entries in a single GET/DUMP of the conntrack table, reducing the number of netlink calls. That will require a change in conntrack.Interface to have a single generic method to clear all entries; should I include that in this PR or a new one after this gets in?

// Interface for dealing with conntrack
type Interface interface {
	// ClearEntriesForFlow deletes conntrack entries for the given flow filters.
	ClearEntriesForFlow(filters ...conntrackFilter) error
}
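
As a rough sketch (not the actual kube-proxy implementation) of how such a method could sit on top of the library, assuming its ConntrackFilter and ConntrackDeleteFilter API; clearStaleUDPEntries is a hypothetical helper, and the change from vishvananda/netlink#989 is what would let several such filters share a single table dump as described above:

package conntrack

import (
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// clearStaleUDPEntries deletes UDP flows whose original destination is the
// Service IP and whose reply source (the DNAT target) is a stale endpoint.
func clearStaleUDPEntries(serviceIP, staleEndpointIP net.IP) (uint, error) {
	filter := &netlink.ConntrackFilter{}
	// Errors from the Add* calls are ignored here only to keep the sketch short.
	_ = filter.AddProtocol(unix.IPPROTO_UDP)
	_ = filter.AddIP(netlink.ConntrackOrigDstIP, serviceIP)
	_ = filter.AddIP(netlink.ConntrackReplySrcIP, staleEndpointIP)
	// Dumps the table once and deletes every flow matching all filter conditions.
	return netlink.ConntrackDeleteFilter(netlink.ConntrackTable, unix.AF_INET, filter)
}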

@adrianmoisey
Member

should I include that in this PR or a new one after this gets in?

My opinion is to include it in a new PR.

@aroradaman
Member

I don't think we handle the conntrack cleanup for services that transition from eTP to iTP. Ideally after the transition, all the entries DNATted to cluster endpoints should be removed, will also add an e2e for the same.

@aroradaman
Member

I don't think we handle the conntrack cleanup for services that transition from eTP to iTP. Ideally after the transition, all the entries DNATted to cluster endpoints should be removed, will also add an e2e for the same.

Currently, service changes (traffic policy and topology) only apply to new connections; with the reconciler we have the opportunity to apply the changes to existing connections as well.
@danwinship @aojea

@aojea
Member Author

aojea commented Sep 3, 2024

Currently, service changes (traffic policy and topology) only apply to new connections; with the reconciler we have the opportunity to apply the changes to existing connections as well. @danwinship @aojea

With the reconciler we can fix all of that, as we will no longer depend on the events, so we don't care about transitions anymore.

The reconciler will check the desired state by parsing all the Services and their characteristics, dump the conntrack table, and delete the conntrack entries we know will blackhole traffic ... it is UDP, so we don't need to react immediately.

@danwinship
Contributor

danwinship commented Sep 5, 2024

I don't think we handle the conntrack cleanup for services that transition from eTP to iTP. Ideally after the transition, all the entries DNATted to cluster endpoints should be removed

I'm not sure there is really a good use case for switching a live service between eTP and iTP, but if we do care about that, it seems more consistent to me to preserve existing connections rather than killing them.

EDIT: I think by "transition from eTP to iTP", Daman meant "transition either eTP or iTP between Local and Cluster", and that's certainly what I meant...

@adrianmoisey
Member

I'm not sure there is really a good use case for switching a live service between eTP and iTP, but if we do care about that, it seems more consistent to me to preserve existing connections rather than killing them.

EDIT: I think by "transition from eTP to iTP", Daman meant "transition either eTP or iTP between Local and Cluster", and that's certainly what I meant...

Speaking about changing iTP from Cluster to Local here:
I had a use case where a vendor's Helm chart changed their Service iTP from Cluster to Local for a DaemonSet that receives statsd metrics over UDP. Since iTP of Local is a "fairly" recent addition, I understand why they did that; I assume many of their customers had clusters created prior to Kubernetes 1.26 (when they didn't have the option of Local).

In this particular use case, I'd expect the blackhole conntrack entries to be flushed at some point after the iTP changes, in order to send traffic to the local Pod.

I can't say for certain if my expectation matches other use cases though.

@danwinship
Contributor

with the reconciler we can fix all of that as we no longer will depend on the events

Currently, when a service is deleted, we delete all existing DNAT records for its ClusterIP (DeletedUDPClusterIPs). You can't figure that out from reconciling, because you can't distinguish "a DNAT record for a ClusterIP that isn't used by any Service" from "a DNAT record for someone else's IP that has nothing to do with us". (Well, OK, if you have ServiceCIDRs you can figure it out.)

Then again, I'm not actually sure why we delete those entries... the original UDP conntrack problem bug report (#19029) only described the problem that is fixed by the DeletedUDPEndpoints case, but the fix (#22573) included the DeletedUDPClusterIPs part too. (NewlyActiveUDPServices came later.) Also, it's conspicuous that we delete records for ClusterIPs, but not for externalIPs, LB IPs, and NodePorts, and no one has ever reported this. It's possible this cleanup is actually pointless.

(I guess it has the effect of immediately killing UDP Service connections when you delete the Service, even if the endpoints still exist, but that behavior is the opposite of how TCP Services behave, so that doesn't seem like a good argument for keeping it...)

@aojea
Member Author

aojea commented Sep 6, 2024

I'm sorry for being pedantic, but there are no connections in UDP; we delete a conntrack entry, and that is not a problem, we are not breaking traffic .... an application that is using UDP cannot assume that it has a connection and is always going to hit the same backend

@danwinship
Contributor

I'm not talking about "same backend", I'm talking about "at all".

Deleting a TCP Service while its endpoints are still running results in graceful termination of its clients; you can't create new connections, but existing connections continue to function for as long as the endpoints are there.

Deleting a UDP Service while its endpoints are still running results in immediate termination of all clients; further packets using previously-established 5-tuples will be dropped.

And this is something the code does deliberately; it could implement immediate termination in the TCP case (by deleting conntrack records for deleted TCP ClusterIPs too), or graceful termination in the UDP case (by not deleting conntrack records for deleted UDP ClusterIPs), but it doesn't. And this doesn't seem to have been an intentional decision. (And it only applies to ClusterIPs! ExternalIPs, LB IPs, and NodePorts get graceful termination for both TCP and UDP.)

@thockin
Member

thockin commented Sep 6, 2024

Akshually...

I recall having this discussion in the sig a couple of years ago. I'm on mobile now but I can try to find the issue later. We talked about whether it is correct to leave TCP connections open when the service has been deleted, or whether we should break them from the middle. Deleting the conntrack entries does NOT break TCP sessions the way you think it might. I was pretty surprised. I tried, but I did not find an obvious way to break a TCP session from the middle, short of a black hole.

IIRC we didn't really come to a conclusion about whether it's right or wrong, but it didn't matter because we didn't have a mechanism.

Maybe ebpf has a better way of doing this, or maybe using the netlink library, rather than the conntrack tool?

@aojea
Member Author

aojea commented Sep 6, 2024

Deleting a UDP Service while its endpoints are still running results in immediate termination of all clients; further packets using previously-established 5-tuples will be dropped.

Doh, I see it now, I was not getting the point, sorry ... but we never need to delete entries if there are no backends; we only need to delete entries when there is at least one backend, because the other entries may blackhole the legit traffic. If there are no backends we let the packets flow.

@thockin I think you mean this #108523

Maybe ebpf has a better way of doing this

yeah, it does

@aojea
Member Author

aojea commented Sep 7, 2024

I tried but I did not find an obvious way to break a TCP session from the middle, short of a black hole.

@thockin in this case we don't want to leave the connection hanging, since TCP has long timers; even if we manage to delete the entry, I think we want to signal both ends by sending a RST so they don't leave sockets open.

@danwinship
Contributor

danwinship commented Sep 9, 2024

Maybe ebpf has a better way of doing this

ebpf programs can read arbitrary conntrack entries, while netfilter rules can only see the results of applying the current conntrack entries to the current connection... but it doesn't seem like reading arbitrary entries would really be helpful.

For nftables kube-proxy, "Drop any packets to IPs in the ServiceCIDR range that aren't currently-active ClusterIPs" is just a recombination of the two behaviors in #122692, and should be doable with one or two new chains with the right hook+priority, with a single rule.

or maybe using the netlink library, rather than the conntrack tool?

The big thing the netlink library gives us is that we can delete arbitrary sets of entries (in particular, now we can delete blackhole entries for a given IP without also deleting DNAT entries for that IP, which we couldn't do before, #122741), so if the problem before had been "there's no way to delete just the conntrack entries we want to delete", then the library should help. But if the problem was "I deleted all the conntrack entries related to the cluster IP and open connections still stayed open" then the netlink library probably won't help.
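
To make that distinction concrete: in the entries pasted earlier in this thread, a non-DNATed (blackhole-style) entry has the Service IP itself as the reply source, while a DNATed entry has the endpoint IP there. A hypothetical predicate over the netlink flow type could look like the following (assuming the ConntrackFlow Forward/Reverse tuples; this is not a kube-proxy function):

package conntrack

import (
	"net"

	"github.com/vishvananda/netlink"
)

// isBlackholeEntry reports whether a flow headed to a Service IP was never
// DNATed: the reply tuple's source is the Service IP itself rather than an
// endpoint IP (compare the "src=10.12.201.0 ... [UNREPLIED]" entry earlier in
// this thread). Illustrative helper only.
func isBlackholeEntry(f *netlink.ConntrackFlow, serviceIP net.IP) bool {
	return f.Forward.DstIP.Equal(serviceIP) && f.Reverse.SrcIP.Equal(serviceIP)
}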

@thockin
Member

thockin commented Sep 9, 2024

FTR: #108523 (comment) and down was an interesting refresher on the state of things.

But if the problem was "I deleted all the conntrack entries related to the cluster IP and open connections still stayed open" then the netlink library probably won't help.

I repeated the experiment in #108523 (comment) and got the same results, whether client and server were on the same node or different nodes (through a ClusterIP). Perhaps conntrack -F is not aggressive enough? Or, more likely, the conntrack listing is a reflection of other state (the established TCP session) and thus can "helpfully" be rebuilt.

So again, if the intention is to break a TCP connection "from the middle", I don't think conntrack is the mechanism, or at least I am not using it right.

@aroradaman
Member

Can we set custom marks on conntrack entries that have been DNATed by the proxy?

Currently, we use marking at the packet level, but this is limited to masquerade hairpin traffic, and we also remove the mark in post-routing. By using CONNMARK / ct mark we can set a mark on the flow entry itself, which will help distinguish flows that are created or DNATed by the proxy. From a reconciliation perspective, this marking can be used as a filter to prevent accidental cleanup of non-kube-proxy owned flows.

@aroradaman
Member

aroradaman commented Sep 11, 2024

Can we set custom marks on conntrack entries that have been DNATed by the proxy?

Currently, we use marking at the packet level, but this is limited to masquerade hairpin traffic, and we also remove the mark in post-routing. By using CONNMARK / ct mark we can set a mark on the flow entry itself, which will help distinguish flows that are created or DNATed by the proxy. From a reconciliation perspective, this marking can be used as a filter to prevent accidental cleanup of non-kube-proxy owned flows.

ClusterIPs can be filtered out using ServiceCIDRs; CONNMARK will help distinguish DNAT flows for LoadBalancerIPs, ExternalIPs and NodePorts that are owned by the proxy.

The current approach I'm experimenting with is clearing flow entries whose original destination is a ClusterIP/ExternalIP/LoadBalancerIP [--orig-dst] or NodePort [--orig-port-dst] and whose DNATed destination [--reply-src] doesn't belong to any of the serving endpoints.
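
For illustration, a sketch of that check done in userspace over a dumped table, shown here for the NodePort case; staleNodePortFlows and the servingEndpoints set are hypothetical names, and the same idea applies to the IP-based [--orig-dst] checks:

package conntrack

import (
	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// staleNodePortFlows returns dumped UDP flows whose original destination port
// is the given NodePort and whose DNAT target (reply source) is not one of the
// currently serving endpoints. Illustrative only.
func staleNodePortFlows(flows []*netlink.ConntrackFlow, nodePort uint16, servingEndpoints map[string]bool) []*netlink.ConntrackFlow {
	var stale []*netlink.ConntrackFlow
	for _, f := range flows {
		if f.Forward.Protocol != unix.IPPROTO_UDP || f.Forward.DstPort != nodePort {
			continue
		}
		if !servingEndpoints[f.Reverse.SrcIP.String()] {
			stale = append(stale, f)
		}
	}
	return stale
}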

@aojea
Member Author

aojea commented Sep 12, 2024

CONNMARK will help distinguish DNAT flows for LoadBalancerIPs, ExternalIPs and NodePorts that are owned by the proxy.

I have two concerns with this approach:

The current approach I'm experimenting with is clearing flow entries whose original destination is a ClusterIP/ExternalIP/LoadBalancerIP [--orig-dst] or NodePort [--orig-port-dst] and whose DNATed destination [--reply-src] doesn't belong to any of the serving endpoints.

I think it is correct; it is simple and solves all the problems: any UDP entry that is stale on an active Service (active meaning it has at least one endpoint) should be deleted on every sync period.

@danwinship
Contributor

  • mark becomes an API and a point of conflict with other elements setting marks

Yes, although the situation when talking about conntrack is a little bit better than with the skb mark; we can use connlabel, which is 128 bits, where it's intended that any given app only use a single bit of it (which is all we'd need here, to mark the entry as ours/not-ours). It still has the problem that there's no easy/dynamic way to negotiate who is using which bit, but it at least doesn't have the problem mark/connmark have where people want to use it as a numeric value and so there aren't enough bits to go around.

I think that independently of who creates the entries, if something external creates a conntrack entry that blackholes the ClusterIP/ExternalIP/... of a Kubernetes Service, kube-proxy should delete it anyway, as it should be authoritative on the system to do that

I disagree. If you, as an admin, don't want to break kubernetes networking then don't break kubernetes networking. There are a million different ridiculous things someone could do as root and there is no way we can cope with all of them. If the admin says "ip link set eth0 down", it is not kube-proxy's responsibility to bring the interface back up. If the admin creates rules blackholing some kube-proxy traffic, it is not kube-proxy's responsibility to remove them (particularly since it seems more likely than not in this case that this would have been intentional rather than accidental).

@jallirs

jallirs commented Oct 30, 2024

Just one thought: given that not every UDP service is bidirectional (things like UDP programs that listen for heartbeats don't do request/reply back on the same tuple, they're just one-way traffic), it should be possible to make UDP connection state tracking optional per UDP service.

If we move to deleting the flow entry from conntrack (even if it's perceived as "stale" because there was no reply to the first packet), that just causes churn in the conntrack table. When the next packet comes, it will be NATted to a new port tuple.

Also, what about scalability? Given a large number of UDP flows, how scalable is it to delete all the stale flows?

If you move to the other kube-proxy backends like IPVS, or even to external load balancers, this expectation of bidirectional UDP traffic doesn't exist. As such, to be consistent, expecting bidirectional UDP traffic should be optional per Service, or at least stated as a limitation of the iptables backend: "all UDP services deployed in Kubernetes should be bidirectional, and every first packet requires a reply packet, or traffic on a specific flow may blackhole".

@aojea
Member Author

aojea commented Oct 30, 2024

If we move to deleting the flow entry from conntrack (even if it's perceived as "stale" because there was no reply to the first packet), that just causes churn in the conntrack table. When the next packet comes, it will be NATted to a new port tuple.

If there is a rule that matches the conntrack entry then it is not perceived as stale; the logic is not based on whether there is a response or not, it is about whether there is a conntrack entry that does not have a rule associated with it.

@jallirs

jallirs commented Oct 30, 2024

"--ctstate", "RELATED,ESTABLISHED",

I’ve seen this and other cases where packets are only forwarded if they are in the "related" or "established" state in conntrack and dropped if they’re in the "invalid" state. In practice, for UDP services this means that the first packet must get a reply, or the flow never enters the "established" state and all subsequent inbound traffic is dropped.

What I’m trying to say is stated in better words here: https://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#UDPCONNECTIONS

And the code in netfilter:

https://github.com/torvalds/linux/blob/master/net/netfilter/nf_conntrack_proto_udp.c#L104

@adrianmoisey
Member

Now that this has landed in master, should it also be cherry-picked into previous versions?

@danwinship
Contributor

no, it's too big to cherry-pick.
but your bug was probably fixed by #127780, which was cherry-picked
