kube-proxy nftables tests are flaky #128829
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @danwinship
"Device or resource busy" is It looks like this is probably a failure in the partial-sync code; it's not removing a service chain that it should be removing, so it's not possible to remove the corresponding endpoint chains. |
is it harmless? a red herring?
It might be harmless. If it's removing service IPs from the maps but failing to delete the service chains, then the failure to delete the endpoint chains is harmless (since there's no way for packets to reach them). But if it's not removing service IPs from the maps, then there's a chance it could screw things up if a service IP gets reused later...
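To make the failure mode concrete, here is a minimal sketch of the nftables constraint involved, with made-up table and chain names rather than kube-proxy's generated ones: a chain that is still the target of a jump rule cannot be deleted, and the kernel reports exactly this "Device or resource busy" error.

```bash
# Hypothetical demo of the EBUSY dependency; "demo", "service-chain" and
# "endpoint-chain" are illustrative names, not kube-proxy's generated ones.
nft add table ip demo
nft add chain ip demo service-chain
nft add chain ip demo endpoint-chain
nft add rule ip demo service-chain jump endpoint-chain

# Fails with "Device or resource busy": service-chain still jumps to endpoint-chain.
nft delete chain ip demo endpoint-chain

# Drop the referencing rule first and the delete succeeds.
nft flush chain ip demo service-chain
nft delete chain ip demo endpoint-chain
```

So if a partial sync leaves a stale service chain (and its jump rules) behind, every later attempt to clean up the endpoint chains it references will keep failing the same way.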
cc @npinaeva
/priority important-soon
Checking this occurrence: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858725882303090688
The pod is not able to connect to the Service.
The client runs in kind-worker2.
The rule for the service cannot be added because the sync is failing constantly.
This should be a blocker for GA, there are 52
@aojea @danwinship it looks like a kernel version update is the reason for these failures.
I think COS 105 LTS is the one based on kernel 5.15. @SergeyKanzhelev @yujuhong, can you help me connect the kernel version to the COS image version? https://cloud.google.com/container-optimized-os/docs/release-notes
@BenTheElder @ameukam @mauriciopoppe, is there any plan to update the node pools we use in CI to a more modern kernel version?
@aojea We use Ubuntu (22.04) for the nodepools, and they are auto-upgraded based on the GKE version.
5.15 should be new enough... it just seems like maybe this particular build has a bad backport or something...
https://github.com/kubernetes/k8s.io/blob/6f350eef158a5f0268adee8395c9827b5553a55b/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L100 seems to be the prow builder VM, not a GCE nodepool VM. However, because this is a kind cluster, I believe the k8s cluster would be built on top of the prow builder VM. From #128829 (comment) I believe the next question is about the GCE nodepool VMs, but I'll try to answer it:

The last update of the COS version for data-plane nodes created on top of GCE VMs that I remember was through kubernetes/test-infra#31016. It's been a while since then, so I'm not sure if there were more changes. If testing on COS, the COS-to-kernel version mapping is in the release notes linked below.

You can find additional details in https://cloud.google.com/container-optimized-os/docs/release-notes. For the prow builder VM and kind, https://github.com/kubernetes/k8s.io/blob/6f350eef158a5f0268adee8395c9827b5553a55b/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L100 points to a GKE nodepool running Ubuntu 22.04, which uses kernel 5.15 as mentioned in #128829 (comment). I can give more details about the Ubuntu 22.04 version; where can I see the prow builder VM creation logs or the terraform logs? I'd like to know the GKE version that was used to create the nodepool; the reason is that the GKE version would change over time because of auto-upgrades.
here are the job artifacts https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/pr-logs/pull/128886/pull-kubernetes-e2e-kind-nftables/1859623692095459328/ hopefully there is some useful info
We use managed clusters where possible (because we don't have a lot of time to operate Kubernetes vs develop it), and that includes the OS image in the GKE case, so that cadence is automated and tied to the GKE release channel. We have currently opted for Ubuntu because it had IPv6 kernel modules (available, but not loaded) when COS did not, years ago, and we've been using it ever since setting up the first IPv6 (kind) jobs. It's possible we could switch to COS but ... If we need to test on specific kernel versions, we should implement that directly (disposable VMs etc. where we explicitly control this in the CI config).

The kind jobs running within the CI pods currently run primarily on the main GKE CI cluster but could be on EKS if we need to shift costs, or maybe someday, with more funding, another provider. I'm not sure what kernels we have on EKS currently, but it should be something that works with a stable Kubernetes release.
These are autoscaled and I'm not sure we're retaining the VM logs long-term... From the job artifacts' podinfo.json it ran on a node in pool5; pool5 is currently 1.30.5-gke.1443001 on the regular channel.
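(For anyone else digging through artifacts, a quick way to pull the node name out of podinfo.json; this assumes prow's usual layout where the Pod object sits under a top-level "pod" key, so adjust the path if the layout differs.)

```bash
# Assumes prow's podinfo.json layout with the Pod object under a "pod" key.
jq -r '.pod.spec.nodeName' podinfo.json
```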
1.30.5-gke.1443001 -> ubuntu-gke-2204-1-30-v20240929 with the following release notes:
With the current setup, where the GKE version comes from the regular channel, we could possibly bump to Ubuntu 24.04 (which uses kernel 6.8) in GKE 1.32. However, this is still up in the air; we are aware of at least one blocker issue that, if not addressed, would make us stay with Ubuntu 22.04 in GKE 1.32. Anyway, in the happy path where we adopt it in GKE 1.32, I think a good estimate for it to be available in Regular is the last week of January 2025. #128829 (comment) has a good insight on a possible kernel diff that might have introduced flakiness; @npinaeva, we can ask Canonical about kernel diffs between these two versions.
Filed a bug https://bugs.launchpad.net/ubuntu/+bug/2089699, let's see how it goes
@mauriciopoppe, I'd appreciate it if you can get some eyes on this from Canonical, in case you have any contact.
Yes, I'll meet them tomorrow and I'll talk to them about bugs.launchpad.net/ubuntu/+bug/2089699, thanks for filing it.
Canonical is aware of bugs.launchpad.net/ubuntu/+bug/2089699 and was looking for info to reproduce this. Having a similar environment would be hard given that they don't have access to GKE, but I mentioned that they just need a GCE VM using the image that they provide to GKE. That's for the VM setup only; for the test it might be difficult to set it up in the same way as https://git.k8s.io/test-infra/config/jobs/kubernetes/sig-network/sig-network-kind.yaml. Is there a way to provide a shell script that can run the test? E.g. turn the Pod spec into something that can be run through a script or through a regular Pod?
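A minimal sketch of what such a script could look like, assuming Docker, a recent kind release with nftables kube-proxy support, and a local kubernetes/kubernetes checkout; the config file name, image tag, and ginkgo focus are illustrative, not the exact CI configuration:

```bash
#!/usr/bin/env bash
# Illustrative reproduction script; this approximates what the CI job does,
# it is not the exact prow configuration.
set -euxo pipefail

cat > kind-nftables.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: nftables
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

# Build a node image from a local kubernetes checkout and create the cluster.
cd kubernetes
kind build node-image . --image kindest/node:ci
kind create cluster --config ../kind-nftables.yaml --image kindest/node:ci

# Build and run a subset of the sig-network e2e tests against the cluster.
make WHAT=test/e2e/e2e.test
_output/bin/e2e.test -kubeconfig "$HOME/.kube/config" \
  -ginkgo.focus='\[sig-network\].*Services' -ginkgo.skip='\[Disruptive\]'
```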
in this log the first error is only a minute and a half after kube-proxy starts up, so probably we should be able to just give them a set of nft commands to replay.
OK, this script just replays the nftables commands from one of the runs in #129061 up to the point where it failed. (Maybe it would have been better to get more commands after that, though if they try running it and it doesn't fail, they could just try immediately running again...)
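The attached script itself isn't reproduced here, but a replay harness along those lines could look like this sketch, assuming a hypothetical nft-commands.txt with one recorded command per line:

```bash
#!/usr/bin/env bash
# Hypothetical replay harness, not the actual script attached to #129061:
# feed each recorded nftables command back into nft and stop at the first failure.
set -u
lineno=0
while IFS= read -r cmd; do
  lineno=$((lineno + 1))
  if ! echo "$cmd" | nft -f -; then
    echo "replay failed at line $lineno: $cmd" >&2
    exit 1
  fi
done < nft-commands.txt
echo "replay completed without errors"
```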
Ok, it seems the bug was identified and a fix released: https://lists.ubuntu.com/archives/kernel-team/2024-December/155790.html
@mauriciopoppe should we avoid the versions of Ubuntu with this bug?
If we know a fix is available and we identify the GKE version, I can look into manually requesting a node pool upgrade ahead of the automated schedule.
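(For reference, a manual node pool upgrade would look roughly like the gcloud invocation below; the cluster, pool, version, and region are placeholders, and in practice the change would go through the usual k8s-infra process.)

```bash
# Placeholder names/versions; the real values would come from the k8s-infra config.
gcloud container clusters upgrade prow-build \
  --node-pool=pool5 \
  --cluster-version=1.30.x-gke.NNNN \
  --region=us-central1
```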
Canonical update: it's reproducible in the generic kernels, and a patchset was submitted for review on the mailing list (maybe the one on lists.ubuntu.com/archives/kernel-team/2024-December/155790.html, as pointed out in #128829 (comment)). I'll post another update when we get a new GKE version with the fix. Usually, after Canonical creates a new image for GKE, it takes ~2 weeks for the GKE version to be available for manual upgrade.
This has been getting worse over the last few days: https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20master
The kernel is still the same.
@ameukam @BenTheElder do we have an alternative to move to a more stable environment in our CI? More jobs are failing and I don't like being blind because of this known bug.
We could probably set up a COS nodepool, initially with a taint/label, and start pinning some of these jobs? I think we should probably consider generally migrating; COS is the recommended default, and the only reason we switched previously was to get the IPv6 jobs working (since we could modprobe IPv6 iptables on Ubuntu even though they also weren't loaded by default). IIRC COS has IPv6 now?
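A rough sketch of what a tainted/labelled COS nodepool could look like; the cluster, pool, label, and taint names are made up, and the real change would go through the k8s.io terraform rather than ad-hoc gcloud:

```bash
# Illustrative only; names are placeholders and the real change belongs in the k8s.io terraform.
gcloud container node-pools create pool-cos-test \
  --cluster=prow-build --region=us-central1 \
  --image-type=COS_CONTAINERD \
  --node-labels=image=cos \
  --node-taints=dedicated=cos-test:NoSchedule \
  --enable-autoscaling --min-nodes=0 --max-nodes=6
```

The nftables jobs would then opt in by adding a matching nodeSelector and toleration to their prowjob pod specs.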
yeah, COS 93 IIUIC https://cloud.google.com/container-optimized-os/docs/release-notes/m93
A negative point of using COS is that Google Cloud might be the primary (or only) user, so the signal wouldn't help other companies; a general-purpose OS like Ubuntu is good common ground. Ideally, it'd be nice to increase our test dimensions to test against both OSes, or to have an additional test dimension against COS that would give a signal to Google Cloud. In addition, do upstream tests install packages through DaemonSets or startup scripts after the node is booted? COS has a read-only filesystem, so it's not possible to install packages at runtime; that might be a limitation/blocker if there's a migration to COS.
While true, we have other e2e jobs for that which create "real" cloud clusters, and on the EKS prow build cluster (where some of the other jobs run) we're using Amazon Linux, so ... I don't think we should attempt to increase OS coverage with the kind jobs. For node_e2e and GCE/EC2 cluster e2e we do run with other OSes. Are we running any other jobs with nftables enabled yet?
We're running prow's general CI pods, like "run the unit tests in this container"; it's just that in this case one of those pods also runs a Kubernetes cluster and happens to share the host kernel (and any issues with that kernel), but it's not intended for kernel coverage versus testing the Kubernetes components against each other.

I'm ~out until EOY starting tomorrow, but anyone could go ahead and take a stab at the cluster terraform and prowjob updates; @upodroid recently enabled atlantis for GCP terraform so it should auto-deploy now, I think?
oh, I almost forgot about it, I can't remember now which version of COS will have the necessary kernel modules: kubernetes/test-infra#32485
EDIT: it will be in COS 113 Build 18244-85-14, https://cloud.google.com/container-optimized-os/docs/release-notes/m113#cos-113-18244-85-14_
What version of COS do we have in our CI now?
Which jobs are flaking?
https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20master
https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20IPv6,%20master
Which tests are flaking?
Seems to impact tests randomly
Since when has it been flaking?
15-11-2024
Testgrid link
No response
Reason for failure (if possible)
Checking https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858001043166597120/artifacts/kind-worker/pods/kube-system_kube-proxy-tbpmz_fdcd393e-47df-4afe-a88e-27eaa918f570/kube-proxy/0.log,
it seems there is some contention on the system.
It seems to be present in multiple jobs, e.g. https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1857638650775343104/artifacts/kind-worker/pods/kube-system_kube-proxy-mcfxw_41e1cb46-c2e5-440c-8174-246253f0def2/kube-proxy/0.log; most probably, reconciling solves the problem on some of them.
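(A quick way to gauge how often the sync error shows up in one of these kube-proxy logs; LOG_URL is whichever log artifact is being inspected, and the grep pattern is the "Device or resource busy" message discussed above.)

```bash
# Count occurrences of the failing-sync error in a kube-proxy log artifact.
curl -sL "$LOG_URL" | grep -c 'Device or resource busy'
```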
Anything else we need to know?
No response
Relevant SIG(s)
/sig network