
kube-proxy nftables test are flaky #128829

Open
aojea opened this issue Nov 17, 2024 · 35 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@aojea
Member

aojea commented Nov 17, 2024

Which jobs are flaking?

https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20master
https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20IPv6,%20master

Which tests are flaking?

Seems to impact tests randomly

Since when has it been flaking?

15-11-2024

Testgrid link

No response

Reason for failure (if possible)

Looking at https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858001043166597120/artifacts/kind-worker/pods/kube-system_kube-proxy-tbpmz_fdcd393e-47df-4afe-a88e-27eaa918f570/kube-proxy/0.log

it seems there is some contention on the system:

2024-11-17T04:51:06.117022547Z stderr F E1117 04:51:06.116486       1 proxier.go:1210] "Unable to delete stale chains; will retry later" err=<
2024-11-17T04:51:06.117050539Z stderr F 	/dev/stdin:2:28-104: Error: Could not process rule: Device or resource busy
2024-11-17T04:51:06.117056524Z stderr F 	delete chain ip kube-proxy endpoint-LR2XJHKW-services-4884/service-proxy-toggled/tcp/__10.244.1.180/9376
2024-11-17T04:51:06.117062922Z stderr F 	                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-17T04:51:06.117070369Z stderr F 	/dev/stdin:51:28-109: Error: Could not process rule: Device or resource busy
2024-11-17T04:51:06.117075351Z stderr F 	delete chain ip kube-proxy endpoint-23XNSDHO-nettest-2983/session-affinity-service/udp/udp__10.244.1.119/8081
2024-11-17T04:51:06.117079767Z stderr F 	                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-17T04:51:06.117084441Z stderr F 	/dev/stdin:57:28-108: Error: Could not process rule: Device or resource busy
2024-11-17T04:51:06.117088698Z stderr F 	delete chain ip kube-proxy endpoint-CLRXJGC4-services-8162/affinity-clusterip-timeout/tcp/__10.244.1.97/9376
2024-11-17T04:51:06.117092772Z stderr F 	                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-17T04:51:06.117096629Z stderr F 	/dev/stdin:58:28-104: Error: Could not process rule: Device or resource busy
2024-11-17T04:51:06.117100766Z stderr F 	delete chain ip kube-proxy endpoint-O34ANCIM-services-4884/service-proxy-toggled/tcp/__10.244.1.181/9376
2024-11-17T04:51:06.117122875Z stderr F 

It seems to be present in multiple jobs, e.g. https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1857638650775343104/artifacts/kind-worker/pods/kube-system_kube-proxy-mcfxw_41e1cb46-c2e5-440c-8174-246253f0def2/kube-proxy/0.log; most probably reconciling solves the problem in some of them:

2024-11-16T04:31:53.06971039Z stderr F I1116 04:31:53.068065       1 proxier.go:1174] "Syncing nftables rules" ipFamily="IPv4"
2024-11-16T04:31:53.069715193Z stderr F I1116 04:31:53.068092       1 proxier.go:1204] "Deleting stale nftables chains" ipFamily="IPv4" numChains=4
2024-11-16T04:31:53.126729378Z stderr F E1116 04:31:53.125738       1 proxier.go:1210] "Unable to delete stale chains; will retry later" err=<
2024-11-16T04:31:53.12702478Z stderr F 	/dev/stdin:2:28-100: Error: Could not process rule: Device or resource busy
2024-11-16T04:31:53.12703517Z stderr F 	delete chain ip kube-proxy endpoint-YHMEV5IX-services-5210/externalip-test/tcp/http__10.244.1.6/9376
2024-11-16T04:31:53.127040928Z stderr F 	                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-16T04:31:53.127046499Z stderr F  > ipFamily="IPv4"

Anything else we need to know?

No response

Relevant SIG(s)

/sig network

@aojea aojea added the kind/flake Categorizes issue or PR as related to a flaky test. label Nov 17, 2024
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 17, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 17, 2024
@aojea
Member Author

aojea commented Nov 17, 2024

/assign @danwinship

@danwinship
Contributor

"Device or resource busy" is EBUSY, which in the context of "nft delete chain" means "you're trying to delete a chain that is still referenced from elsewhere".

It looks like this is probably a failure in the partial-sync code; it's not removing a service chain that it should be removing, so it's not possible to remove the corresponding endpoint chains.
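As a minimal illustration of that error mode (the table and chain names below are made up, not kube-proxy's real ones):

nft add table ip demo
nft add chain ip demo svc-chain
nft add chain ip demo ep-chain
nft add rule ip demo svc-chain jump ep-chain
nft delete chain ip demo ep-chain    # fails: "Device or resource busy" (EBUSY), the jump still references it
nft flush chain ip demo svc-chain    # drop the reference (kube-proxy would remove the whole service chain)
nft delete chain ip demo ep-chain    # now succeeds
nft delete table ip demo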

@aojea
Member Author

aojea commented Nov 18, 2024

is it harmless? a red herring?

@danwinship
Contributor

It might be harmless. If it's removing service IPs from the maps but failing to delete the service chains, then the failure to delete the endpoint chains is harmless (since there's no way for packets to reach them). But if it's not removing service IPs from the maps, then there's a chance it could screw things up if a service IP gets reused later...
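Roughly, the structure being discussed looks like the following sketch (illustrative names and a standalone chain, not kube-proxy's exact ruleset): a verdict-map element steers a service IP to its service chain, which jumps to per-endpoint chains, so once the map element is gone the stale chains are unreachable, while a leftover element plus a reused IP would still route traffic into them.

nft add table ip demo
nft add map ip demo service-ips '{ type ipv4_addr . inet_proto . inet_service : verdict ; }'
nft add chain ip demo services
nft add rule ip demo services 'ip daddr . meta l4proto . th dport vmap @service-ips'
nft add chain ip demo svc-ABC123
nft add chain ip demo ep-XYZ789
nft add rule ip demo svc-ABC123 jump ep-XYZ789
nft add element ip demo service-ips '{ 10.96.0.10 . tcp . 80 : goto svc-ABC123 }'
# harmless case: the element is removed, so the stale svc-/ep- chains can no longer be reached
nft delete element ip demo service-ips '{ 10.96.0.10 . tcp . 80 }'
# risky case: if the element were left behind, a later reuse of 10.96.0.10 would still hit the stale chains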

@danwinship
Contributor

cc @npinaeva

@aojea
Member Author

aojea commented Nov 19, 2024

/priority important-soon

Checking this occurrence https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858725882303090688

The pod is not able to connect to the Service

I1119 04:40:04.726397 71449 builder.go:146] stderr: "+ curl -q -s --connect-timeout 5 172.18.0.2:31520/hostname\n"
I1119 04:40:04.726459 71449 builder.go:147] stdout: "webserver-pod"
I1119 04:40:04.726616 71449 service.go:2126] Unexpected error: pod on node1 still serves traffic: 
    <context.deadlineExceededError>: 
    context deadline exceeded
    
        {}
[FAILED] pod on node1 still serves traffic: context deadline exceeded

The client runs in kind-worker2

I1119 04:40:04.963881 71449 resource.go:175] pause-pod-1    kind-worker2  Running         [{PodReadyToStartContainers True 0001-01-01 00:00:00 +0000 UTC 2024-11-19 04:39:04 +0000 UTC  } {Initialized True 0001-01-01 00:00:00 +0000 UTC 2024-11-19 04:39:02 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2024-11-19 04:39:04 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2024-11-19 04:39:04 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2024-11-19 04:39:02 +0000 UTC  }]

The rule for the service cannot be added because the sync is failing constantly:

2024-11-19T04:39:32.738428609Z stderr F I1119 04:39:32.734032       1 proxier.go:1174] "Syncing nftables rules" ipFamily="IPv4"
2024-11-19T04:39:32.906117415Z stderr F I1119 04:39:32.902204       1 proxier.go:1794] "Reloading service nftables data" ipFamily="IPv4" numServices=82 numEndpoints=88
2024-11-19T04:39:33.035933438Z stderr F E1119 04:39:33.034689       1 proxier.go:1805] "nftables sync failed" err=<
2024-11-19T04:39:33.035955957Z stderr F 	/dev/stdin:419:1-122: Error: Could not process rule: File exists
2024-11-19T04:39:33.035961651Z stderr F 	add element ip kube-proxy no-endpoint-nodeports { tcp . 30301 comment "services-7322/svc-proxy-terminating:http" : drop }
2024-11-19T04:39:33.035966413Z stderr F 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-19T04:39:33.036117951Z stderr F  > ipFamily="IPv4"

This should be a blocker for GA; there are 52 "nftables sync failed" errors in https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858725882303090688/artifacts/kind-worker2/pods/kube-system_kube-proxy-nddrb_99e30ca6-ba3d-419a-873d-1b942b553f5e/kube-proxy/0.log that are impacting basic Service functionality.
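(As a side note, that count should be reproducible directly from the linked log with a one-liner like this, assuming curl and grep are available:)

curl -s https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1858725882303090688/artifacts/kind-worker2/pods/kube-system_kube-proxy-nddrb_99e30ca6-ba3d-419a-873d-1b942b553f5e/kube-proxy/0.log | grep -c 'nftables sync failed'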

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 19, 2024
@thockin thockin added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 21, 2024
@npinaeva
Member

@aojea @danwinship it looks like a kernel version update is the reason for these failures.
On 11-15 the kernel version was updated from 5.15.0-1061-gke to 5.15.0-1067-gke, and that is the first time (AFAICS) we saw this problem. We have already seen that kube-proxy issues correct transactions and gets unexpected errors.
Is there a way to reach out to the team that builds those GKE kernels and check what changes were introduced around netfilter?

@aojea
Member Author

aojea commented Nov 25, 2024

I think COS 105 LTS is the one based on kernel 5.15. @SergeyKanzhelev @yujuhong can you help me connect the kernel version to the COS image version?

https://cloud.google.com/container-optimized-os/docs/release-notes

@BenTheElder @ameukam @mauriciopoppe is there any plan to update the node pools we use on CI to a more modern version of the kernel?

@ameukam
Member

ameukam commented Nov 25, 2024

@aojea We use Ubuntu (22.04) for the nodepools:
https://github.com/kubernetes/k8s.io/blob/6f350eef158a5f0268adee8395c9827b5553a55b/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L100

and they are auto-upgraded based on the GKE version (currently v1.30.5-gke.1443001).
Maybe ask internally (or bobbypage) which GKE version has the kernel version you are looking for?
AFAIK:
GKE 1.31: Ubuntu 22.04 (kernel 5.17.x)
GKE 1.32: Ubuntu 24.04 (kernel 6.8.x)

@danwinship
Contributor

5.15 should be new enough... it just seems like maybe this particular build has a bad backport or something...

@mauriciopoppe
Member

https://github.com/kubernetes/k8s.io/blob/6f350eef158a5f0268adee8395c9827b5553a55b/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L100 seems to be the prow builder VM, not a GCE nodepool VM. However, because this is a kind cluster, I believe the k8s cluster is built on top of the prow builder VM.

From #128829 (comment) I believe the next question is for the GCE nodepool VMs but I'll try to answer it:

I think COS 105 LTS is the one based on kernel 5.15. @SergeyKanzhelev @yujuhong can you help me connect the kernel version to the COS image version?

The last COS version update I remember for data plane nodes created on top of GCE VMs was through kubernetes/test-infra#31016. It's been a while since then, so I'm not sure if there were more changes. If testing on COS, the kernel mapping is as follows:

  • COS 109 - 6.1
  • COS 113 - 6.1
  • COS 117 - 6.6

You can find additional details in https://cloud.google.com/container-optimized-os/docs/release-notes.

For the prow builder VM and kind, https://github.com/kubernetes/k8s.io/blob/6f350eef158a5f0268adee8395c9827b5553a55b/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L100 points to a GKE nodepool running Ubuntu 22.04, which uses kernel 5.15, as mentioned in #128829 (comment).

I can give more details about the Ubuntu 22.04 version. Where can I see the prow builder VM creation logs or the Terraform logs? I'd like to know the GKE version that was used to create the nodepool; the reason is that the GKE version changes over time because of release_channel = "REGULAR".


@BenTheElder
Member

@BenTheElder @ameukam @mauriciopoppe is there any plan to update the node pools we use on CI to a more modern version of the kernel?

We use managed clusters where possible (because we don't have a lot of time to operate Kubernetes vs develop it), and that includes the OS image in the GKE case, so that cadence is automated and tied to the GKE release channel.

We have currently opted for Ubuntu, because it had IPv6 kernel modules (available, but not loaded) when COS did not, years ago, and we've been using it ever since setting up the first IPv6 (kind) jobs. It's possible we could switch to COS but ...

If we need to test on specific Kernel versions, we should implement that directly (disposable VMs etc where we explicitly control this in the CI config).

The kind jobs running within the CI pods currently run primarily on the main GKE CI cluster, but could be on EKS if we need to shift costs or, maybe someday with more funding, another provider. I'm not sure what kernels we have on EKS currently, but it should be something that works with a stable Kubernetes release.

@BenTheElder
Member

I can give more details about the Ubuntu 22.04 version. Where can I see the prow builder VM creation logs or the Terraform logs? I'd like to know the GKE version that was used to create the nodepool; the reason is that the GKE version changes over time because of release_channel = "REGULAR".

These are autoscaled and I'm not sure we're retaining the VM logs long-term ... but they would be in the k8s-infra-prow-build GCP project. IAM is controlled in github.com/kubernetes/k8s.io; there are groups in groups/ with various permissions that are controlled by PR.

From the job artifacts' podinfo.json, it ran on node gke-prow-build-pool5-2021092812495606-e8f905a4-jswl.

pool5 is currently 1.30.5-gke.1443001 on regular channel.

@mauriciopoppe
Member

1.30.5-gke.1443001 -> ubuntu-gke-2204-1-30-v20240929 with the following release notes:

gsutil cat gs://ubuntu-os-gke-cloud/ubuntu-gke-2204-1-30-v20240929.release-notes-summary.json
{
    "build_info_serial": "20240929",
    "release_version": "22.04.5 LTS (Jammy Jellyfish)",
    "release_version_id": "22.04",
    "release_version_codename": "jammy",
    "linux_gke_version": "5.15.0-1067-gke",
    "gke_nvidia_driver_version": "535.161.07",
    "gke_nvidia_driver_gpu_versions": {},
    "gke_variant": "1.30",
    "gke_image_track": "stable",
    "gke_image_cgroups_variant": "2",
    "runc_version": "1.1.7-0ubuntu1~22.04.2",
    "containerd_version": "1.7.22-0ubuntu0~22.04.1~gke1",
    "docker_version": "20.10.12-0ubuntu3",
    "architecture": "amd64"
}

With the current setup, where the GKE version comes from the regular channel, we could possibly bump to Ubuntu 24.04 (which uses kernel 6.8) in GKE 1.32; however, this is still up in the air, as we are aware of at least one blocker issue that, if not addressed, would make us stay with Ubuntu 22.04 in GKE 1.32. Anyway, in the happy path where we adopt it in GKE 1.32, I think a good estimate for it to be available in the Regular channel is the last week of January 2025.

#128829 (comment) has a good insight on a possible kernel change that might have introduced flakiness. @npinaeva we can ask Canonical about the kernel diff between 5.15.0-1061-gke and 5.15.0-1067-gke that might have introduced a problem visible in this test.

@npinaeva
Member

Filed a bug https://bugs.launchpad.net/ubuntu/+bug/2089699, let's see how it goes

@aojea
Member Author

aojea commented Nov 26, 2024

#128829 (comment) has a good insight on a possible kernel change that might have introduced flakiness. @npinaeva we can ask Canonical about the kernel diff between 5.15.0-1061-gke and 5.15.0-1067-gke that might have introduced a problem visible in this test.

@mauriciopoppe I'd appreciate it if you could get some eyes on this from Canonical, in case you have any contacts.

@mauriciopoppe
Member

mauriciopoppe commented Nov 26, 2024

Yes, I'll meet them tomorrow and I'll talk to them about bugs.launchpad.net/ubuntu/+bug/2089699, thanks for filing it.

@mauriciopoppe
Member

Canonical is aware of bugs.launchpad.net/ubuntu/+bug/2089699 and is looking for info to reproduce this. Having a similar environment would be hard given that they don't have access to GKE, but I mentioned that they just need a GCE VM using the image that they provide to GKE. That covers the VM setup only; for the test itself it might be difficult to set it up in the same way as https://git.k8s.io/test-infra/config/jobs/kubernetes/sig-network/sig-network-kind.yaml.

Is there a way to provide a shell script that can run the test? E.g. turn the Pod spec into something that can be run through a script or through a regular Pod? For example:

  pod_spec:
    containers:
    - command:
      - wrapper.sh
      - bash
      - -c
      - curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz -
        -C "${PATH%%:*}/" && e2e-k8s.sh
      env:
      - name: GINKGO_TOLERATE_FLAKES
        value: "n"
      - name: KUBE_PROXY_MODE
        value: nftables
      - name: PARALLEL
        value: "true"
      - name: FOCUS
        value: \[sig-network\]|\[Conformance\]
      - name: SKIP
        value: Alpha|Beta|LoadBalancer|Disruptive|Flaky|IPv6DualStack|Networking-IPv6|Internet.connection
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: E2E_GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/service-account/service-account.json
      - name: GOOGLE_APPLICATION_CREDENTIALS_DEPRECATED
        value: Migrate to workload identity, contact sig-testing
      - name: DOCKER_IN_DOCKER_ENABLED
        value: "true"
      - name: GOPROXY
        value: https://proxy.golang.org
      - name: AWS_ROLE_SESSION_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      image: gcr.io/k8s-staging-test-infra/krte:v20241128-8df65c072f-master
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 9Gi
        requests:
          cpu: "4"
          memory: 9Gi
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /etc/service-account
        name: service
        readOnly: true
      - mountPath: /docker-graph
        name: docker-graph
      - mountPath: /var/lib/docker
        name: docker-root
      - mountPath: /lib/modules
        name: modules
        readOnly: true
      - mountPath: /sys/fs/cgroup
        name: cgroup
    volumes:
    - name: service
      secret:
        secretName: service-account
    - emptyDir: {}
      name: docker-graph
    - emptyDir: {}
      name: docker-root
    - hostPath:
        path: /lib/modules
        type: Directory
      name: modules
    - hostPath:
        path: /sys/fs/cgroup
        type: Directory
      name: cgroup

@aojea
Member Author

aojea commented Dec 2, 2024

@mauriciopoppe

  1. Create a GCE VM with the corresponding image version
  2. Install Docker and the latest kind release https://kind.sigs.k8s.io/docs/user/quick-start#installing-from-release-binaries
  3. Use the following manifest to create a cluster https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1863436984039510016/artifacts/kind-config.yaml
    kind create cluster --config kind-config.yaml
  4. Run the e2e test https://gist.github.com/aojea/097b5a8418fbbcb2b55e72a4cf6e62f7#file-conformance-sh
    Replace FOCUS and SKIP in that script and set KUBERNETES_VERSION to v1.31.1, for example (a condensed sketch of these steps follows below):
FOCUS=\[sig-network\]|\[Conformance\]
SKIP=\[Serial\]|Alpha|Beta|LoadBalancer|Disruptive|Flaky|IPv6DualStack|Networking-IPv6|Internet.connection
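A condensed sketch of those steps as a single script (assumptions: Docker is already installed, it runs as root, /usr/local/bin is the install target, and conformance.sh is the gist from step 4 with the variables already edited in; the kind install line reuses the tarball URL from the job spec above):

# run on the GCE VM (sketch only, not the exact CI job)
curl -sSL https://kind.sigs.k8s.io/dl/latest/linux-amd64.tgz | tar xvfz - -C /usr/local/bin/
curl -sSLO https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-kind-network-nftables/1863436984039510016/artifacts/kind-config.yaml
kind create cluster --config kind-config.yaml
# conformance.sh = the gist from step 4, edited with:
#   KUBERNETES_VERSION=v1.31.1
#   FOCUS='\[sig-network\]|\[Conformance\]'
#   SKIP='\[Serial\]|Alpha|Beta|LoadBalancer|Disruptive|Flaky|IPv6DualStack|Networking-IPv6|Internet.connection'
bash ./conformance.sh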

@danwinship
Contributor

In this log the first error is only a minute and a half after kube-proxy starts up, so we should probably be able to just give them a set of nft commands to run to simulate kube-proxy-like behavior. I can make a test PR to generate that...
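(Purely as an illustration of that idea, and not the actual generated script referenced in the next comment, a probe of that shape could look like the sketch below, with made-up table/chain names; it replays an add-reference/flush/delete chain lifecycle similar to kube-proxy's and flags any spurious EBUSY:)

nft add table ip repro
for i in $(seq 1 100); do
  nft add chain ip repro "svc-$i"
  nft add chain ip repro "ep-$i"
  nft add rule ip repro "svc-$i" jump "ep-$i"
  nft flush chain ip repro "svc-$i"                              # drop the only reference to ep-$i
  nft delete chain ip repro "ep-$i" || echo "spurious EBUSY at iteration $i"
  nft delete chain ip repro "svc-$i"
done
nft delete table ip repro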

@danwinship
Contributor

OK, this script just replays the nftables commands from one of the runs in #129061 up to the point where it failed. (Maybe it would have been better to get more commands after that, though if they try running it and it doesn't fail, they could just try immediately running again...)

@aojea
Member Author

aojea commented Dec 9, 2024

Ok, it seems the bug was identified and a fix released https://lists.ubuntu.com/archives/kernel-team/2024-December/155790.html

@mauriciopoppe should we avoid these versions of Ubuntu with this bug?

@BenTheElder
Member

If we know a fix is available and we identify the GKE version I can look into manually requesting a node pool upgrade ahead of the automated schedule.

@mauriciopoppe
Member

mauriciopoppe commented Dec 9, 2024

Canonical update: it's reproducible in the generic kernels, and a patchset was submitted for review on the mailing list (maybe the one at lists.ubuntu.com/archives/kernel-team/2024-December/155790.html, as pointed out in #128829 (comment)).

I'll post another update about when we get a new GKE version with the fix. Usually, after Canonical creates a new image for GKE it takes ~2 weeks for the GKE version to be available for manual upgrade.

@aojea
Member Author

aojea commented Dec 16, 2024

This has been getting worse over the last few days: https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20nftables,%20master

The kernel is still the same (5.15.0-1067-gke); more load on CI?

@aojea
Member Author

aojea commented Dec 16, 2024

@ameukam @BenTheElder do we have an alternative to move to a more stable environment in our CI? More jobs are failing and I don't want to be blind because of this known bug.

@BenTheElder
Member

@ameukam @BenTheElder do we have an alternative to move to a more stable environment in our CI? More jobs are failing and I don't want to be blind because of this known bug.

We could probably set up a COS nodepool, initially with a taint/label, and start pinning some of these jobs to it?

I think we should probably consider migrating in general: COS is the recommended default, and the only reason we switched previously was to get the IPv6 jobs working (since we could modprobe the IPv6 iptables modules on Ubuntu even though they also weren't loaded by default). IIRC COS has IPv6 now?

@aojea
Member Author

aojea commented Dec 17, 2024

IIRC COS has IPv6 now?

Yeah, since COS 93 IIUIC: https://cloud.google.com/container-optimized-os/docs/release-notes/m93

@mauriciopoppe
Member

A negative point of using COS is that Google Cloud might be the primary (or only) user, so the signal wouldn't help other companies. A general-purpose OS like Ubuntu is good common ground.

Ideally, it'd be nice to increase our test dimensions to test against both OSes, or have an additional test dimension against COS that would give signal to Google Cloud.

In addition, do upstream tests install packages through DaemonSets or startup scripts after the node is booted? COS has a read-only filesystem, so it's not possible to install packages at runtime; that might be a limitation/blocker if there's a migration to COS.

@BenTheElder
Member

A negative point of using COS is that Google Cloud might be the primary (or only) user, so the signal wouldn't help other companies. A general-purpose OS like Ubuntu is good common ground.

Ideally, it'd be nice to increase our test dimensions to test against both OSes, or have an additional test dimension against COS that would give signal to Google Cloud.

While true, we have other e2e jobs for that which create "real" cloud clusters, and on the EKS prow build cluster (where some of the other jobs run) we're using Amazon Linux so ...

I don't think we should attempt to increase OS coverage with kind; we're not really meaning to test the kernel with it, and the userspace doesn't match at all (kind's userspace is currently Debian), amongst other things.

For node_e2e and GCE/EC2 cluster e2e we do run with other OSes.

Are we running any other jobs with nftables enabled yet?

In addition, do upstream tests install packages through DaemonSets or startup scripts after the node is booted? COS has a read-only filesystem, so it's not possible to install packages at runtime; that might be a limitation/blocker if there's a migration to COS.

We're running prow's general CI pods, like "run the unit tests in this container"; it's just that in this case one of those pods also runs a Kubernetes cluster and happens to share the host kernel (and any issues with that kernel). It's intended for testing the Kubernetes components against each other, not for kernel coverage.


I'm ~out until EOY starting tomorrow, but anyone could go ahead and take a stab at the cluster Terraform and prowjob updates; @upodroid recently enabled Atlantis for the GCP Terraform, so it should auto-deploy now, I think?

@aojea
Member Author

aojea commented Dec 18, 2024

Are we running any other jobs with nftables enabled yet?

Oh, I almost forgot about it; I can't remember now which version of COS will have the necessary kernel modules: kubernetes/test-infra#32485

EDIT

It will be in COS 113 Build 18244-85-14: https://cloud.google.com/container-optimized-os/docs/release-notes/m113#cos-113-18244-85-14_

What version of COS do we have in our CI now?
