
[Failing Test or Infra] gce-master-scale-performance #126366

Closed
drewhagen opened this issue Jul 25, 2024 · 49 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
sig/testing Categorizes an issue or PR as relevant to SIG Testing.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@drewhagen
Member

drewhagen commented Jul 25, 2024

Which jobs are failing?

sig-release-master-informing

  • gce-master-scale-performance

Which tests are failing?

ClusterLoaderV2 test

Since when has it been failing?

07/23, since commit 77c3859, which traces back to this PR:
#126270
(This may not actually be related; see the comments below.)

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance

Reason for failure (if possible)

That pull request is an automated testing change, and the logs show:

To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
Changing logfiles to be world-readable for download
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
WARNING: 
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
12 skipped lines...
Or, to investigate an IAP tunneling issue:
gcloud compute ssh gce-scale-cluster-minion-group-4-8p26 --project=k8s-infra-e2e-scale-5k-project --zone=us-east1-b --troubleshoot --tunnel-through-iap
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
WARNING: 
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
External IP address was not found; defaulting to using IAP tunneling.
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
Recommendation: To check for possible causes of SSH connectivity issues and get

There may be a cluster infra issue with authentication now (403?). The PR description above does mention auth, so perhaps that change broke this job.

Anything else we need to know?

Prow links:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488

Relevant SIG(s)

/sig testing
/sig test-infra

@drewhagen drewhagen added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 25, 2024
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 25, 2024
@k8s-ci-robot
Contributor

@drewhagen: The label(s) sig/test-infra cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 25, 2024
@drewhagen drewhagen added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Jul 25, 2024
@BenTheElder
Member

I wouldn't expect an e2e test change (77c3859) to result in the cluster not coming up, as in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328; it's probably some other commit.

Note: SIG K8s Infra owns providing the e2e test projects etc., but does not own kubetest(2), kube-up.sh, or the scale tests.

SIG Testing sort of owns the kube-up.sh scripts, since SIG Cluster Lifecycle disowned them but they are still used for most e2e CI. It's not really clearly laid out at the moment.

This looks like it could be an infra change that dropped permissions for --tunnel-through-iap but ... I'm not aware of any such change occurring.
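
For reference, one way to sanity-check whether the IAP tunnel role is still bound in the scale project (a rough sketch; the project name is taken from the logs above, and the exact member to look for would be the CI service account):

# List IAM bindings for the role that --tunnel-through-iap needs.
gcloud projects get-iam-policy k8s-infra-e2e-scale-5k-project \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/iap.tunnelResourceAccessor" \
  --format="table(bindings.role, bindings.members)"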

Scale testing is
/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jul 25, 2024
@BenTheElder
Member

We haven't recently merged any changes to the project's GCP configurations: https://github.com/kubernetes/k8s.io/commits/main/

There are a lot of commits between the pass and the fail, though: 69eee1c...77c3859

@BenTheElder
Member

I don't think it's related, but we have had issues in #123884 for a while now :/

Looks like maybe we have some bandwidth issues in SIG Scalability

@BenTheElder
Member

Also if we compare the last pass:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815069662522642432

to the first fail:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

and click on "test" at the bottom of the page

the job pod used gcr.io/k8s-staging-test-infra/kubekins-e2e:v20240705-131cd74733-master in both cases, so it's not a test-infra image change.

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 26, 2024

Haven't looked into the failure yet, but it seems like the AWS scale tests are succeeding: https://testgrid.k8s.io/sig-scalability-aws
So it seems like it is isolated to Google Cloud?

@hakuna-matatah
Contributor

I see worker nodes coming up fine, but it appears that deployments/csi-gce-pd-controller is unable to come up. This is a GCE-specific controller dependency they have in their test here, IIUC.
In other words, it's failing even before the CL2 load test steps kick off. It could be that the csi-gce-pd-controller deployment is not coming up as expected. Someone from Google probably needs to look at why that is the case.

error: timed out waiting for the condition on deployments/csi-gce-pd-controller
2024/07/25 17:26:51 process.go:155: Step '/home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --experimental-gcp-snapshot-prometheus-disk=true --experimental-prometheus-disk-snapshot-name=ci-kubernetes-e2e-gce-scale-performance-1816519155113791488 --experimental-prometheus-snapshot-to-report-dir=true --nodes=5000 --prometheus-scrape-node-exporter --provider=gce --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/ignore_known_gce_container_restarts.yaml --testoverrides=./testing/overrides/5000_nodes.yaml' finished in 5m3.36374669s
2024/07/25 17:26:51 e2e.go:569: Dumping logs from nodes to GCS directly at path: gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488
2024/07/25 17:26:51 process.go:153: Running: /workspace/log-dump.sh /logs/artifacts gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488
Checking for custom logdump instances, if any
Using gce provider, skipping check for LOG_DUMP_SSH_KEY and LOG_DUMP_SSH_USER
Project: k8s-infra-e2e-scale-5k-project
Network Project: k8s-infra-e2e-scale-5k-project
Zone: us-east1-b
Dumping logs temporarily to '/tmp/tmp.b63TVTDp80/logs'. Will upload to 'gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488' later.
Dumping logs from master locally to '/tmp/tmp.b63TVTDp80/logs'
Trying to find master named 'gce-scale-cluster-master'
Looking for address 'gce-scale-cluster-master-ip'
Looking for address 'gce-scale-cluster-master-internal-ip'
Using master: gce-scale-cluster-master (external IP: 35.196.144.78; internal IP: 10.40.0.2)
Changing logfiles to be world-readable for download
Copying 'kube-apiserver.log kube-apiserver-audit.log kube-scheduler.log cloud-controller-manager.log kube-controller-manager.log etcd.log etcd-events.log glbc.log cluster-autoscaler.log kube-addon-manager.log konnectivity-server.log fluentd.log kubelet.cov cl2-* startupscript.log' from gce-scale-cluster-master
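
A few quick checks one might run to see why that deployment is stuck (a rough sketch; assumes kubeconfig access to the scale cluster, and that the controller lives in the gce-pd-csi-driver namespace alongside the node pods):

# Inspect the PD CSI controller deployment, its pods, and recent events.
kubectl -n gce-pd-csi-driver get deploy,pods -o wide
kubectl -n gce-pd-csi-driver describe deploy csi-gce-pd-controller
kubectl -n gce-pd-csi-driver get events --sort-by=.lastTimestamp | tail -n 20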

cc: @dims

@BenTheElder
Member

cc @mattcary

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

That's true of the most recent run, but in the run before that (the first failed run) I see:

F0723 17:32:55.142143 18130 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

I'm not sure it's definitely the PD CSI controller @hakuna-matatah, and I'm also not convinced it's specific to GCE.

The cluster on AWS does not have an identical config, cloud integrations aside, as I understand it, and most of the scale jobs are still on kube-up.

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

first failed run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

"message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: pods "prometheus-k8s-0" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount "prometheus-k8s" not found",

@mattcary
Contributor

+1, the run from Ben's comment does not seem to be PD CSI related.

@hakuna-matatah
Contributor

I only looked at the latest run, here: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488/build-log.txt

My read that I posted earlier was based on that run ^^^

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 26, 2024

first failed run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

"message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: pods "prometheus-k8s-0" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount "prometheus-k8s" not found",

I didn't look at this test run, but based on the error above it seems like the prometheus stack is unable to come up.
I don't have access to kick off test runs and live-debug the GCE runs.

Parsing the API server/KCM logs and the audit logs might help here, but those artifacts (logs etc.) are uploaded to a GCS bucket I don't have access to.

cc: @BenTheElder @dims

@Vyom-Yadav Vyom-Yadav moved this from FAILING to CRITICAL in CI Signal (SIG Release / Release Team) Jul 26, 2024
@wendy-ha18
Member

cc @kubernetes/release-team-release-signal

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

Parsing the API server/KCM logs and the audit logs might help here, but those artifacts (logs etc.) are uploaded to a GCS bucket I don't have access to.

Hmm, that should be public; isn't it under k8s-infra-scalability-tests-logs?

e.g.

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

= "artifacts" =>

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/

=> get the $ARTIFACTS folder instead of the overall artifacts (... yeah ...) by clicking through to /artifacts

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/artifacts/

=> get master-and-node-logs.link.txt (since scale tests self upload logs to another bucket)

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/artifacts/master-and-node-logs.link.txt

=> contains:

gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Which can then be accessed clunkily by transforming to
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
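
The same lookup can be scripted; a minimal sketch (assumes these buckets stay publicly readable, with the run ID taken from the Prow URL):

# Resolve the scale-test log bucket for a run and print a browsable gcsweb URL.
RUN_ID=1815794460479459328
JOB=ci-kubernetes-e2e-gce-scale-performance
GS_PATH=$(curl -s "https://storage.googleapis.com/kubernetes-jenkins/logs/${JOB}/${RUN_ID}/artifacts/master-and-node-logs.link.txt")
echo "${GS_PATH}" | sed 's|^gs://|https://gcsweb.k8s.io/gcs/|'
# Or list / download the logs directly:
gsutil ls "${GS_PATH}/"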

@BenTheElder
Member

I took a poke earlier, but scale tests are not what I'm usually looking at. It feels like we broke something in 69eee1c...77c3859

@BenTheElder
Member

Maybe one of the feature gate removals?

@hakuna-matatah
Contributor

gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Which can then be accessed clunkily by transforming to
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Ah! Thanks for this ^^^. I got lost after getting to the link here; I thought it would need access, since it shows the bucket name and relative path :D

Also, another important point is that AWS is running version 1.30 while I think GCE is running 1.31.
So the failure mode might also come from the 1.31 code path.

@BenTheElder
Member

Ah! Thanks for this ^^^. I got lost after getting to the link here; I thought it would need access, since it shows the bucket name and relative path :D

Yeah, it's ... confusing. I'd change it, but I'm hesitant to break something else 😅

Maybe we can dump a small HTML file with a clickable link alongside it or something ...

Also, another important point is that AWS is running version 1.30 while I think GCE is running 1.31.

Ohhh that's also surprising, can we start tracking ci/latest builds?

@hakuna-matatah
Contributor

From the KCM logs, I see relevant prometheus pod errors for this test run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

E0723 17:17:50.151411      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.156552      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.162782      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.199891      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"

It is coming from the StatefulSet controller here.
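
Since the error is about the ServiceAccount not being found, one thing worth checking in the dumped audit logs is when monitoring/prometheus-k8s was actually created relative to the pod creation attempts (a rough sketch; assumes the kube-apiserver-audit.log files from the log dump and that jq is available):

# Pull out audit events touching the prometheus-k8s ServiceAccount:
# timestamp, verb, and response code for each request.
grep -h '"resource":"serviceaccounts"' kube-apiserver-audit.log* \
  | grep '"name":"prometheus-k8s"' \
  | jq -r '[.stageTimestamp, .verb, .responseStatus.code // ""] | @tsv'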

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 27, 2024

I see some changes made to that controller in the 1.31 code path; there were 3 commits to that file for 1.31, IIUC.

Not sure if any of those are contributing factors to that error or if it's a red herring, but I don't think I've come across that error from the controller before.

Need to dig into it further to understand what's going on there.

@BenTheElder
Member

Thank you! I switched track to #126401 heading into the weekend ...

We can tag in SIG Apps folks if we think it's a bug in the StatefulSet controller. Let's start by raising it in #sig-apps.

@BenTheElder
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2024
@BenTheElder
Member

kubernetes/test-infra#33140 hopefully fixes the prometheus setup for now (though we might want to revisit why the resource requirements increased ...), thanks!

Working on the rate-limiting issue: we've been discussing options in #sig-k8s-infra, and I'm trying #126448 plus a pull-through cache in the 5k scale project.

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

Hi folks, I'm not sure whether increasing the node size for prometheus is necessary. I've checked two runs that failed on prometheus.

Let's start with 07/27.

From build-log.txt, prometheus was complaining that:

I0727 17:32:01.661424   18026 util.go:101] 1173/1272 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.69.9.4:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-nrmpp service:kube-dns] down}
I0727 17:32:02.135570   18026 util.go:101] 1173/1272 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.69.48.4:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-bfm8h service:kube-dns] down}
...
F0727 17:32:02.168849   18026 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

So prometheus was up, but it was waiting for all targets to become ready and some coredns pods were not coming up. Just to confirm that, we also dump the events for the prometheus pod, and this is what was dumped:

I0727 17:32:02.168807   18026 prometheus.go:498] {
  "metadata": {
    "resourceVersion": "492013"
  },
  "items": [
    {
      "metadata": {
        "name": "prometheus-k8s.17e621634330b9c4",
        "namespace": "monitoring",
        "uid": "ed359d87-1e22-4025-9c64-5364f2e358f6",
        "resourceVersion": "456828",
        "creationTimestamp": "2024-07-27T17:16:57Z",
        "managedFields": [
          {
            "manager": "kube-controller-manager",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2024-07-27T17:16:57Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:count": {},
              "f:firstTimestamp": {},
              "f:involvedObject": {},
              "f:lastTimestamp": {},
              "f:message": {},
              "f:reason": {},
              "f:reportingComponent": {},
              "f:source": {
                "f:component": {}
              },
              "f:type": {}
            }
          }
        ]
      },
      "involvedObject": {
        "kind": "StatefulSet",
        "namespace": "monitoring",
        "name": "prometheus-k8s",
        "uid": "200a5c8f-828f-457d-a2bc-368e935a4ce3",
        "apiVersion": "apps/v1",
        "resourceVersion": "460931"
      },
      "reason": "SuccessfulCreate",
      "message": "create Claim prometheus-k8s-db-prometheus-k8s-0 Pod prometheus-k8s-0 in StatefulSet prometheus-k8s success",
      "source": {
        "component": "statefulset-controller"
      },
      "firstTimestamp": "2024-07-27T17:16:57Z",
      "lastTimestamp": "2024-07-27T17:16:57Z",
      "count": 1,
      "type": "Normal",
      "eventTime": null,
      "reportingComponent": "statefulset-controller",
      "reportingInstance": ""
    },
    {
      "metadata": {
        "name": "prometheus-k8s.17e62163436caee5",
        "namespace": "monitoring",
        "uid": "56bd585d-7351-4be2-80d0-4883fbf893dd",
        "resourceVersion": "456829",
        "creationTimestamp": "2024-07-27T17:16:57Z",
        "managedFields": [
          {
            "manager": "kube-controller-manager",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2024-07-27T17:16:57Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:count": {},
              "f:firstTimestamp": {},
              "f:involvedObject": {},
              "f:lastTimestamp": {},
              "f:message": {},
              "f:reason": {},
              "f:reportingComponent": {},
              "f:source": {
                "f:component": {}
              },
              "f:type": {}
            }
          }
        ]
      },
      "involvedObject": {
        "kind": "StatefulSet",
        "namespace": "monitoring",
        "name": "prometheus-k8s",
        "uid": "200a5c8f-828f-457d-a2bc-368e935a4ce3",
        "apiVersion": "apps/v1",
        "resourceVersion": "460931"
      },
      "reason": "SuccessfulCreate",
      "message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s successful",
      "source": {
        "component": "statefulset-controller"
      },
      "firstTimestamp": "2024-07-27T17:16:57Z",
      "lastTimestamp": "2024-07-27T17:16:57Z",
      "count": 1,
      "type": "Normal",
      "eventTime": null,
      "reportingComponent": "statefulset-controller",
      "reportingInstance": ""
    }
  ]
}

There are no events about the pod being killed for OOM or anything like that. Instead, let's focus on one of the coredns pods mentioned. From the kube-scheduler logs, we can see that it was bound to node gce-scale-cluster-minion-group-kfvx:

I0727 17:10:49.606580      11 schedule_one.go:314] "Successfully bound pod to node" pod="kube-system/coredns-56f647f7-bfm8h" node="gce-scale-cluster-minion-group-kfvx" evaluatedNodes=1164 feasibleNodes=580

Looking at the kubelet logs for this node, we can clearly see that it could not start pods (not only coredns) due to 429 errors from registry.k8s.io:

E0727 17:31:57.560521    1871 pod_workers.go:1301] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"csi-driver-registrar\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": failed to resolve reference \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-node-driver-registrar/manifests/v2.3.0: 429 Too Many Requests\", failed to \"StartContainer\" for \"gce-pd-driver\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": failed to resolve reference \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver/manifests/v1.3.4: 429 Too Many Requests\"]" pod="gce-pd-csi-driver/csi-gce-pd-node-tn4cf" podUID="3d912d83-2a6a-47f1-baca-ae6a60cabff5"

...

E0727 17:32:09.916820    1871 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" image="registry.k8s.io/coredns/coredns:v1.11.1"

...

E0727 17:32:09.916910    1871 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" image="registry.k8s.io/coredns/coredns:v1.11.1"

...

E0727 17:32:09.917161    1871 kuberuntime_manager.go:1272] "Unhandled Error" err="container &Container{Name:coredns,Image:registry.k8s.io/coredns/coredns:v1.11.1,Command:[],Args:[-conf /etc/coredns/Corefile],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:dns,HostPort:0,ContainerPort:53,Protocol:UDP,HostIP:,},ContainerPort{Name:dns-tcp,HostPort:0,ContainerPort:53,Protocol:TCP,HostIP:,},ContainerPort{Name:metrics,HostPort:0,ContainerPort:9153,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{memory: {{314572800 0} {<nil>} 300Mi BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{73400320 0} {<nil>} 70Mi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:config-volume,ReadOnly:true,MountPath:/etc/coredns,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-sbcnf,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:{0 8080 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/ready,Port:{0 8181 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[ALL],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod coredns-56f647f7-bfm8h_kube-system(8fe6b4cc-4ba9-47df-a77e-eb5f340bd21a): ErrImagePull: failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" logger="UnhandledError"

And now let's jump to the run from 07/23.

A similar case: prometheus complaining about targets not being ready:

I0723 17:32:24.688535   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.68.189.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-rzj88 service:kube-dns] down}
I0723 17:32:54.682048   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.66.203.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-qjzgn service:kube-dns] down}
I0723 17:32:55.108466   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.65.160.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-sgs2x service:kube-dns] down}
...
F0723 17:32:55.142143   18130 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

Sample pod bound to node gce-scale-cluster-minion-group-2-k9nl:

I0723 17:10:05.335899      11 schedule_one.go:315] "Successfully bound pod to node" pod="kube-system/coredns-56f647f7-sgs2x" node="gce-scale-cluster-minion-group-2-k9nl" evaluatedNodes=518 feasibleNodes=302

And kubelet logs on this node:

E0723 17:36:23.422838    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\": failed to resolve reference \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests\"" pod="kube-system/coredns-56f647f7-sgs2x" podUID="fd9b48cd-659f-440f-b5d5-6304f6645e91"
...
E0723 17:37:00.771960    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\"\"" pod="kube-system/coredns-56f647f7-sgs2x" podUID="fd9b48cd-659f-440f-b5d5-6304f6645e91"
E0723 17:37:08.767307    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"konnectivity-agent\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\\\"\"" pod="kube-system/konnectivity-agent-8jhjr" podUID="8a6bb329-a6b4-466d-b3ce-68cab68a2a81"
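
For a rough sense of how widespread the throttling was, one could count the registry.k8s.io 429s per node across the dumped kubelet logs (a sketch; the per-instance directory layout is an assumption, adjust the glob as needed):

# Count '429 Too Many Requests' occurrences in each node's kubelet log,
# worst offenders first.
for f in */kubelet.log; do
  printf '%6d %s\n' "$(grep -c '429 Too Many Requests' "$f")" "$f"
done | sort -rn | head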

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

And the latest run (07/29) failed with "timed out waiting for the condition on deployments/csi-gce-pd-controller".

The scheduler bound the pod to a node:

I0729 17:17:21.536131      12 schedule_one.go:314] "Successfully bound pod to node" pod="gce-pd-csi-driver/csi-gce-pd-controller-7d7d9bfcb9-fvff9" node="gce-scale-cluster-minion-group-4-vn54" evaluatedNodes=500 feasibleNodes=500

And from kubelet logs on gce-scale-cluster-minion-group-4-vn54 node:

E0729 17:17:23.216292    1870 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/sig-storage/csi-provisioner:v2.2.1\": failed to resolve reference \"registry.k8s.io/sig-storage/csi-provisioner:v2.2.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-provisioner/manifests/v2.2.1: 429 Too Many Requests" image="registry.k8s.io/sig-storage/csi-provisioner:v2.2.1"

...

E0729 17:24:00.087764    1870 pod_workers.go:1301] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"csi-driver-registrar\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": failed to resolve reference \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-node-driver-registrar/manifests/v2.3.0: 429 Too Many Requests\", failed to \"StartContainer\" for \"gce-pd-driver\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": failed to resolve reference \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver/manifests/v1.3.4: 429 Too Many Requests\"]" pod="gce-pd-csi-driver/csi-gce-pd-node-lc42d" podUID="638ef2fe-2f3b-42b6-9cc1-72a9af4c4ac1"

...

E0729 17:24:16.449244    1870 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19\": failed to resolve reference \"registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19\": unexpected status from HEAD request to https://registry.k8s.io/v2/node-problem-detector/node-problem-detector/manifests/v0.8.19: 429 Too Many Requests" image="registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19"

...

E0729 17:24:25.457238    1870 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\": failed to resolve reference \"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\": unexpected status from HEAD request to https://registry.k8s.io/v2/kas-network-proxy/proxy-agent/manifests/v0.30.3: 429 Too Many Requests" image="registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3"

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

So in my opinion, all the failures are actually root-caused by rate limiting from registry.k8s.io, and we do not need to increase the size of the heapster node for prometheus.

@BenTheElder
Member

Thanks @Argh4k

#126448 plus some infra work will let us configure the 5k jobs with a private pull-through cache for these images (rather than globally raising the limits; we'd hope other large users would do something similar).

Hopefully we'll have this enabled before the next run; just getting some last testing done (on small-scale kube-up clusters).
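
For illustration only (the actual wiring is whatever #126448 and the test-infra changes end up doing), pointing the nodes at a pull-through cache for registry.k8s.io could look roughly like this with containerd's registry host config; the mirror URL is a placeholder:

# Sketch: route registry.k8s.io pulls through a private mirror, keeping the
# upstream as the canonical server. Assumes containerd's CRI registry
# config_path is set to /etc/containerd/certs.d. EXAMPLE_MIRROR_URL is hypothetical.
sudo mkdir -p /etc/containerd/certs.d/registry.k8s.io
sudo tee /etc/containerd/certs.d/registry.k8s.io/hosts.toml <<'EOF'
server = "https://registry.k8s.io"

[host."https://EXAMPLE_MIRROR_URL"]
  capabilities = ["pull", "resolve"]
EOF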

@BenTheElder
Member

kubernetes/test-infra#33149 + #126448 should mitigate this, though we'll have to see once an actual 5k run happens after merging those.

It's possible we may need to request a quota increase on this new pull-through cache when actually faced with 5k-node jobs.
I don't think so, though, given that the default quota is 60,000 reads/minute, which is much higher than the registry.k8s.io rate limits (where the per-user quota we enforce needs to be a subset of our global capacity, and we're paying for egress).

@BenTheElder
Member

There are only two periodic jobs using --gcp-project-type=scalability-scale-project, and both are failing and stopping very early due to these issues.

https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-*

So I've manually kicked off another run of this one with the latest configuration, after the above PRs merged.

That run is https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1818509054998220800

@BenTheElder
Member

BenTheElder commented Jul 31, 2024

We still chewed right through that quota:

(Screenshot: quota usage graph, 2024-07-30)

So I've requested doubling it.

Edit: we hit that again, which isn't breaking anything, but ideally we would have enough capacity versus our usage, so I requested increasing the write and read limits in the relevant region again.

@BenTheElder
Member

We do still have kubernetes/test-infra#29500 but the cluster seems to be up otherwise.

I0731 05:09:07.373496 18091 prometheus.go:299] Prometheus stack set up successfully

Will check back in a bit ...

@Argh4k
Contributor

Argh4k commented Jul 31, 2024

And I can see that the manual run passed. Great work!

@dims
Member

dims commented Jul 31, 2024

/priority critical-urgent
/milestone v1.31
/close

@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 31, 2024
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 31, 2024
@k8s-ci-robot
Contributor

@dims: Closing this issue.

In response to this:

/priority critical-urgent
/milestone v1.31
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@BenTheElder
Member

One more follow-up: kubernetes/test-infra#33172

@BenTheElder
Member

https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-correctness&width=20

Note that this one appears green in testgrid but isn't passing yet (look at the set of test cases). It will hopefully pass on the next run.

There is currently another (scheduled) run of -performance running.

@BenTheElder
Member

BenTheElder commented Jul 31, 2024

We have two green runs of gce-master-scale-performance. I just started ci-kubernetes-e2e-gce-scale-correctness with the latest config to confirm whether we've unbroken that one as well (I didn't see a tracking issue for it yet, but it has been broken and has received the same mitigations).

https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-*

@BenTheElder
Member

For the private scale-test pull-through cache, quota is now at 240,000 reads (4x the default) and 36,000 writes (2x the default), of which we're typically using maybe 60,000-100,000 reads during heavy spikes 🙃 (writes are down for now because the cache is warm).

@BenTheElder
Member

We have a green run of ci-kubernetes-e2e-gce-scale-correctness now:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1818741392688549888

And what looks like a full set of test cases being reported again.
