
[Failing Test or Infra] gce-master-scale-performance #126366

Closed
drewhagen opened this issue Jul 25, 2024 · 49 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
sig/testing Categorizes an issue or PR as relevant to SIG Testing.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@drewhagen
Member

drewhagen commented Jul 25, 2024

Which jobs are failing?

sig-release-master-informing

  • gce-master-scale-performance

Which tests are failing?

ClusterLoaderV2 test

Since when has it been failing?

07/23, since commit 77c3859, which traces back to this PR:
#126270
(This may not actually be related; see the comments below.)

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance

Reason for failure (if possible)

That pull request is an automated testing change, and the logs show:

To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
Changing logfiles to be world-readable for download
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
WARNING: 
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
12 skipped lines...
Or, to investigate an IAP tunneling issue:
gcloud compute ssh gce-scale-cluster-minion-group-4-8p26 --project=k8s-infra-e2e-scale-5k-project --zone=us-east1-b --troubleshoot --tunnel-through-iap
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
WARNING: 
To increase the performance of the tunnel, consider installing NumPy. For instructions,
please see https://cloud.google.com/iap/docs/using-tcp-forwarding#increasing_the_tcp_upload_bandwidth
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
External IP address was not found; defaulting to using IAP tunneling.
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
ERROR: (gcloud.compute.start-iap-tunnel) Error while connecting [4033: 'not authorized'].
Recommendation: To check for possible causes of SSH connectivity issues and get

There may be a cluster infra issue with authentication now (403?). The PR description above does mention auth, so perhaps that change broke this job.

Anything else we need to know?

Prow links:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488

Relevant SIG(s)

/sig testing
/sig test-infra

@drewhagen drewhagen added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Jul 25, 2024
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 25, 2024
@k8s-ci-robot
Contributor

@drewhagen: The label(s) sig/test-infra cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 25, 2024
@drewhagen drewhagen added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Jul 25, 2024
@BenTheElder
Member

I wouldn't expect an e2e test change (77c3859) to result in the cluster not coming up, as in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328; it's probably some other commit.

Note: SIG K8s Infra owns providing the e2e test projects etc., but does not own kubetest(2), kube-up.sh, or the scale tests.

SIG Testing sort of owns the kube-up.sh scripts, since SIG Cluster Lifecycle disowned them but they are still used for most e2e CI. It's not really clearly laid out at the moment.

This looks like it could be an infra change that dropped permissions for --tunnel-through-iap but ... I'm not aware of any such change occurring.
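
For reference, one way to sanity-check whether the IAP tunnel role is still bound in the scale project (a rough sketch; the project name is taken from the logs above, and the exact member to look for would be the CI service account):

# List IAM bindings for the role that --tunnel-through-iap needs.
gcloud projects get-iam-policy k8s-infra-e2e-scale-5k-project \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/iap.tunnelResourceAccessor" \
  --format="table(bindings.role, bindings.members)"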

Scale testing is
/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Jul 25, 2024
@BenTheElder
Member

We haven't recently merged any changes to the project's GCP configurations: https://github.com/kubernetes/k8s.io/commits/main/

There are a lot of commits between the pass and the fail, though: 69eee1c...77c3859

@BenTheElder
Member

I don't think it's related, but we have had issues in #123884 for a while now :/

Looks like maybe we have some bandwidth issues in SIG Scalability

@BenTheElder
Member

Also if we compare the last pass:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815069662522642432

to the first fail:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

and click on "test" at the bottom of the page

the job pod used gcr.io/k8s-staging-test-infra/kubekins-e2e:v20240705-131cd74733-master in both cases, so it's not a test-infra image change.

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 26, 2024

Haven't looked into the failure yet, but it seems like the AWS scale tests are succeeding: https://testgrid.k8s.io/sig-scalability-aws
So it seems like it is isolated to Google Cloud?

@hakuna-matatah
Contributor

I see worker nodes coming up fine, but it appears that deployments/csi-gce-pd-controller is unable to come up. This is a GCE-specific controller dependency they have in their test here, IIUC.
In other words, it's failing even before the CL2 load test steps kick off. It could be that the csi-gce-pd-controller deployment is not coming up as expected. Someone from Google probably needs to look at why that is the case.

error: timed out waiting for the condition on deployments/csi-gce-pd-controller
2024/07/25 17:26:51 process.go:155: Step '/home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --experimental-gcp-snapshot-prometheus-disk=true --experimental-prometheus-disk-snapshot-name=ci-kubernetes-e2e-gce-scale-performance-1816519155113791488 --experimental-prometheus-snapshot-to-report-dir=true --nodes=5000 --prometheus-scrape-node-exporter --provider=gce --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/ignore_known_gce_container_restarts.yaml --testoverrides=./testing/overrides/5000_nodes.yaml' finished in 5m3.36374669s
2024/07/25 17:26:51 e2e.go:569: Dumping logs from nodes to GCS directly at path: gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488
2024/07/25 17:26:51 process.go:153: Running: /workspace/log-dump.sh /logs/artifacts gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488
Checking for custom logdump instances, if any
Using gce provider, skipping check for LOG_DUMP_SSH_KEY and LOG_DUMP_SSH_USER
Project: k8s-infra-e2e-scale-5k-project
Network Project: k8s-infra-e2e-scale-5k-project
Zone: us-east1-b
Dumping logs temporarily to '/tmp/tmp.b63TVTDp80/logs'. Will upload to 'gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488' later.
Dumping logs from master locally to '/tmp/tmp.b63TVTDp80/logs'
Trying to find master named 'gce-scale-cluster-master'
Looking for address 'gce-scale-cluster-master-ip'
Looking for address 'gce-scale-cluster-master-internal-ip'
Using master: gce-scale-cluster-master (external IP: 35.196.144.78; internal IP: 10.40.0.2)
Changing logfiles to be world-readable for download
Copying 'kube-apiserver.log kube-apiserver-audit.log kube-scheduler.log cloud-controller-manager.log kube-controller-manager.log etcd.log etcd-events.log glbc.log cluster-autoscaler.log kube-addon-manager.log konnectivity-server.log fluentd.log kubelet.cov cl2-* startupscript.log' from gce-scale-cluster-master
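
A few quick checks one might run to see why that deployment is stuck (a rough sketch; assumes kubeconfig access to the scale cluster, and that the controller lives in the gce-pd-csi-driver namespace alongside the node pods):

# Inspect the PD CSI controller deployment, its pods, and recent events.
kubectl -n gce-pd-csi-driver get deploy,pods -o wide
kubectl -n gce-pd-csi-driver describe deploy csi-gce-pd-controller
kubectl -n gce-pd-csi-driver get events --sort-by=.lastTimestamp | tail -n 20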

cc: @dims

@BenTheElder
Member

cc @mattcary

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

That's true of the most recent run, but in the run before that (the first failed run) I see:

F0723 17:32:55.142143 18130 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

I'm not sure it's definitely the PD CSI controller @hakuna-matatah, and I'm also not convinced it's specific to GCE.

The cluster on AWS does not have an identical config, cloud integrations aside, as I understand it, and most of the scale jobs are still on kube-up.

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

first failed run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

"message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: pods "prometheus-k8s-0" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount "prometheus-k8s" not found",

@mattcary
Contributor

+1, the run from Ben's comment does not seem to be PD CSI related.

@hakuna-matatah
Contributor

I only looked at the latest run, here: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488/build-log.txt

My read that I posted earlier was based on that run ^^^

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 26, 2024

first failed run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

"message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s failed error: pods "prometheus-k8s-0" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount "prometheus-k8s" not found",

I didn't look at this test run, but based on the error above it seems like the prometheus stack is unable to come up.
I don't have access to kick off test runs and live-debug the GCE runs.

Parsing the API server/KCM logs and the audit logs might help here, but those artifacts (logs etc.) are uploaded to a GCS bucket I don't have access to.

cc: @BenTheElder @dims

@Vyom-Yadav Vyom-Yadav moved this from FAILING to CRITICAL in CI Signal (SIG Release / Release Team) Jul 26, 2024
@wendy-ha18
Member

cc @kubernetes/release-team-release-signal

@BenTheElder
Member

BenTheElder commented Jul 26, 2024

Parsing the API server/KCM logs and the audit logs might help here, but those artifacts (logs etc.) are uploaded to a GCS bucket I don't have access to.

Hmm, that should be public; isn't it under k8s-infra-scalability-tests-logs?

e.g.

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

= "artifacts" =>

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/

=> get the $ARTIFACTS folder instead of the overall artifacts (... yeah ...) by clicking through to /artifacts

https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/artifacts/

=> get master-and-node-logs.link.txt (since scale tests self upload logs to another bucket)

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328/artifacts/master-and-node-logs.link.txt

=> contains:

gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Which can then be accessed clunkily by transforming to
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
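
The same lookup can be scripted; a minimal sketch (assumes these buckets stay publicly readable, with the run ID taken from the Prow URL):

# Resolve the scale-test log bucket for a run and print a browsable gcsweb URL.
RUN_ID=1815794460479459328
JOB=ci-kubernetes-e2e-gce-scale-performance
GS_PATH=$(curl -s "https://storage.googleapis.com/kubernetes-jenkins/logs/${JOB}/${RUN_ID}/artifacts/master-and-node-logs.link.txt")
echo "${GS_PATH}" | sed 's|^gs://|https://gcsweb.k8s.io/gcs/|'
# Or list / download the logs directly:
gsutil ls "${GS_PATH}/"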

@BenTheElder
Member

I took a poke earlier, but scale tests are not what I'm usually looking at. It feels like we broke something in 69eee1c...77c3859

@BenTheElder
Member

Maybe one of the feature gate removals?

@hakuna-matatah
Contributor

gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Which can then be accessed clunkily by transforming to
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

Ah! Thanks for this ^^^. I got lost after getting to the link here; I thought it would need access, since it shows the bucket name and relative path :D

Also, another important point is that AWS is running version 1.30 while I think GCE is running 1.31.
So the failure mode might also come from the 1.31 code path.

@BenTheElder
Member

Ah! Thanks for this ^^^. I got lost after getting to the link here; I thought it would need access, since it shows the bucket name and relative path :D

Yeah, it's ... confusing. I'd change it, but I'm hesitant to break something else 😅

Maybe we can dump a small HTML file with a clickable link alongside it or something ...

Also, another important point is that AWS is running version 1.30 while I think GCE is running 1.31.

Ohhh that's also surprising, can we start tracking ci/latest builds?

@hakuna-matatah
Contributor

From the KCM logs, I see relevant prometheus pod errors for this test run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328

E0723 17:17:50.151411      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.156552      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.162782      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"
E0723 17:17:50.199891      11 stateful_set.go:438] "Unhandled Error" err="error syncing StatefulSet monitoring/prometheus-k8s, requeuing: pods \"prometheus-k8s-0\" is forbidden: error looking up service account monitoring/prometheus-k8s: serviceaccount \"prometheus-k8s\" not found" logger="UnhandledError"

It is coming from the StatefulSet controller here.
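
Since the error is about the ServiceAccount not being found, one thing worth checking in the dumped audit logs is when monitoring/prometheus-k8s was actually created relative to the pod creation attempts (a rough sketch; assumes the kube-apiserver-audit.log files from the log dump and that jq is available):

# Pull out audit events touching the prometheus-k8s ServiceAccount:
# timestamp, verb, and response code for each request.
grep -h '"resource":"serviceaccounts"' kube-apiserver-audit.log* \
  | grep '"name":"prometheus-k8s"' \
  | jq -r '[.stageTimestamp, .verb, .responseStatus.code // ""] | @tsv'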

@hakuna-matatah
Contributor

hakuna-matatah commented Jul 27, 2024

I see some changes made to that controller in the 1.31 code path; there were 3 commits to that file for 1.31, IIUC.

Not sure if any of those are contributing factors to that error or if it's a red herring, but I don't think I've come across that error from the controller before.

Need to dig into it further to understand what's going on there.

@BenTheElder
Member

Thank you! I switched track to #126401 heading into the weekend ...

We can tag in SIG Apps folks if we think it's a bug in the StatefulSet controller. Let's start by raising it in #sig-apps.

@BenTheElder
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2024
@BenTheElder
Member

kubernetes/test-infra#33140 hopefully fixes the prometheus setup for now (though we might want to revisit why the resource requirements increased ...), thanks!

Working on the rate-limiting issue: we've been discussing options in #sig-k8s-infra, and I'm trying #126448 plus a pull-through cache in the 5k scale project.

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

Hi folks, I'm not sure whether increasing the node size for prometheus is necessary. I've checked two runs that failed on prometheus.

Let's start with 07/27.

From build-log.txt, prometheus was complaining that:

I0727 17:32:01.661424   18026 util.go:101] 1173/1272 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.69.9.4:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-nrmpp service:kube-dns] down}
I0727 17:32:02.135570   18026 util.go:101] 1173/1272 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.69.48.4:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-bfm8h service:kube-dns] down}
...
F0727 17:32:02.168849   18026 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

So prometheus was up, but it was waiting for all targets to become ready and some coredns pods were not coming up. Just to confirm that, we also dump the events for the prometheus pod, and this is what was dumped:

I0727 17:32:02.168807   18026 prometheus.go:498] {
  "metadata": {
    "resourceVersion": "492013"
  },
  "items": [
    {
      "metadata": {
        "name": "prometheus-k8s.17e621634330b9c4",
        "namespace": "monitoring",
        "uid": "ed359d87-1e22-4025-9c64-5364f2e358f6",
        "resourceVersion": "456828",
        "creationTimestamp": "2024-07-27T17:16:57Z",
        "managedFields": [
          {
            "manager": "kube-controller-manager",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2024-07-27T17:16:57Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:count": {},
              "f:firstTimestamp": {},
              "f:involvedObject": {},
              "f:lastTimestamp": {},
              "f:message": {},
              "f:reason": {},
              "f:reportingComponent": {},
              "f:source": {
                "f:component": {}
              },
              "f:type": {}
            }
          }
        ]
      },
      "involvedObject": {
        "kind": "StatefulSet",
        "namespace": "monitoring",
        "name": "prometheus-k8s",
        "uid": "200a5c8f-828f-457d-a2bc-368e935a4ce3",
        "apiVersion": "apps/v1",
        "resourceVersion": "460931"
      },
      "reason": "SuccessfulCreate",
      "message": "create Claim prometheus-k8s-db-prometheus-k8s-0 Pod prometheus-k8s-0 in StatefulSet prometheus-k8s success",
      "source": {
        "component": "statefulset-controller"
      },
      "firstTimestamp": "2024-07-27T17:16:57Z",
      "lastTimestamp": "2024-07-27T17:16:57Z",
      "count": 1,
      "type": "Normal",
      "eventTime": null,
      "reportingComponent": "statefulset-controller",
      "reportingInstance": ""
    },
    {
      "metadata": {
        "name": "prometheus-k8s.17e62163436caee5",
        "namespace": "monitoring",
        "uid": "56bd585d-7351-4be2-80d0-4883fbf893dd",
        "resourceVersion": "456829",
        "creationTimestamp": "2024-07-27T17:16:57Z",
        "managedFields": [
          {
            "manager": "kube-controller-manager",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2024-07-27T17:16:57Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:count": {},
              "f:firstTimestamp": {},
              "f:involvedObject": {},
              "f:lastTimestamp": {},
              "f:message": {},
              "f:reason": {},
              "f:reportingComponent": {},
              "f:source": {
                "f:component": {}
              },
              "f:type": {}
            }
          }
        ]
      },
      "involvedObject": {
        "kind": "StatefulSet",
        "namespace": "monitoring",
        "name": "prometheus-k8s",
        "uid": "200a5c8f-828f-457d-a2bc-368e935a4ce3",
        "apiVersion": "apps/v1",
        "resourceVersion": "460931"
      },
      "reason": "SuccessfulCreate",
      "message": "create Pod prometheus-k8s-0 in StatefulSet prometheus-k8s successful",
      "source": {
        "component": "statefulset-controller"
      },
      "firstTimestamp": "2024-07-27T17:16:57Z",
      "lastTimestamp": "2024-07-27T17:16:57Z",
      "count": 1,
      "type": "Normal",
      "eventTime": null,
      "reportingComponent": "statefulset-controller",
      "reportingInstance": ""
    }
  ]
}

There are no events about the pod being killed for OOM or anything like that. Instead, let's focus on one of the coredns pods mentioned. From the kube-scheduler logs, we can see that it was bound to node gce-scale-cluster-minion-group-kfvx:

I0727 17:10:49.606580      11 schedule_one.go:314] "Successfully bound pod to node" pod="kube-system/coredns-56f647f7-bfm8h" node="gce-scale-cluster-minion-group-kfvx" evaluatedNodes=1164 feasibleNodes=580

Looking at the kubelet logs for this node, we can clearly see that it could not start pods (not only coredns) due to 429 errors from registry.k8s.io:

E0727 17:31:57.560521    1871 pod_workers.go:1301] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"csi-driver-registrar\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": failed to resolve reference \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-node-driver-registrar/manifests/v2.3.0: 429 Too Many Requests\", failed to \"StartContainer\" for \"gce-pd-driver\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": failed to resolve reference \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver/manifests/v1.3.4: 429 Too Many Requests\"]" pod="gce-pd-csi-driver/csi-gce-pd-node-tn4cf" podUID="3d912d83-2a6a-47f1-baca-ae6a60cabff5"

...

E0727 17:32:09.916820    1871 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" image="registry.k8s.io/coredns/coredns:v1.11.1"

...

E0727 17:32:09.916910    1871 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" image="registry.k8s.io/coredns/coredns:v1.11.1"

...

E0727 17:32:09.917161    1871 kuberuntime_manager.go:1272] "Unhandled Error" err="container &Container{Name:coredns,Image:registry.k8s.io/coredns/coredns:v1.11.1,Command:[],Args:[-conf /etc/coredns/Corefile],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:dns,HostPort:0,ContainerPort:53,Protocol:UDP,HostIP:,},ContainerPort{Name:dns-tcp,HostPort:0,ContainerPort:53,Protocol:TCP,HostIP:,},ContainerPort{Name:metrics,HostPort:0,ContainerPort:9153,Protocol:TCP,HostIP:,},},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{memory: {{314572800 0} {<nil>} 300Mi BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{73400320 0} {<nil>} 70Mi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:config-volume,ReadOnly:true,MountPath:/etc/coredns,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-sbcnf,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:{0 8080 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/ready,Port:{0 8181 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[ALL],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod coredns-56f647f7-bfm8h_kube-system(8fe6b4cc-4ba9-47df-a77e-eb5f340bd21a): ErrImagePull: failed to pull and unpack image \"registry.k8s.io/coredns/coredns:v1.11.1\": failed to resolve reference \"registry.k8s.io/coredns/coredns:v1.11.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests" logger="UnhandledError"

And now let's jump to the run from 07/23.

A similar case: prometheus complaining about targets not being ready:

I0723 17:32:24.688535   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.68.189.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-rzj88 service:kube-dns] down}
I0723 17:32:54.682048   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.66.203.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-qjzgn service:kube-dns] down}
I0723 17:32:55.108466   18130 util.go:101] 1168/1262 targets are ready, example not ready target: {map[container:coredns endpoint:metrics instance:10.65.160.3:9153 job:kube-dns namespace:kube-system pod:coredns-56f647f7-sgs2x service:kube-dns] down}
...
F0723 17:32:55.142143   18130 clusterloader.go:330] Error while setting up prometheus stack: timed out waiting for the condition

Sample pod bound to node gce-scale-cluster-minion-group-2-k9nl:

I0723 17:10:05.335899      11 schedule_one.go:315] "Successfully bound pod to node" pod="kube-system/coredns-56f647f7-sgs2x" node="gce-scale-cluster-minion-group-2-k9nl" evaluatedNodes=518 feasibleNodes=302

And kubelet logs on this node:

E0723 17:36:23.422838    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\": failed to resolve reference \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/coredns/coredns/manifests/v1.11.1: 429 Too Many Requests\"" pod="kube-system/coredns-56f647f7-sgs2x" podUID="fd9b48cd-659f-440f-b5d5-6304f6645e91"
...
E0723 17:37:00.771960    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"coredns\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.k8s.io/coredns/coredns:v1.11.1\\\"\"" pod="kube-system/coredns-56f647f7-sgs2x" podUID="fd9b48cd-659f-440f-b5d5-6304f6645e91"
E0723 17:37:08.767307    1877 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"konnectivity-agent\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\\\"\"" pod="kube-system/konnectivity-agent-8jhjr" podUID="8a6bb329-a6b4-466d-b3ce-68cab68a2a81"
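
For a rough sense of how widespread the throttling was, one could count the registry.k8s.io 429s per node across the dumped kubelet logs (a sketch; the per-instance directory layout is an assumption, adjust the glob as needed):

# Count '429 Too Many Requests' occurrences in each node's kubelet log,
# worst offenders first.
for f in */kubelet.log; do
  printf '%6d %s\n' "$(grep -c '429 Too Many Requests' "$f")" "$f"
done | sort -rn | head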

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

And the latest run (07/29) failed with "timed out waiting for the condition on deployments/csi-gce-pd-controller".

The scheduler bound the pod to a node:

I0729 17:17:21.536131      12 schedule_one.go:314] "Successfully bound pod to node" pod="gce-pd-csi-driver/csi-gce-pd-controller-7d7d9bfcb9-fvff9" node="gce-scale-cluster-minion-group-4-vn54" evaluatedNodes=500 feasibleNodes=500

And from kubelet logs on gce-scale-cluster-minion-group-4-vn54 node:

E0729 17:17:23.216292    1870 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/sig-storage/csi-provisioner:v2.2.1\": failed to resolve reference \"registry.k8s.io/sig-storage/csi-provisioner:v2.2.1\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-provisioner/manifests/v2.2.1: 429 Too Many Requests" image="registry.k8s.io/sig-storage/csi-provisioner:v2.2.1"

...

E0729 17:24:00.087764    1870 pod_workers.go:1301] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"csi-driver-registrar\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": failed to resolve reference \\\"registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.3.0\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/sig-storage/csi-node-driver-registrar/manifests/v2.3.0: 429 Too Many Requests\", failed to \"StartContainer\" for \"gce-pd-driver\" with ErrImagePull: \"failed to pull and unpack image \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": failed to resolve reference \\\"registry.k8s.io/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver:v1.3.4\\\": unexpected status from HEAD request to https://registry.k8s.io/v2/cloud-provider-gcp/gcp-compute-persistent-disk-csi-driver/manifests/v1.3.4: 429 Too Many Requests\"]" pod="gce-pd-csi-driver/csi-gce-pd-node-lc42d" podUID="638ef2fe-2f3b-42b6-9cc1-72a9af4c4ac1"

...

E0729 17:24:16.449244    1870 kuberuntime_image.go:55] "Failed to pull image" err="failed to pull and unpack image \"registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19\": failed to resolve reference \"registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19\": unexpected status from HEAD request to https://registry.k8s.io/v2/node-problem-detector/node-problem-detector/manifests/v0.8.19: 429 Too Many Requests" image="registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19"

...

E0729 17:24:25.457238    1870 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\": failed to resolve reference \"registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3\": unexpected status from HEAD request to https://registry.k8s.io/v2/kas-network-proxy/proxy-agent/manifests/v0.30.3: 429 Too Many Requests" image="registry.k8s.io/kas-network-proxy/proxy-agent:v0.30.3"

@Argh4k
Contributor

Argh4k commented Jul 30, 2024

So in my opinion, all the failures are actually root-caused by rate limiting from registry.k8s.io, and we do not need to increase the size of the heapster node for prometheus.

@BenTheElder
Member

Thanks @Argh4k

#126448 plus some infra work will let us configure the 5k jobs with a private pull-through cache for these images (rather than globally raising the limits; we'd hope other large users would do something similar).

Hopefully we'll have this enabled before the next run; just getting some last testing done (on small-scale kube-up clusters).
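
For illustration only (the actual wiring is whatever #126448 and the test-infra changes end up doing), pointing the nodes at a pull-through cache for registry.k8s.io could look roughly like this with containerd's registry host config; the mirror URL is a placeholder:

# Sketch: route registry.k8s.io pulls through a private mirror, keeping the
# upstream as the canonical server. Assumes containerd's CRI registry
# config_path is set to /etc/containerd/certs.d. EXAMPLE_MIRROR_URL is hypothetical.
sudo mkdir -p /etc/containerd/certs.d/registry.k8s.io
sudo tee /etc/containerd/certs.d/registry.k8s.io/hosts.toml <<'EOF'
server = "https://registry.k8s.io"

[host."https://EXAMPLE_MIRROR_URL"]
  capabilities = ["pull", "resolve"]
EOF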

@BenTheElder
Member

kubernetes/test-infra#33149 + #126448 should mitigate this, though we'll have to see once an actual 5k run happens after merging those.

It's possible we may need to request a quota increase on this new pull-through cache when actually faced with 5k-node jobs.
I don't think so, though, given that the default quota is 60,000 reads/minute, which is much higher than the registry.k8s.io rate limits (where the per-user quota we enforce needs to be a subset of our global capacity, and we're paying for egress).

@BenTheElder
Member

There are only two periodic jobs using --gcp-project-type=scalability-scale-project, and both are failing and stopping very early due to these issues.

https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-*

So I've manually kicked off another run of this one with the latest configuration, after the above PRs merged.

That run is https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1818509054998220800

@BenTheElder
Member

BenTheElder commented Jul 31, 2024

We still chewed right through that quota:

(Screenshot: quota usage graph, 2024-07-30)

So I've requested doubling it.

Edit: we hit that again, which isn't breaking anything, but ideally we would have enough capacity versus our usage, so I requested increasing the write and read limits in the relevant region again.

@BenTheElder
Member

We do still have kubernetes/test-infra#29500 but the cluster seems to be up otherwise.

I0731 05:09:07.373496 18091 prometheus.go:299] Prometheus stack set up successfully

Will check back in a bit ...

@Argh4k
Contributor

Argh4k commented Jul 31, 2024

And I can see that the manual run passed. Great work!

@dims
Member

dims commented Jul 31, 2024

/priority critical-urgent
/milestone v1.31
/close

@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 31, 2024
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 31, 2024
@k8s-ci-robot
Contributor

@dims: Closing this issue.

In response to this:

/priority critical-urgent
/milestone v1.31
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@BenTheElder
Member

One more follow-up: kubernetes/test-infra#33172

@BenTheElder
Member

https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-correctness&width=20

Note that this one appears green in testgrid but isn't passing yet (look at the set of test cases). It will hopefully pass on the next run.

There is currently another (scheduled) run of -performance running.

@BenTheElder
Member

BenTheElder commented Jul 31, 2024

We have two green runs of gce-master-scale-performance. I just started ci-kubernetes-e2e-gce-scale-correctness with the latest config to confirm whether we've unbroken that one as well (I didn't see a tracking issue for it yet, but it has been broken and has received the same mitigations).

https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-*

@BenTheElder
Member

For the private scale-test pull-through cache, quota is now at 240,000 reads (4x the default) and 36,000 writes (2x the default), of which we're typically using maybe 60,000-100,000 reads during heavy spikes 🙃 (writes are down for now because the cache is warm).

@BenTheElder
Member

We have a green run of ci-kubernetes-e2e-gce-scale-correctness now:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1818741392688549888

And what looks like a full set of test cases being reported again.
