[Failing Test or Infra] gce-master-scale-performance #126366
I wouldn't expect an e2e test change (77c3859) to result in the cluster not coming up like https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328; it's probably some other commit. Note: SIG k8s-infra owns providing the e2e test projects etc. but does not own kubetest(2), kube-up.sh, or the scale tests. SIG Testing sort of owns the kube-up.sh scripts since SIG Cluster Lifecycle disowned them, but they are still used for most e2e CI. It's not really clearly laid out at the moment. This looks like it could be an infra change that dropped permissions for scale testing.
We haven't recently merged any changes to the project's GCP configurations: https://github.com/kubernetes/k8s.io/commits/main/. There are a lot of commits between the pass and the fail though: 69eee1c...77c3859
I don't think it's related, but we have had issues in #123884 for a while now :/ Looks like maybe we have some bandwidth issues in SIG Scalability.
Also, if we compare the last pass: to the first fail: and click on "test" at the bottom of the page, the job pod used
Haven't looked into the failure yet, but it seems like the AWS scale tests are succeeding: https://testgrid.k8s.io/sig-scalability-aws
I see worker nodes coming up fine, but it appears that
cc: @dims
cc @mattcary
That's true in the most recent run, but in the run before that (the first failed run) I see:
I'm not sure it's definitely the PD-CSI controller @hakuna-matatah, and I'm also not convinced it's only GCE-specific. The cluster on AWS does not have identical config, cloud integrations aside, as I understand it, and most of the scale jobs are still on kube-up.
first failed run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
+1, the run from Ben's comment does not seem to be pdcsi-related.
I only looked at the latest run here - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488/build-log.txt - and the read I posted earlier was based on that run ^^^
I didn't look at this test run, but based on the error above it seems like the prom-stack was unable to come up. Parsing the APIServer/KCM logs and audit logs might help here. Those artifacts (logs etc.) are uploaded to a GCS bucket which I don't have access to. cc: @BenTheElder @dims
cc @kubernetes/release-team-release-signal
Hmm, that should be public, isn't it under k8s-infra-scalability-tests-logs? e.g. "artifacts" => get the $ARTIFACTS folder instead of the overall artifacts (... yeah ...); clicking through to /artifacts => get master-and-node-logs.link.txt (since the scale tests self-upload logs to another bucket) => contains:
Which can then be accessed, clunkily, by transforming to
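(For reference, a minimal sketch of the kind of transformation being described, assuming the .link.txt file holds a plain gs://<bucket>/<path> URL and that the bucket allows public reads; the example path below is hypothetical.)

```python
def gcs_to_http(gs_url: str) -> str:
    """Turn a gs://bucket/path URL into a browser-accessible storage.googleapis.com URL."""
    if not gs_url.startswith("gs://"):
        raise ValueError(f"not a gs:// URL: {gs_url}")
    return "https://storage.googleapis.com/" + gs_url[len("gs://"):]

# Hypothetical example path, for illustration only:
print(gcs_to_http("gs://k8s-infra-scalability-tests-logs/some-run/master-and-node-logs.tar.gz"))
```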
I took a poke earlier, but scale tests are not what I'm usually looking at. It feels like we broke something in 69eee1c...77c3859.
Maybe one of the feature gate removals? |
> Which can then be accessed clunkily by transforming to
Ah! Thanks for this ^^^, I got lost after getting to the link here - thought that would need access as it shows the bucket name and relative path :D Also, another important point is that AWS is running on
Yeah, it's ... confusing, I'd change it but I'm hesitant to break something else 😅 Maybe we can dump a small HTML file with a clickable link alongside it or something ...
Ohhh, that's also surprising, can we start tracking
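(As a rough illustration of the "dump a small HTML file with a clickable link alongside it" idea above; the file layout and names here are hypothetical, not how the upload currently works.)

```python
from pathlib import Path

def write_clickable_link(link_txt: Path) -> Path:
    """Write a tiny HTML file next to master-and-node-logs.link.txt with a clickable link."""
    gs_url = link_txt.read_text().strip()  # assumed to contain a gs://bucket/path URL
    http_url = "https://storage.googleapis.com/" + gs_url.removeprefix("gs://")
    html = f'<html><body><a href="{http_url}">{http_url}</a></body></html>\n'
    out = link_txt.with_suffix(".html")    # e.g. master-and-node-logs.link.html
    out.write_text(html)
    return out
```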
From the KCM logs I see relevant
It is coming from the statefulset controller here.
I see some changes made to that controller in the 1.31 code path; not sure if any of those could be related or if this is a red herring, but I haven't come across this error from the controller before. I see there were 3 commits to that file, iiuc, for 1.31. Not sure if any of these are contributing factors to that error. Need to dig into it further to understand what's going on there.
Thank you! Switching track to #126401 going into the weekend ... We can tag in SIG Apps folks if we think it's a bug in the statefulset controller. Let's start by raising it in #sig-apps.
/triage accepted
kubernetes/test-infra#33140 hopefully fixes the prometheus setup for now (though we might want to revisit why the resource requirements increased ...), thanks! Working on the rate limiting issue: we've been discussing options in #sig-k8s-infra, and I'm trying #126448 + a pull-through cache in the 5k scale project.
Hi folks, I'm not sure whether increasing the node size for prometheus is necessary. I've checked two runs which failed on prometheus:
Let's start with 07/27. From build-log.txt, prometheus was complaining that:
So prometheus was up, but it was waiting for all targets to become ready, and some coredns pods were not coming up. Just to confirm that, we also dump the events for the prometheus pod, and this is what was dumped:
There are no events about the pod being killed for OOM or anything like that. Instead, let's focus on one of the coredns pods mentioned. From the kube-scheduler logs we can see that it was bound to node
Looking at the kubelet logs for this node, we can clearly see that it could not start pods (not only coredns) due to 429 errors from registry.k8s.io:
And now let's jump to the run from 07/23. A similar case: prometheus complaining about targets not ready:
A sample pod bound to node
And kubelet logs on this node:
And the latest run (07/29) that failed with:
The scheduler bound the pod to node:
And from the kubelet logs on node gce-scale-cluster-minion-group-4-vn54:
So in my opinion all of the failures are actually root-caused by rate limiting from registry.k8s.io, and we do not need to increase the size of the heapster node for prometheus.
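(For anyone repeating this kind of log analysis, a rough sketch of the check described above, run against a locally downloaded kubelet log; the exact error text kubelet emits varies by version, so the pattern matched here is an assumption.)

```python
import collections
import re
import sys

def count_429_pulls(log_path: str) -> collections.Counter:
    """Count log lines that mention a registry.k8s.io image together with a 429 / rate-limit error."""
    counts = collections.Counter()
    pattern = re.compile(r'(registry\.k8s\.io/[^\s",]+).*(429|too many requests)', re.IGNORECASE)
    with open(log_path, errors="replace") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for image, n in count_429_pulls(sys.argv[1]).most_common():
        print(f"{n:6d}  {image}")
```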
Thanks @Argh4k! #126448 + some infra work will let us configure the 5k jobs with a private pull-through cache for these images (rather than globally raising the limits; we'd hope other large users would do similar). Hopefully we'll have this enabled before the next run; just getting some last testing ready (on small-scale kube-up clusters).
kubernetes/test-infra#33149 + #126448 should mitigate this, though we'll have to see when an actual 5k run happens after merging those. It's possible we may need to request a quota increase on this new pull-through cache when actually faced with 5k-node jobs.
There are only two periodic jobs matching https://prow.k8s.io/?job=ci-kubernetes-e2e-gce-scale-*, so I've manually kicked off another run of this one with the latest configuration, after the above PRs merged.
We do still have kubernetes/test-infra#29500 but the cluster seems to be up otherwise.
Will check back in a bit ...
And I can see that the manual run passed. Great work!
for reference:
Thanks a ton @BenTheElder for fixing it, and @hakuna-matatah / @Argh4k for digging deeper into the logs.
/priority critical-urgent
@dims: Closing this issue.
One more follow-up: kubernetes/test-infra#33172
Note that this one appears green in testgrid but isn't passing yet (look at the set of test cases). It will hopefully pass on the next run. There is currently another (scheduled) run of -performance running.
We have two green runs of gce-master-scale-performance. I just started ci-kubernetes-e2e-gce-scale-correctness with the latest config to confirm whether we've unbroken that one or not (I didn't see a tracking issue for that one yet, but it has been broken and received the same mitigations).
For the private scale tests pull-through cache, quota is now at 240,000 reads (4x default) and 36,000 writes (2x default), of which we're typically using maybe 60-100,000 reads during heavy spikes 🙃 (writes are down for now because the cache is warm).
We have a green run of it, and what looks like a full set of test cases being reported again.
Which jobs are failing?
sig-release-master-informing
Which tests are failing?
ClusterLoaderV2 test
Since when has it been failing?
07/23, since commit 77c3859 tracing back to this PR:
#126270
(Actually may not be related - see comments)
Testgrid link
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance
Reason for failure (if possible)
That Pull Request is an automated testing change, and in the logs it looks like:
There may be a cluster infra issue with authentication now. (403?) The PR description above does mention auth, so perhaps that change broke this job.
Anything else we need to know?
Prow links:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1815794460479459328
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1816519155113791488
Relevant SIG(s)
/sig testing
/sig test-infra