e2e flake: node-problem-detector has container with no ready status in scalability suite #26075
This seems like #25277. I saw this in the log:

Yes, it is a dupe of #25277.

I am closing this issue; we should fix the image pulling flakiness separately. Hopefully proposal #25784 can get us to a better situation.
I don't think #25784 will solve problems like this, where image pulling failed at the cluster bringup stage. The node in question tried pulling the image 5 times and all of them failed.

OK, I think this is caused by the docker operation timeout that was added to our docker client. When I set the default context timeout to "12s", the request timed out after exactly 12s. The error "net/http: request canceled" matches what we've seen in the tests since the timeout was added.

```
$ time ./image_puller --image=busybox
Connecting to docker on unix:///var/run/docker.sock...
Pulling image "busybox"...
First phase succeeded
Failed to pull image: net/http: request canceled

real    0m12.080s
user    0m0.014s
sys     0m0.080s
```

At cluster bringup, the registry can be slow because of all the pulling requests from the nodes, which in turn trips the 2-minute default timeout. This also prevents kubelet from pulling any image that needs more than 2 minutes (the default timeout) to pull. I'll send a PR to disable the timeout for image pulling for now.

p.s.1: @Random-Liu, in my tests the timeout applies to the entire image pulling operation, as opposed to only the initial response you mentioned last time.
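For illustration, here is a minimal Go sketch of the failure mode described above: when a per-operation deadline is attached to the request's context, it covers the entire transfer, so a pull that is still making steady progress gets canceled once the deadline expires. The registry URL and the `slowPull` helper are hypothetical, not the kubelet's actual image puller.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// slowPull stands in for an image pull: a single HTTP request whose response
// body streams for longer than the caller's deadline (URL is hypothetical).
func slowPull(ctx context.Context) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"https://registry.example.com/v2/busybox/blobs/some-layer", nil)
	if err != nil {
		return err
	}
	// Because the context is attached to the request, the deadline covers the
	// entire transfer, not just the initial response.
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // an expired deadline surfaces as a canceled request
	}
	defer resp.Body.Close()
	// ... read the layer data from resp.Body ...
	return nil
}

func main() {
	// A fixed per-operation timeout like the one discussed above: any pull
	// slower than this fails even if it is still making progress.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	if err := slowPull(ctx); err != nil {
		fmt.Println("Failed to pull image:", err)
	}
}
```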
@yujuhong I checked the code of engine-api again, and you are right: the timeout is applied to the whole image pull. I didn't notice that the client holds the context itself in another goroutine and checks whether the context is canceled. For the normal case, we get the timeout error from the context, but for image pulling we didn't check the context, so the raw "net/http: request canceled" error surfaces.
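A minimal sketch of the error-surfacing problem described above (not the actual engine-api or kubelet code): after a failed pull, consulting the context distinguishes "our deadline fired" from other transport errors. `dockerPull` here is a hypothetical stand-in for the client's pull call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pullWithContext wraps a pull call and consults the context after a failure,
// so a tripped deadline is reported as an explicit timeout rather than the
// raw transport error.
func pullWithContext(ctx context.Context, image string,
	dockerPull func(ctx context.Context, image string) error) error {
	err := dockerPull(ctx, image)
	if err == nil {
		return nil
	}
	if ctxErr := ctx.Err(); ctxErr != nil {
		// Our own deadline or cancellation fired; say so explicitly.
		return fmt.Errorf("image pull for %q exceeded the operation timeout: %v", image, ctxErr)
	}
	return err
}

func main() {
	// Short deadline so the demo finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A fake pull that blocks until the context is done and then returns the
	// opaque transport-style error seen in the tests.
	fakePull := func(ctx context.Context, image string) error {
		<-ctx.Done()
		return fmt.Errorf("net/http: request canceled")
	}
	fmt.Println(pullWithContext(ctx, "busybox", fakePull))
}
```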
Yes, I also noticed that :-)

@yujuhong Thanks a lot for finding this. Sorry I didn't catch this before. :(
No worries. I only checked because the test failed to pull the same image 5 times with the "request canceled" message :)

Checking image progress sounds reasonable to me. Maybe we can detect and cancel the request when the progress has stalled for a prolonged period of time.
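A rough sketch of that stall-detection idea, under the assumption that the pull exposes some progress stream; `streamPull` and the channel protocol are hypothetical. The watchdog cancels only when no progress arrives for a while, instead of capping the total pull time.

```go
package imagepull

import (
	"context"
	"fmt"
	"time"
)

// PullWithStallDetection cancels a pull only when no progress has been seen
// for stallTimeout. streamPull is a hypothetical stand-in for a pull that
// sends on progress for every progress update and returns once the pull
// finishes, fails, or observes cancellation (it must stop sending after ctx
// is canceled).
func PullWithStallDetection(image string, stallTimeout time.Duration,
	streamPull func(ctx context.Context, image string, progress chan<- struct{}) error) error {

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	progress := make(chan struct{}, 1)
	done := make(chan error, 1)
	go func() { done <- streamPull(ctx, image, progress) }()

	timer := time.NewTimer(stallTimeout)
	defer timer.Stop()
	for {
		select {
		case <-progress:
			// Still making progress: push the stall deadline out.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(stallTimeout)
		case <-timer.C:
			cancel() // no progress for stallTimeout: give up on this pull
			return fmt.Errorf("pulling image %q stalled for %v", image, stallTimeout)
		case err := <-done:
			return err
		}
	}
}
```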
@yujuhong I'll send a PR to at least properly log the image pulling progress.

SGTM. Thanks!
Automatic merge from submit-queue

Kubelet: Periodically reporting image pulling progress in log

Addresses #26075 (comment) and #26122 (comment).

This PR changes kube_docker_client to log pulling progress every *10* seconds. We can't print all progress messages into the log, because there are too many, so I make it report the newest progress every 10 seconds to reduce log spam. If the image pulling is too slow or stuck, we'll see the image pulling progress stay unchanged or change very little over time.

The following is the result if I set the reporting interval to *1* second:

```
I0524 00:53:26.189086 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "17b6a9e179d7: Pulling fs layer "
I0524 00:53:27.189082 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "202e40f8bb3a: Download complete "
I0524 00:53:28.189160 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=> ] 1.474 MB/48.35 MB"
I0524 00:53:29.189156 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [====> ] 3.931 MB/48.35 MB"
I0524 00:53:30.189089 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=========> ] 8.847 MB/48.35 MB"
I0524 00:53:31.189089 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [==================> ] 18.19 MB/48.35 MB"
I0524 00:53:32.189076 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=======================================> ] 38.34 MB/48.35 MB"
I0524 00:53:33.189106 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [=============> ] 12.78 MB/48.35 MB"
I0524 00:53:34.189067 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================> ] 33.42 MB/48.35 MB"
I0524 00:53:35.189083 145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================================>] 48.35 MB/48.35 MB"
I0524 00:53:35.376667 145099 kube_docker_client.go:254] Finish pulling image "ubuntu:latest": "Status: Downloaded newer image for ubuntu:latest"
```

Ref image pulling related issue #19077.

@yujuhong @dchen1107
/cc @kubernetes/sig-node
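A rough sketch of that periodic-reporting approach (not the actual kube_docker_client implementation; all names here are assumptions): decode docker's JSON pull progress stream, keep only the newest message, and log it on a fixed interval so log volume stays bounded no matter how chatty the pull is.

```go
package imagepull

import (
	"encoding/json"
	"io"
	"log"
	"sync"
	"time"
)

// progressLine holds the fields of docker's JSON pull progress messages that
// matter for logging.
type progressLine struct {
	ID       string `json:"id"`
	Status   string `json:"status"`
	Progress string `json:"progress"`
}

// LogPullProgress decodes the pull's progress stream, remembers only the
// newest message, and logs it once per interval.
func LogPullProgress(image string, pullResponse io.Reader, interval time.Duration) error {
	var (
		mu     sync.Mutex
		latest progressLine
	)
	stop := make(chan struct{})
	defer close(stop)

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				mu.Lock()
				if latest.Status != "" {
					log.Printf("Pulling image %q: %q", image,
						latest.ID+": "+latest.Status+" "+latest.Progress)
				}
				mu.Unlock()
			case <-stop:
				return
			}
		}
	}()

	dec := json.NewDecoder(pullResponse)
	for {
		var line progressLine
		if err := dec.Decode(&line); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		mu.Lock()
		latest = line
		mu.Unlock()
	}
}
```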
http://kubekins.dls.corp.google.com/view/Submit%20Queue/job/kubernetes-e2e-gce-scalability/7739
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/7739
It seems this is due to a failure while pulling an image, but I'll let @Random-Liu investigate deeper.
cc @dchen1107