
e2e flake: node-problem-detector has container with no ready status in scalability suite #26075

Closed
piosz opened this issue May 23, 2016 · 10 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@piosz
Member

piosz commented May 23, 2016

http://kubekins.dls.corp.google.com/view/Submit%20Queue/job/kubernetes-e2e-gce-scalability/7739
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/7739

I0523 12:41:07.247157    3375 server.go:640] Event(api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"node-problem-detector-v0.1-5tz3a", UID:"625467b6-20e1-11e6-842a-42010af00002", APIVersion:"v1", ResourceVersion:"232", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/node-problem-detector:v0.1\""

It seems this is due to a failure while pulling the image, but I'll let @Random-Liu investigate deeper.

cc @dchen1107

@piosz piosz added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels May 23, 2016
@Random-Liu
Member

This seems like #25277; I saw this in the log:

May 23 05:38:36.201: INFO: At {2016-05-23 05:27:49 -0700 PDT} - event for node-problem-detector-v0.1-5tz3a: {kubelet e2e-scalability-minion-pwnb} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with ErrImagePull: "net/http: request canceled"

@dchen1107
Member

Yes, it's a dupe of #25277.

@dchen1107
Member

I am closing this issue; we should fix the image pulling flakiness separately. Hopefully proposal #25784 gets us to a better situation.

@yujuhong
Contributor

I don't think #25784 will solve problems like this, where image pulling failed at the cluster bringup stage. The node in question tried pulling the image 5 times, and all attempts failed.

OK, I think this is caused by the docker operation timeout that was added to our docker client.

When I set the default context timeout to "12s", the request timed out after exactly 12s. The error "net/http: request canceled" matches what we've seen in the tests since the timeout was added.

```
$ time ./image_puller --image=busybox
Connecting to docker on unix:///var/run/docker.sock...
Pulling image "busybox"...
First phase succeeded
Failed to pull image: net/http: request canceled

real    0m12.080s
user    0m0.014s
sys     0m0.080s
```

At cluster bringup, the registry can be slow because of all the concurrent pull requests from the nodes, which in turn causes pulls to exceed the 2-minute default timeout. The timeout also prevents the kubelet from pulling any image that takes more than 2 minutes to pull. I'll send a PR to disable the timeout for image pulling for now.

p.s.1: @Random-Liu, in my tests the timeout applied to the entire image pulling operation, as opposed to only the initial response you mentioned last time.
p.s.2: The simple image pulling code used for reproducing this issue: https://gist.github.com/yujuhong/fafeb49c7b69d40573feeabdc7789142
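
For illustration, a minimal sketch of the failure mode (this is not the gist above; the 12-second timeout and the use of the modern github.com/docker/docker client API are assumptions for illustration): a context deadline wrapped around the entire pull cancels the in-flight HTTP request once the deadline passes, so the caller sees "net/http: request canceled" rather than a context deadline error.

```go
// pull_timeout_sketch.go - illustrative only; assumes the modern
// github.com/docker/docker/client API, not the engine-api vintage
// discussed in this issue.
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// 12s mirrors the experiment above; the real kubelet default at the
	// time was 2 minutes per docker operation.
	ctx, cancel := context.WithTimeout(context.Background(), 12*time.Second)
	defer cancel()

	rc, err := cli.ImagePull(ctx, "busybox", types.ImagePullOptions{})
	if err != nil {
		fmt.Println("Failed to pull image:", err)
		return
	}
	defer rc.Close()

	// Draining the response body is what drives the pull; the deadline
	// fires while we are still streaming progress messages.
	if _, err := io.Copy(io.Discard, rc); err != nil {
		fmt.Println("Failed to pull image:", err) // e.g. "net/http: request canceled"
	}
}
```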

@Random-Liu
Member

Random-Liu commented May 23, 2016

@yujuhong I checked the engine-api code again, and you are right: the timeout is applied to the whole image pull:
https://github.com/docker/engine-api/blob/master/client/transport/cancellable/cancellable.go#L77-L84

I didn't notice that the client holds the context itself in another goroutine and checks whether the context is Done(). Whenever the context is done, it cancels the request; that's why we get the error "net/http: request canceled".

In the normal case, we should get the timeout error from the Context. But on the image pulling path we didn't check the context, so the "net/http: request canceled" error surfaces instead.
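
A simplified sketch of the pattern in the linked cancellable.go (illustrative only, not the actual engine-api code; the helper name is made up): a watcher goroutine cancels the in-flight request once the context is done, and whether the caller sees context.DeadlineExceeded or "net/http: request canceled" depends on whether the context error is checked.

```go
package cancellable

import (
	"context"
	"net/http"
)

// doCancellable is a hypothetical, simplified version of the pattern
// described above: a helper goroutine watches the context and cancels the
// in-flight request when the context expires.
func doCancellable(ctx context.Context, transport *http.Transport, req *http.Request) (*http.Response, error) {
	done := make(chan struct{})
	defer close(done)

	// Watcher goroutine: abort the request when the context is done.
	go func() {
		select {
		case <-ctx.Done():
			transport.CancelRequest(req)
		case <-done:
		}
	}()

	resp, err := transport.RoundTrip(req)
	if err != nil && ctx.Err() != nil {
		// Checking the context here lets the caller see
		// context.DeadlineExceeded; the image pulling path skipped this
		// check, so the raw "net/http: request canceled" error surfaced.
		return nil, ctx.Err()
	}
	return resp, err
}
```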

@yujuhong
Contributor

In the normal case, we should get the timeout error from the Context. But on the image pulling path we didn't check the context, so the "net/http: request canceled" error surfaces instead.

Yes, I also noticed that :-)

@Random-Liu
Member

Random-Liu commented May 23, 2016

@yujuhong Thanks a lot for finding this. Sorry I didn't catch it before. :(
If so, maybe we should make the image pull timeout configurable, or report image pulling progress so that the original stuck-pull issue surfaces.

@yujuhong
Contributor

@yujuhong Thanks a lot for finding this. Sorry I didn't catch it before. :(

No worries. I only checked because the test failed to pull the same image 5 times with the "request canceled" message :)

If so, maybe we should make the image pull timeout configurable, or report image pulling progress so that the original stuck-pull issue surfaces.

Checking image progress sounds reasonable to me. Maybe we can detect and cancel the request when progress has stalled for a prolonged period of time.
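
A rough sketch of what that stall detection could look like (purely hypothetical names and structure, nothing that exists in the kubelet): remember when the pull's progress stream last changed and cancel the pull once it has been quiet for too long.

```go
package pullwatch

import (
	"bufio"
	"context"
	"io"
	"sync/atomic"
	"time"
)

// watchPullProgress is a hypothetical sketch of the idea above: the caller
// streams docker's JSON progress messages through here; a watchdog cancels
// the pull if the latest message has not changed within stallTimeout.
func watchPullProgress(ctx context.Context, cancel context.CancelFunc, progress io.Reader, stallTimeout time.Duration) error {
	var lastChange atomic.Int64
	lastChange.Store(time.Now().UnixNano())

	// Watchdog: cancel the pull when progress stalls.
	watchCtx, stopWatch := context.WithCancel(ctx)
	defer stopWatch()
	go func() {
		ticker := time.NewTicker(stallTimeout / 4)
		defer ticker.Stop()
		for {
			select {
			case <-watchCtx.Done():
				return
			case <-ticker.C:
				if time.Since(time.Unix(0, lastChange.Load())) > stallTimeout {
					cancel() // aborts the in-flight pull request
					return
				}
			}
		}
	}()

	// Each progress message from the daemon is one line on the response
	// body; any new line counts as progress.
	scanner := bufio.NewScanner(progress)
	last := ""
	for scanner.Scan() {
		if line := scanner.Text(); line != last {
			last = line
			lastChange.Store(time.Now().UnixNano())
		}
	}
	return scanner.Err()
}
```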

@Random-Liu
Member

@yujuhong I'll send a PR to at least properly log the image pulling progress.

@yujuhong
Contributor

SGTM. Thanks!

k8s-github-robot pushed a commit that referenced this issue May 28, 2016
Automatic merge from submit-queue

Kubelet: Periodically reporting image pulling progress in log

Addresses #26075 (comment) and #26122 (comment).

This PR changes kube_docker_client to log pulling progress every *10* seconds. We can't print every progress message into the log, because there are too many, so it reports only the newest progress every 10 seconds to reduce log spam.
If image pulling is too slow or stuck, we'll see the reported progress stay unchanged or change very little over time.

The following is the result if I set the reporting interval to *1* second.
```
I0524 00:53:26.189086  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "17b6a9e179d7: Pulling fs layer "
I0524 00:53:27.189082  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "202e40f8bb3a: Download complete "
I0524 00:53:28.189160  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=>                                                 ] 1.474 MB/48.35 MB"
I0524 00:53:29.189156  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [====>                                              ] 3.931 MB/48.35 MB"
I0524 00:53:30.189089  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=========>                                         ] 8.847 MB/48.35 MB"
I0524 00:53:31.189089  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [==================>                                ] 18.19 MB/48.35 MB"
I0524 00:53:32.189076  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=======================================>           ] 38.34 MB/48.35 MB"
I0524 00:53:33.189106  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [=============>                                     ] 12.78 MB/48.35 MB"
I0524 00:53:34.189067  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================>                ] 33.42 MB/48.35 MB"
I0524 00:53:35.189083  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================================>] 48.35 MB/48.35 MB"
I0524 00:53:35.376667  145099 kube_docker_client.go:254] Finish pulling image "ubuntu:latest": "Status: Downloaded newer image for ubuntu:latest"
```
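
Conceptually the change boils down to decoding docker's JSON progress stream, keeping only the newest message, and logging it on a ticker. A simplified sketch (hypothetical names, not the actual kube_docker_client.go code):

```go
package dockerclient

import (
	"encoding/json"
	"fmt"
	"io"
	"sync"
	"time"
)

// progressMessage holds the fields we care about from docker's JSON
// progress stream (a simplification of docker's jsonmessage.JSONMessage).
type progressMessage struct {
	ID       string `json:"id"`
	Status   string `json:"status"`
	Progress string `json:"progress"`
}

// logPullProgress is an illustrative sketch: decode the pull response body,
// remember only the newest progress message, and log it every interval.
func logPullProgress(image string, body io.Reader, interval time.Duration) error {
	var mu sync.Mutex
	latest := ""

	done := make(chan struct{})
	defer close(done)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				mu.Lock()
				fmt.Printf("Pulling image %q: %q\n", image, latest)
				mu.Unlock()
			}
		}
	}()

	decoder := json.NewDecoder(body)
	for {
		var msg progressMessage
		if err := decoder.Decode(&msg); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		mu.Lock()
		latest = fmt.Sprintf("%s: %s %s", msg.ID, msg.Status, msg.Progress)
		mu.Unlock()
	}
	fmt.Printf("Finish pulling image %q: %q\n", image, latest)
	return nil
}
```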

Ref image pulling related issue #19077.


@yujuhong @dchen1107 
/cc @kubernetes/sig-node