
e2e flake: node-problem-detector has container with no ready status in scalability suite #26075

Closed
piosz opened this issue May 23, 2016 · 10 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@piosz
Member

piosz commented May 23, 2016

http://kubekins.dls.corp.google.com/view/Submit%20Queue/job/kubernetes-e2e-gce-scalability/7739
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/7739

I0523 12:41:07.247157    3375 server.go:640] Event(api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"node-problem-detector-v0.1-5tz3a", UID:"625467b6-20e1-11e6-842a-42010af00002", APIVersion:"v1", ResourceVersion:"232", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with ImagePullBackOff: "Back-off pulling image \"gcr.io/google_containers/node-problem-detector:v0.1\""

It seems this is due to a failure while pulling the image, but I'll let @Random-Liu investigate deeper.

cc @dchen1107

@piosz piosz added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels May 23, 2016
@Random-Liu
Member

This seems like #25277; I saw this in the log:

May 23 05:38:36.201: INFO: At {2016-05-23 05:27:49 -0700 PDT} - event for node-problem-detector-v0.1-5tz3a: {kubelet e2e-scalability-minion-pwnb} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with ErrImagePull: "net/http: request canceled"

@dchen1107
Member

Yes, it's a dupe of #25277.

@dchen1107
Member

I am closing this issue; we should fix the image pulling flakiness separately. Hopefully proposal #25784 gets us to a better situation.

@yujuhong
Contributor

I don't think #25784 will solve problems like this, where image pulling failed at the cluster bringup stage. The node in question tried pulling the image 5 times, and all attempts failed.

OK, I think this is caused by the docker operation timeout that was added to our docker client.

When I set the default context timeout to "12s", the request timed out after exactly 12s. The error "net/http: request canceled" matches what we've seen in the tests since the timeout was added.

```
$ time ./image_puller --image=busybox
Connecting to docker on unix:///var/run/docker.sock...
Pulling image "busybox"...
First phase succeeded
Failed to pull image: net/http: request canceled

real    0m12.080s
user    0m0.014s
sys     0m0.080s
```

At cluster bringup, the registry can be slow because of all the concurrent pull requests from the nodes, which in turn causes pulls to exceed the 2-minute default timeout. The timeout also prevents the kubelet from pulling any image that takes more than 2 minutes to pull. I'll send a PR to disable the timeout for image pulling for now.

p.s.1: @Random-Liu, in my tests the timeout applied to the entire image pulling operation, as opposed to only the initial response you mentioned last time.
p.s.2: The simple image pulling code used for reproducing this issue: https://gist.github.com/yujuhong/fafeb49c7b69d40573feeabdc7789142
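
For illustration, a minimal sketch of the failure mode (this is not the gist above; the 12-second timeout and the use of the modern github.com/docker/docker client API are assumptions for illustration): a context deadline wrapped around the entire pull cancels the in-flight HTTP request once the deadline passes, so the caller sees "net/http: request canceled" rather than a context deadline error.

```go
// pull_timeout_sketch.go - illustrative only; assumes the modern
// github.com/docker/docker/client API, not the engine-api vintage
// discussed in this issue.
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// 12s mirrors the experiment above; the real kubelet default at the
	// time was 2 minutes per docker operation.
	ctx, cancel := context.WithTimeout(context.Background(), 12*time.Second)
	defer cancel()

	rc, err := cli.ImagePull(ctx, "busybox", types.ImagePullOptions{})
	if err != nil {
		fmt.Println("Failed to pull image:", err)
		return
	}
	defer rc.Close()

	// Draining the response body is what drives the pull; the deadline
	// fires while we are still streaming progress messages.
	if _, err := io.Copy(io.Discard, rc); err != nil {
		fmt.Println("Failed to pull image:", err) // e.g. "net/http: request canceled"
	}
}
```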

@Random-Liu
Member

Random-Liu commented May 23, 2016

@yujuhong I checked the engine-api code again, and you are right: the timeout is applied to the whole image pull:
https://github.com/docker/engine-api/blob/master/client/transport/cancellable/cancellable.go#L77-L84

I didn't notice that the client holds the context itself in another goroutine and checks whether the context is Done(). Whenever the context is done, it cancels the request; that's why we get the error "net/http: request canceled".

In the normal case, we should get the timeout error from the Context. But on the image pulling path we didn't check the context, so the "net/http: request canceled" error surfaces instead.
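
A simplified sketch of the pattern in the linked cancellable.go (illustrative only, not the actual engine-api code; the helper name is made up): a watcher goroutine cancels the in-flight request once the context is done, and whether the caller sees context.DeadlineExceeded or "net/http: request canceled" depends on whether the context error is checked.

```go
package cancellable

import (
	"context"
	"net/http"
)

// doCancellable is a hypothetical, simplified version of the pattern
// described above: a helper goroutine watches the context and cancels the
// in-flight request when the context expires.
func doCancellable(ctx context.Context, transport *http.Transport, req *http.Request) (*http.Response, error) {
	done := make(chan struct{})
	defer close(done)

	// Watcher goroutine: abort the request when the context is done.
	go func() {
		select {
		case <-ctx.Done():
			transport.CancelRequest(req)
		case <-done:
		}
	}()

	resp, err := transport.RoundTrip(req)
	if err != nil && ctx.Err() != nil {
		// Checking the context here lets the caller see
		// context.DeadlineExceeded; the image pulling path skipped this
		// check, so the raw "net/http: request canceled" error surfaced.
		return nil, ctx.Err()
	}
	return resp, err
}
```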

@yujuhong
Contributor

In the normal case, we should get the timeout error from the Context. But on the image pulling path we didn't check the context, so the "net/http: request canceled" error surfaces instead.

Yes, I also noticed that :-)

@Random-Liu
Member

Random-Liu commented May 23, 2016

@yujuhong Thanks a lot for finding this. Sorry I didn't catch it before. :(
If so, maybe we should make the image pull timeout configurable, or report image pulling progress so that the original stuck-pull issue surfaces.

@yujuhong
Contributor

@yujuhong Thanks a lot for finding this. Sorry I didn't catch it before. :(

No worries. I only checked because the test failed to pull the same image 5 times with the "request canceled" message :)

If so, maybe we should make the image pull timeout configurable, or report image pulling progress so that the original stuck-pull issue surfaces.

Checking image progress sounds reasonable to me. Maybe we can detect and cancel the request when progress has stalled for a prolonged period of time.
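
A rough sketch of what that stall detection could look like (purely hypothetical names and structure, nothing that exists in the kubelet): remember when the pull's progress stream last changed and cancel the pull once it has been quiet for too long.

```go
package pullwatch

import (
	"bufio"
	"context"
	"io"
	"sync/atomic"
	"time"
)

// watchPullProgress is a hypothetical sketch of the idea above: the caller
// streams docker's JSON progress messages through here; a watchdog cancels
// the pull if the latest message has not changed within stallTimeout.
func watchPullProgress(ctx context.Context, cancel context.CancelFunc, progress io.Reader, stallTimeout time.Duration) error {
	var lastChange atomic.Int64
	lastChange.Store(time.Now().UnixNano())

	// Watchdog: cancel the pull when progress stalls.
	watchCtx, stopWatch := context.WithCancel(ctx)
	defer stopWatch()
	go func() {
		ticker := time.NewTicker(stallTimeout / 4)
		defer ticker.Stop()
		for {
			select {
			case <-watchCtx.Done():
				return
			case <-ticker.C:
				if time.Since(time.Unix(0, lastChange.Load())) > stallTimeout {
					cancel() // aborts the in-flight pull request
					return
				}
			}
		}
	}()

	// Each progress message from the daemon is one line on the response
	// body; any new line counts as progress.
	scanner := bufio.NewScanner(progress)
	last := ""
	for scanner.Scan() {
		if line := scanner.Text(); line != last {
			last = line
			lastChange.Store(time.Now().UnixNano())
		}
	}
	return scanner.Err()
}
```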

@Random-Liu
Member

@yujuhong I'll send a PR to at least properly log the image pulling progress.

@yujuhong
Contributor

SGTM. Thanks!

k8s-github-robot pushed a commit that referenced this issue May 28, 2016
Automatic merge from submit-queue

Kubelet: Periodically reporting image pulling progress in log

Addresses #26075 (comment) and #26122 (comment).

This PR changes kube_docker_client to log pulling progress every *10* seconds. We can't print every progress message into the log, because there are too many, so it reports only the newest progress every 10 seconds to reduce log spam.
If image pulling is too slow or stuck, we'll see the reported progress stay unchanged or change very little over time.

The following is the result if I set the reporting interval to *1* second.
```
I0524 00:53:26.189086  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "17b6a9e179d7: Pulling fs layer "
I0524 00:53:27.189082  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "202e40f8bb3a: Download complete "
I0524 00:53:28.189160  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=>                                                 ] 1.474 MB/48.35 MB"
I0524 00:53:29.189156  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [====>                                              ] 3.931 MB/48.35 MB"
I0524 00:53:30.189089  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=========>                                         ] 8.847 MB/48.35 MB"
I0524 00:53:31.189089  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [==================>                                ] 18.19 MB/48.35 MB"
I0524 00:53:32.189076  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Downloading [=======================================>           ] 38.34 MB/48.35 MB"
I0524 00:53:33.189106  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [=============>                                     ] 12.78 MB/48.35 MB"
I0524 00:53:34.189067  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================>                ] 33.42 MB/48.35 MB"
I0524 00:53:35.189083  145099 kube_docker_client.go:252] Pulling image "ubuntu:latest": "487bffc61de6: Extracting [==================================================>] 48.35 MB/48.35 MB"
I0524 00:53:35.376667  145099 kube_docker_client.go:254] Finish pulling image "ubuntu:latest": "Status: Downloaded newer image for ubuntu:latest"
```
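
Conceptually the change boils down to decoding docker's JSON progress stream, keeping only the newest message, and logging it on a ticker. A simplified sketch (hypothetical names, not the actual kube_docker_client.go code):

```go
package dockerclient

import (
	"encoding/json"
	"fmt"
	"io"
	"sync"
	"time"
)

// progressMessage holds the fields we care about from docker's JSON
// progress stream (a simplification of docker's jsonmessage.JSONMessage).
type progressMessage struct {
	ID       string `json:"id"`
	Status   string `json:"status"`
	Progress string `json:"progress"`
}

// logPullProgress is an illustrative sketch: decode the pull response body,
// remember only the newest progress message, and log it every interval.
func logPullProgress(image string, body io.Reader, interval time.Duration) error {
	var mu sync.Mutex
	latest := ""

	done := make(chan struct{})
	defer close(done)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				mu.Lock()
				fmt.Printf("Pulling image %q: %q\n", image, latest)
				mu.Unlock()
			}
		}
	}()

	decoder := json.NewDecoder(body)
	for {
		var msg progressMessage
		if err := decoder.Decode(&msg); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		mu.Lock()
		latest = fmt.Sprintf("%s: %s %s", msg.ID, msg.Status, msg.Progress)
		mu.Unlock()
	}
	fmt.Printf("Finish pulling image %q: %q\n", image, latest)
	return nil
}
```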

Ref image pulling related issue #19077.


@yujuhong @dchen1107 
/cc @kubernetes/sig-node