
Internal error in kubelet summary API #29007

Closed · mwielgus opened this issue Jul 15, 2016 · 12 comments
Labels: area/kubelet, area/kubelet-api, lifecycle/stale, priority/backlog, sig/node

@mwielgus (Contributor):

Happens consistently on one of the Kubernetes 1.2.4 nodes.

```
I0715 13:23:05.094212 3451 handler.go:239] HTTP InternalServerError: Internal Error: unable to find data for container /a8ee18[...]50b318
I0715 13:23:05.094339 3451 server.go:1096] GET /stats/summary/: (6.708488ms) 500
goroutine 634432 [running]:
k8s.io/kubernetes/pkg/httplog.(*respLogger).recordStatus(0xc208939b20, 0x1f4)
 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:214 +0xa6
k8s.io/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc208939b20, 0x1f4)
 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:193 +0x32
github.com/emicklei/go-restful.(*Response).WriteHeader(0xc20915cae0, 0x1f4)
 /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/response.go:191 +0x48
github.com/emicklei/go-restful.(*Response).WriteErrorString(0xc20915cae0, 0x1f4, 0xc208f68100, 0x73, 0x0, 0x0)
 /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/response.go:180 +0x128
k8s.io/kubernetes/pkg/kubelet/server/stats.handleError(0xc20915cae0, 0x7efee4a37040, 0xc2087eea40)
 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:240 +0x29a
k8s.io/kubernetes/pkg/kubelet/server/stats.(*handler).handleSummary(0xc2082b12a0, 0xc2087df260, 0xc20915cae0)
 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:153 +0x78
k8s.io/kubernetes/pkg/kubelet/server/stats.*handler.(k8s.io/kubernetes/pkg/kubelet/server/stats.handleSummary)·fm(0xc2087df260, 0xc20915cae0)
 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/stats/handler.go:70 +0x3b
github.com/emicklei/go-restful.(*Container).dispatch(0xc2082865a0, 0x7efee487d3a8, 0xc208939b20, 0xc208cc60d0)
 /go/src/k8s.io/kubernetes/Godeps/_workspace/src/github.com/emicklei/go-restful/container.go:249 +0xf5e
```
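
The failing endpoint can be queried directly to reproduce the 500. Here is a minimal sketch, assuming the kubelet read-only port (10255 by default on nodes of this era) is reachable; the node IP below is a placeholder:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder node address; substitute a real node IP.
	resp, err := http.Get("http://10.128.0.2:10255/stats/summary")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// On an affected node this prints "500 Internal Server Error" and
	// "Internal Error: unable to find data for container /...".
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```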
@mwielgus mwielgus added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/kubelet area/kubelet-api sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 15, 2016
@mwielgus (Contributor, Author):

This is P0. Any updates?

@mwielgus (Contributor, Author):

cc: @dchen1107 @vishh @davidopp

@matchstick (Contributor):

@yujuhong

@yujuhong (Contributor):

@mwielgus, how often does this happen, and what's your cluster setup (e.g., Docker version, OS distro, cloud provider, etc.)?
Could you share the kubelet.log?

@timstclair:

We've seen this before, and Kubernetes 1.3 includes a partial fix: #25933

I'm not sure if we ever uncovered the underlying issue, but I'm digging into it now.

@mwielgus (Contributor, Author):

This issue was discovered on a client GKE cluster during a hangout debugging session. We asked for kubelet.log, but AFAIK it has not been provided yet. We also asked about the Docker version; it was relatively old, something around 1.9, though I didn't write it down and may be wrong. It was definitely not the current one. I don't know the OS version. I will add you to the Google-internal thread so that you can ask the involved support engineer more detailed questions.

@yujuhong (Contributor):

@mwielgus if it's GKE, the node should be running on the containerVM image with docker 1.9. Let's continue the discussion in the internal thread and see if we can uncover more information. Thanks!

@timstclair:

Digging into this more: if any of the subsystems in the libcontainer cgroup manager fails to collect stats, stats collection is aborted for the whole container. Without looking at the logs I can't say which subsystem is failing, but my guess is that something is causing collection to fail on that container. Previously cAdvisor was not robust to these types of failures, but the latest version, included in 1.3, logs the failure and continues in a best-effort fashion.
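
To make the two behaviors concrete, here is a sketch of strict versus best-effort collection; the subsystem interface and names below are illustrative, not the actual cAdvisor/libcontainer API:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Illustrative stand-in for a cgroup subsystem stats collector.
type subsystem interface {
	Name() string
	GetStats(cgroupPath string, out map[string]uint64) error
}

type fakeSub struct {
	name string
	fail bool
}

func (f fakeSub) Name() string { return f.name }

func (f fakeSub) GetStats(path string, out map[string]uint64) error {
	if f.fail {
		return errors.New("stats file missing")
	}
	out[f.name] = 42 // stand-in metric value
	return nil
}

// Pre-1.3 behavior: the first failing subsystem aborts collection, so the
// container ends up with no stats at all and the summary API returns a 500.
func collectStrict(subs []subsystem, path string) (map[string]uint64, error) {
	out := map[string]uint64{}
	for _, s := range subs {
		if err := s.GetStats(path, out); err != nil {
			return nil, fmt.Errorf("%s: %v", s.Name(), err)
		}
	}
	return out, nil
}

// 1.3-era behavior: log the failure and keep going, returning whatever the
// healthy subsystems produced (best effort).
func collectBestEffort(subs []subsystem, path string) map[string]uint64 {
	out := map[string]uint64{}
	for _, s := range subs {
		if err := s.GetStats(path, out); err != nil {
			log.Printf("failed to collect %s stats for %q: %v", s.Name(), path, err)
		}
	}
	return out
}

func main() {
	subs := []subsystem{fakeSub{"cpu", false}, fakeSub{"memory", true}, fakeSub{"blkio", false}}

	if _, err := collectStrict(subs, "/a8ee..."); err != nil {
		fmt.Println("strict collection failed:", err) // all stats dropped
	}
	fmt.Println("best effort:", collectBestEffort(subs, "/a8ee..."))
}
```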

I'm not sure what the remaining action items are here, other than trying to track down the logs (tracked on the internal thread). Can we close this issue?

@matchstick (Contributor):

@yujuhong @mwielgus This does not feel like a P0 to me; can we move it to P2 if we decide not to close it?

@mwielgus (Contributor, Author) commented Jul 19, 2016:

For me, a customer calling GKE support because an important K8s feature (HPA) is not working at all due to this bug is an indication of a P0 issue.

The main problem is not that some stack trace is written to some log, but that no metrics are available for a running pod, causing various hard-to-debug disruptions across the system (HPA, dashboards, metrics storage, the upcoming kubectl top, and any usage-based scheduling we may have one day). I'm against calling it P2 because we most likely won't ever come back to it, not to mention the GKE customer satisfaction.

cc: @piosz @jszczepkowski @fgrzadkowski

@matchstick (Contributor):

@piosz @jszczepkowski @fgrzadkowski @yujuhong @mwielgus @timstclair

@mwielgus and I communicated offline.

The larger issue of what the customer needs is the P0 problem that has to be addressed. It is not captured well in this issue.

The narrower focus of this issue, that metrics are not being collected for this container, is addressed by Tim's comments.

We are not going to drop the customer issue and are opening another issue to ensure the customer is satisfied. I am not sure this issue is the best place to resolve it. As we learn more we will update this issue. For now we are going to make it a P2 unless it turns out that the customer info causes us to raise the priority.

@matchstick matchstick added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jul 19, 2016
@fejta-bot:

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now, please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2017