Internal error in kubelet summary API #29007
Comments
This is P0. Any updates?
@mwielgus, how often does this happen, and what's your cluster setup (e.g., Docker version, OS distro, cloud provider, etc.)?
We've seen this before, and Kubernetes 1.3 includes a partial fix: #25933. I'm not sure we ever uncovered the underlying issue, but I'm digging into it now.
This issue was discovered on a client GKE cluster during a hangout debugging session. We asked for kubelet.log, but AFAIK it has not been provided yet. We also asked about the Docker version; it was relatively old, something along the lines of 1.9, but unfortunately I didn't write it down and could be wrong. It was definitely not the current one. I don't know the OS version. I will add you to the Google internal thread so that you can ask the involved support engineer more detailed questions.
@mwielgus if it's GKE, the node should be running the ContainerVM image with Docker 1.9. Let's continue the discussion in the internal thread and see if we can uncover more information. Thanks!
Digging into this more: if any of the subsystems in the libcontainer cgroup manager fails to collect stats, stats collection is aborted for the entire container. Without the logs I can't say which subsystem is failing, but my guess is that something is causing collection to fail for that container. Previously, cAdvisor was not robust to these types of failures, but the latest version included in 1.3 logs the failure and continues in a best-effort fashion. I'm not sure what the remaining action items are here, other than tracking down the logs (tracked on the internal thread). Can we close this issue?
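To make the two failure modes concrete, here is a minimal Go sketch of abort-on-error versus log-and-continue collection. The Subsystem interface and Stats type are simplified stand-ins for illustration, not libcontainer's actual API.

```go
package stats

import (
	"fmt"
	"log"
)

// Stats holds per-subsystem metrics for one container. The field set
// here is illustrative, not the real libcontainer structure.
type Stats struct {
	Values map[string]uint64
}

// Subsystem is a simplified stand-in for one cgroup controller
// (cpu, memory, blkio, ...).
type Subsystem interface {
	Name() string
	GetStats(cgroupPath string, stats *Stats) error
}

// collectAbortOnError models the old behavior: the first failing
// subsystem aborts stats collection for the whole container, so one
// broken controller means no metrics at all for that container.
func collectAbortOnError(subs []Subsystem, path string) (*Stats, error) {
	stats := &Stats{Values: map[string]uint64{}}
	for _, s := range subs {
		if err := s.GetStats(path, stats); err != nil {
			return nil, fmt.Errorf("%s: %v", s.Name(), err)
		}
	}
	return stats, nil
}

// collectBestEffort models the 1.3-era behavior described above:
// log each failure and keep whatever the other subsystems returned.
func collectBestEffort(subs []Subsystem, path string) *Stats {
	stats := &Stats{Values: map[string]uint64{}}
	for _, s := range subs {
		if err := s.GetStats(path, stats); err != nil {
			log.Printf("stats for %s failed on %s, continuing: %v",
				s.Name(), path, err)
		}
	}
	return stats
}
```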
For me, a customer calling GKE support because an important K8s feature (HPA) is not working at all due to this bug is an indication of a P0 issue. The main problem is not that some stack trace is written to some log, but that no metrics are available for a running pod, causing various hard-to-debug disruptions across the system (HPA, dashboards, metrics storage, the upcoming kubectl top, and any usage-based scheduling we may have one day). I'm against calling it P2, because we most likely won't ever come back to it. Not to mention GKE customer satisfaction.
@piosz @jszczepkowski @fgrzadkowski @yujuhong @timstclair: @mwielgus and I communicated offline. The larger issue of what the customer needs is the P0 problem that has to be addressed, and it is not captured well in this issue. The narrower focus here, the fact that metrics are not being collected, is addressed by Tim's comments. We are not going to drop the customer issue, and we are opening another issue to ensure the customer is satisfied; I am not sure this issue is the best place to resolve it. As we learn more, we will update this issue. For now we are going to make it a P2, unless it turns out that the customer info causes us to raise the priority.
Issues go stale after 30d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra, and/or @fejta.
Happens consistently on one of the Kubernetes 1.2.4 nodes.
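For anyone who wants to check a node for this directly, here is a minimal Go sketch that queries the kubelet summary endpoint. It assumes the read-only port 10255 and the /stats/summary path that kubelets of this era exposed by default, and the Summary struct mirrors only a tiny slice of the real response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Summary mirrors a small slice of the kubelet summary API response;
// the full stats.Summary type has many more fields.
type Summary struct {
	Node struct {
		NodeName string `json:"nodeName"`
	} `json:"node"`
	Pods []struct {
		PodRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"podRef"`
	} `json:"pods"`
}

func main() {
	// The kubelet's read-only port; adjust host/port for your cluster.
	resp, err := http.Get("http://localhost:10255/stats/summary")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A non-200 response body containing "Internal Error: ..." is the
	// symptom this issue is about.
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("summary API returned %s", resp.Status))
	}

	var s Summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	fmt.Printf("node %s reports stats for %d pods\n",
		s.Node.NodeName, len(s.Pods))
}
```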