Internal error in kubelet summary API #29007
Comments
This is P0. Any updates?
@mwielgus, how often does this happen, and what's your cluster setup (e.g., Docker version, OS distro, cloud provider, etc.)?
We've seen this before, and Kubernetes 1.3 includes a partial fix: #25933. I'm not sure we ever uncovered the underlying issue, but I'm digging into it now.
This issue was discovered on a client GKE cluster during a hangout debugging session. We asked for kubelet.log, but AFAIK it has not been provided yet. We also asked about the Docker version; it was relatively old, something along the lines of 1.9, but unfortunately I didn't write it down and could be wrong. It was definitely not the current one. I don't know the OS version. I will add you to the Google internal thread so that you can ask the involved support engineer more detailed questions.
@mwielgus if it's GKE, the node should be running the ContainerVM image with Docker 1.9. Let's continue the discussion in the internal thread and see if we can uncover more information. Thanks!
Digging into this more: if any of the subsystems in the libcontainer cgroup manager fails to collect stats, stats collection is aborted for the entire container. Without the logs I can't say which subsystem is failing, but my guess is that something is causing collection to fail for that container. Previously, cAdvisor was not robust to these types of failures, but the latest version included in 1.3 logs the failure and continues in a best-effort fashion. I'm not sure what the remaining action items are here, other than tracking down the logs (tracked on the internal thread). Can we close this issue?
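To make the two failure modes concrete, here is a minimal Go sketch of abort-on-error versus log-and-continue collection. The Subsystem interface and Stats type are simplified stand-ins for illustration, not libcontainer's actual API.

```go
package stats

import (
	"fmt"
	"log"
)

// Stats holds per-subsystem metrics for one container. The field set
// here is illustrative, not the real libcontainer structure.
type Stats struct {
	Values map[string]uint64
}

// Subsystem is a simplified stand-in for one cgroup controller
// (cpu, memory, blkio, ...).
type Subsystem interface {
	Name() string
	GetStats(cgroupPath string, stats *Stats) error
}

// collectAbortOnError models the old behavior: the first failing
// subsystem aborts stats collection for the whole container, so one
// broken controller means no metrics at all for that container.
func collectAbortOnError(subs []Subsystem, path string) (*Stats, error) {
	stats := &Stats{Values: map[string]uint64{}}
	for _, s := range subs {
		if err := s.GetStats(path, stats); err != nil {
			return nil, fmt.Errorf("%s: %v", s.Name(), err)
		}
	}
	return stats, nil
}

// collectBestEffort models the 1.3-era behavior described above:
// log each failure and keep whatever the other subsystems returned.
func collectBestEffort(subs []Subsystem, path string) *Stats {
	stats := &Stats{Values: map[string]uint64{}}
	for _, s := range subs {
		if err := s.GetStats(path, stats); err != nil {
			log.Printf("stats for %s failed on %s, continuing: %v",
				s.Name(), path, err)
		}
	}
	return stats
}
```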
For me, a customer calling GKE support because an important K8s feature (HPA) is not working at all due to this bug is an indication of a P0 issue. The main problem is not that some stack trace is written to some log, but that no metrics are available for a running pod, causing various hard-to-debug disruptions across the system (HPA, dashboards, metrics storage, the upcoming kubectl top, and any usage-based scheduling we may have one day). I'm against calling it P2, because we most likely won't ever come back to it. Not to mention GKE customer satisfaction.
@piosz @jszczepkowski @fgrzadkowski @yujuhong @timstclair: @mwielgus and I communicated offline. The larger issue of what the customer needs is the P0 problem that has to be addressed, and it is not captured well in this issue. The narrower focus here, the fact that metrics are not being collected, is addressed by Tim's comments. We are not going to drop the customer issue, and we are opening another issue to ensure the customer is satisfied; I am not sure this issue is the best place to resolve it. As we learn more, we will update this issue. For now we are going to make it a P2, unless it turns out that the customer info causes us to raise the priority.
Issues go stale after 30d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra, and/or @fejta.
Happens consistently on one of the Kubernetes 1.2.4 nodes.
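For anyone who wants to check a node for this directly, here is a minimal Go sketch that queries the kubelet summary endpoint. It assumes the read-only port 10255 and the /stats/summary path that kubelets of this era exposed by default, and the Summary struct mirrors only a tiny slice of the real response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Summary mirrors a small slice of the kubelet summary API response;
// the full stats.Summary type has many more fields.
type Summary struct {
	Node struct {
		NodeName string `json:"nodeName"`
	} `json:"node"`
	Pods []struct {
		PodRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"podRef"`
	} `json:"pods"`
}

func main() {
	// The kubelet's read-only port; adjust host/port for your cluster.
	resp, err := http.Get("http://localhost:10255/stats/summary")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A non-200 response body containing "Internal Error: ..." is the
	// symptom this issue is about.
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("summary API returned %s", resp.Status))
	}

	var s Summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	fmt.Printf("node %s reports stats for %d pods\n",
		s.Node.NodeName, len(s.Pods))
}
```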