After upgrading to 1.7.0, Kubelet no longer reports cAdvisor stats #48483
Comments
@unixwitch There are no sig labels on this issue. Please add a sig label by:
@kubernetes/sig-node-misc
@unixwitch: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@unixwitch seems to be related to cadvisor. See whether PR #48485 could fix this.
Using latest release-1.7 plus 71160031 doesn't seem to make a difference. It logs this at startup now:
But the metrics are still missing:
I'm not sure if this is related, but Kubelet is also logging this every 10 seconds:
This looks the same as #47744, but the fix for that was merged before the 1.7.0 release, so I'm not sure why it's still broken.
I have the same issue on a newly installed cluster: all container metrics are missing from the Kubelet metrics endpoint, though they are available from cAdvisor.
@FarhadF But it works well on my newly created cluster.
On a newly created cluster from head, this particular issue appears to be resolved, and is most likely a dup of #47744.
I'm not sure this is #47744 because it was still broken for me with the 1.7.1-beta.0.3 Kubelet (with a 1.7.0 master). That build does have e90c477 in it, which I thought was the fix for #47744. I can bring up a test cluster to see if this is related to upgrading, but I imagine that's unlikely. Maybe it's affected by command-line options or system configuration? (Running in rkt vs on the host made no difference for me.)
New cluster with 1.7.1-beta.0.3 Kubelet:
@unixwitch I finally realized you are using the wrong port. 10255 is the kubelet's port for prometheus metrics. As you can see, it gives a metric for runtime operation latency. Port 4194 is the cadvisor port, which has container metrics. See if that works.
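A quick way to check which of those default ports expose `container_*` series is to fetch each one and scan the Prometheus exposition text. Below is a minimal Go sketch; the hostname `node` is a placeholder, and it assumes the default read-only port 10255 and cAdvisor port 4194 with anonymous HTTP access:

```go
// Probe the kubelet read-only port and the cAdvisor port and report whether
// each exposes any container_* Prometheus series. Hostname and ports are
// placeholders for the defaults discussed in this thread.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func hasContainerSeries(url string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // metrics pages can be large
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "container_") {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	endpoints := []string{
		"http://node:10255/metrics", // kubelet read-only port
		"http://node:4194/metrics",  // cAdvisor port (if enabled)
	}
	for _, url := range endpoints {
		found, err := hasContainerSeries(url)
		if err != nil {
			fmt.Printf("%s: error: %v\n", url, err)
			continue
		}
		fmt.Printf("%s: container_* series present: %v\n", url, found)
	}
}
```

On 1.6 both endpoints report `true`; on 1.7.0 only the cAdvisor port does, which is the regression described in this issue.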
@dashpole The problem is that in 1.6 and earlier, port 10255 returned cAdvisor container metrics. The fact that it no longer does is an incompatible change which has broken Prometheus, which uses this port to scrape from: https://github.com/prometheus/prometheus/blob/release-1.7/discovery/kubernetes/node.go#L156

If this was intentionally changed, shouldn't there have been an entry in the release notes? Does this also mean it's now impossible to scrape container metrics over TLS (which worked before using port 10250)? That seems like a significant regression in functionality.
This does seem like a regression in behavior.
@luxas is this caused by your change on cAdvisor availability: kubernetes/release#356?
@dchen1107 No, definitely not. That change only disabled the public cAdvisor port for kubeadm setups. The reporter used custom setup scripts, and this happened even though cAdvisor was publicly accessible.
This seems very kubelet-internal. Also notice the error log message attached above.
I wasn't aware of kubernetes/release#356, but if I understand it right, it means a cluster installed by kubeadm has no way to access cAdvisor metrics from Prometheus at all (without manual configuration by the administrator): they are no longer exposed by the Kubelet, and they can't be retrieved from cAdvisor directly because its HTTP server is disabled.

It seems to me that disabling cAdvisor by default is a good idea (metrics should not be exposed to the world without authentication) and that the new behaviour in the Kubelet should be reverted so that metrics are once again available behind authentication. That said, it's still not clear to me whether the Kubelet change was intentional, and if so, what the rationale for it was.
(As an aside, I was planning to disable cAdvisor with |
I'm still pretty sure cAdvisor is running just fine and pretty much everything still works even if you disable the cAdvisor public port: cAdvisor runs inside the kubelet and is still accessible through the kubelet's /stats/ endpoint.

However, in order to stay focused, I think that is unrelated to the issue reported here. Even though cAdvisor is externally accessible, the kubelet won't show these container metrics in its API, right? Which is indeed a regression from v1.6.
But this outputs JSON, which Prometheus doesn't understand. There is no way to collect the metrics in Prometheus format any more, at least in kubeadm's default configuration. (Edit: unless there's a way to make /stats/ output the metrics in Prometheus format. But I couldn't find any documentation suggesting that is the case.)
Well, the two changes are unrelated, yes. But the combination of both together is quite unfortunate for Prometheus users as both existing sources of Prometheus-format cAdvisor metrics have been disabled at the same time.
Right. The only way to collect the metrics in Prometheus format is via the cAdvisor HTTP server.
So the right thing to do here now is to investigate what made the kubelet stop reporting cAdvisor container metrics on its own metrics endpoint. Hopefully we can patch this and restore the v1.6 behavior.
cc @grobie
That has been the case earlier, and is a behavior we must/should continue to have.
We haven't had a flag so far, so having that reporting always on for now is fine. We might be able to add a flag, but no one has asked for one AFAIK, so for now it makes sense to always enable it.
Correct
Correct, and I think we're planning to fix it in a v1.7 patch release |
It's my opinion that cadvisor stats don't belong mixed in with the stats of the kubelet itself. These stats have different audiences (one is the cluster admin, the other is roughly cluster users), and putting them out through the same endpoint means that if, for example, you're a cluster admin you have to filter out all these uninteresting (and expensive) metrics just to see kubelet health.
@brian-brazil Happy to have that discussion in sig-instrumentation, but IMO, it's more important to fix this issue, get things back to normal, and then plan for a possible deprecation and removal (after ~6 months) of the feature when we have a viable alternative.
I will be working on a fix, will send a PR tomorrow hopefully.
@grobie Do you expect to change it back so that the kubelet's main metrics endpoint includes cAdvisor metrics again?
@alindeman I understood the request as bringing the cAdvisor metrics back on the kubelet's metrics endpoint. I'm still trying to find the best way to restore the old behavior and test the fix, and given the recent events at SoundCloud I'm also quite busy at the moment, but I should have a PR ready by tomorrow.
@grobie Thanks for working on it ❤️
We could potentially reintroduce this at a new, cadvisor-specific host endpoint. I have a quick patch that mostly cleanly puts cadvisor registration at the new path. While keeping exact compatibility is desirable, I don't think moving scrapes to a new path violates the looser API guarantees on the metrics endpoints if we can improve the scalability of the collectors at the same time.

Unsecured metrics are a bigger problem, especially where we are regressing from securing them with the kubelet security profile to a lower (even if local) level.
@DirectXMan12 I'm inclined to do the separation but on the main port - opinions?
@kubernetes/sig-instrumentation-bugs @piosz
Automatic merge from submit-queue

Restore cAdvisor prometheus metrics to the main port

But under a new path - `/metrics/cadvisor`. This ensures a secure port still exists for metrics while getting the benefit of separating out container metrics from the kubelet's metrics, as recommended in the linked issue.

Fixes #48483

```release-note-action-required
Restored cAdvisor prometheus metrics to the main port -- a regression that existed in v1.7.0-v1.7.2. cAdvisor metrics can now be scraped from `/metrics/cadvisor` on the kubelet ports. Note that you have to update your scraping jobs to get kubelet-only metrics from `/metrics` and `container_*` metrics from `/metrics/cadvisor`.
```
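To make the shape of that change concrete, here is a standalone sketch using prometheus/client_golang — not the kubelet's actual code — that serves one registry at `/metrics` and a second one at `/metrics/cadvisor` on the same port. The metric names, labels, and port are illustrative placeholders:

```go
// Sketch: kubelet-only metrics at /metrics, container metrics at
// /metrics/cadvisor, both on one listener. Names and port are illustrative.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Registry for the kubelet's own operational metrics.
	kubeletReg := prometheus.NewRegistry()
	runtimeOps := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "kubelet_runtime_operations",
			Help: "Illustrative kubelet-level metric.",
		},
		[]string{"operation_type"},
	)
	kubeletReg.MustRegister(runtimeOps)
	runtimeOps.WithLabelValues("list_images").Inc()

	// Separate registry for container (cAdvisor-style) metrics, so the two
	// audiences can scrape different paths.
	cadvisorReg := prometheus.NewRegistry()
	containerCPU := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "container_cpu_usage_seconds_total",
			Help: "Illustrative container-level metric.",
		},
		[]string{"pod_name", "container_name"},
	)
	cadvisorReg.MustRegister(containerCPU)
	containerCPU.WithLabelValues("example-pod", "example-container").Add(1.5)

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(kubeletReg, promhttp.HandlerOpts{}))
	mux.Handle("/metrics/cadvisor", promhttp.HandlerFor(cadvisorReg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":10255", mux))
}
```

Scraping `/metrics` then returns only the kubelet-level series, while `/metrics/cadvisor` returns the `container_*` series, matching the split described in the release note.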
Thanks a lot for picking this up @smarterclayton. I got a bit stuck writing an acceptance test for the expected metrics under the new `/metrics/cadvisor` path.
We should definitely have a conformance test for this now -- feel free to write one @grobie :) |
Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet metrics endpoint. Update the example configuration to scrape cAdvisor in addition to the Kubelet. The provided configuration works for 1.7.3+, and commented notes are given for 1.7.2 and earlier versions.

Also remove the comment about the node (Kubelet) CA not matching the master CA. Since the example no longer connects directly to the nodes, it doesn't matter what CA they're using.

References:
- kubernetes/kubernetes#48483
- kubernetes/kubernetes#49079
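At the HTTP level, "scrape cAdvisor in addition to Kubelet" amounts to fetching both paths on each node. A rough Go sketch; the node name and bearer token are placeholders, and certificate verification is deliberately relaxed for illustration only (a real setup would verify the kubelet's serving certificate):

```go
// Sketch: fetch kubelet-only metrics and container metrics from the secure
// kubelet port after the /metrics/cadvisor split (1.7.3+). Node name and
// token are placeholders; InsecureSkipVerify is for illustration only.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func fetch(client *http.Client, url, token string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return len(body), err
}

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Demo only: skip certificate verification.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	token := "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"
	for _, path := range []string{"/metrics", "/metrics/cadvisor"} {
		n, err := fetch(client, "https://node:10250"+path, token)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		fmt.Printf("%s: fetched %d bytes of Prometheus exposition text\n", path, n)
	}
}
```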
Sorry to hijack this issue, but there's clearly a problem with the cadvisor endpoint in 1.7.1: it randomly reports either systemd cgroups or docker containers for a given metric.
Please don't hijack issues, it just creates confusion. Once this change is released (presumably with 1.7.3), or when building from the release branch before that, please confirm whether your issue persists. If it does, it's a new issue, please file it separately. If it doesn't, it was probably related, but is already dealt with.
Is this a BUG REPORT or FEATURE REQUEST?: Bug report.
/kind bug
What happened:
I upgraded a cluster from 1.6.6 to 1.7.0. Kubelet no longer reports cAdvisor metrics such as container_cpu_usage_seconds_total on its metrics endpoint (https://node:10250/metrics/). Kubelet's own metrics are still there. cAdvisor itself (http://node:4194/) does show container metrics.
What you expected to happen:
Nothing in the release notes suggests this interface has changed, so I expected the metrics would still be there.
How to reproduce it (as minimally and precisely as possible):
I don't know, but I can reproduce it reliably on this cluster; rebooting or reinstalling nodes doesn't make a difference.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0+coreos.0", GitCommit:"8c1bf133b4129042ef8f7d1ffac1be14ee83ed10", GitTreeState:"clean", BuildDate:"2017-06-30T17:46:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Kernel (uname -a): Linux staging-worker-710d.c.torchkube.internal 4.11.6-coreos-r1 #1 SMP Thu Jun 22 22:04:38 UTC 2017 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux