After upgrading to 1.7.0, Kubelet no longer reports cAdvisor stats #48483

Closed
unixwitch opened this issue Jul 5, 2017 · 43 comments

Labels
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@unixwitch

Is this a BUG REPORT or FEATURE REQUEST?: Bug report.

/kind bug

What happened:

I upgraded a cluster from 1.6.6 to 1.7.0. Kubelet no longer reports cAdvisor metrics such as container_cpu_usage_seconds_total on its metrics endpoint (https://node:10250/metrics/). Kubelet's own metrics are still there. cAdvisor itself (http://node:4194/) does show container metrics.

What you expected to happen:

Nothing in the release notes suggests this interface has changed, so I expected the metrics would still be there.

How to reproduce it (as minimally and precisely as possible):

I don't know, but I can reproduce it reliably on this cluster; rebooting or reinstalling nodes doesn't make a difference.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0+coreos.0", GitCommit:"8c1bf133b4129042ef8f7d1ffac1be14ee83ed10", GitTreeState:"clean", BuildDate:"2017-06-30T17:46:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): CoreOS 1409.5.0
  • Kernel (e.g. uname -a): Linux staging-worker-710d.c.torchkube.internal 4.11.6-coreos-r1 #1 SMP Thu Jun 22 22:04:38 UTC 2017 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
  • Install tools: Custom scripts.
  • Others:
@k8s-github-robot

@unixwitch There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
e.g., @kubernetes/sig-api-machinery-* for API Machinery
(2) specifying the label manually: /sig <label>
e.g., /sig scalability for sig/scalability

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 5, 2017
@unixwitch
Author

@kubernetes/sig-node-misc

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 5, 2017
@k8s-ci-robot
Contributor

@unixwitch: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-misc.

In response to this:

@kubernetes/sig-node-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 5, 2017
@dixudx
Member

dixudx commented Jul 5, 2017

@unixwitch This seems to be related to cAdvisor. See whether PR #48485 could fix this.

@unixwitch
Author

Using latest release-1.7 plus 71160031 doesn't seem to make a difference. It logs this at startup now:

Jul 05 10:42:45 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: I0705 10:42:45.483241   21596 cadvisor_linux.go:124] starting cadvisor manager ...
Jul 05 10:42:46 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: I0705 10:42:46.218169   21596 cadvisor_linux.go:124] starting cadvisor manager ...

But the metrics are still missing:

# curl -isSk --cert /var/lib/prometheus/k8s/torchbox-staging-crt.pem --key /var/lib/prometheus/k8s/torchbox-staging-key.pem https://172.31.208.9:10250/metrics | grep container_cpu
#

@unixwitch
Author

I'm not sure if this is related, but Kubelet is also logging this every 10 seconds:

Jul 05 10:53:11 staging-worker-710d.c.torchkube.internal kubelet-test[21596]: W0705 10:53:11.192776   21596 helpers.go:771] eviction manager: no observation found for eviction signal allocatableNodeFs.available

@unixwitch
Author

This looks the same as #47744, but the fix for that was merged before the 1.7.0 release, so I'm not sure why it's still broken.

@Random-Liu

@FarhadF

FarhadF commented Jul 5, 2017

I have the same issue on a newly installed cluster.

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

All container_* metrics are missing in http://10.100.1.3:10254/metrics | grep container_*
I can see other metrics without the grep!

Container metrics are available on curl http://localhost:10255/stats/summary

{
  "node": {
   "nodeName": "k2",
   "systemContainers": [
    {
     "name": "kubelet",
     "startTime": "2017-07-05T16:13:55Z",
     "cpu": {
      "time": "2017-07-05T16:19:30Z",
      "usageNanoCores": 29075162,
      "usageCoreNanoSeconds": 12165039327
     },
     "memory": {
      "time": "2017-07-05T16:19:30Z",
      "usageBytes": 37052416,
      "workingSetBytes": 36323328,
      "rssBytes": 34512896,
      "pageFaults": 123283,
      "majorPageFaults": 10
     },
     "userDefinedMetrics": null
    },
    {
     "name": "runtime",
     "startTime": "2017-07-03T09:30:05Z",
     "cpu": {
      "time": "2017-07-05T16:19:37Z",
      "usageNanoCores": 5825907,
      "usageCoreNanoSeconds": 1184794434270
     },
     "memory": {
      "time": "2017-07-05T16:19:37Z",
      "usageBytes": 646012928,
      "workingSetBytes": 235999232,
      "rssBytes": 60485632,
      "pageFaults": 617224,
      "majorPageFaults": 325
     },
     "userDefinedMetrics": null
    }
   ],
   "startTime": "2017-07-03T09:30:05Z",
   "cpu": {
    "time": "2017-07-05T16:19:37Z",
    "usageNanoCores": 98265931,
    "usageCoreNanoSeconds": 9257390739986
   },
   "memory": {
    "time": "2017-07-05T16:19:37Z",
    "availableBytes": 1477287936,
    "usageBytes": 1241866240,
    "workingSetBytes": 624513024,
    "rssBytes": 647168,
    "pageFaults": 41456,
    "majorPageFaults": 95
   },
   "fs": {
    "time": "2017-07-05T16:19:37Z",
    "availableBytes": 3012079616,
    "capacityBytes": 6166740992,
    "usedBytes": 2821214208,
    "inodesFree": 320027,
    "inodes": 387072,
    "inodesUsed": 67045
   },
   "runtime": {
    "imageFs": {
     "time": "2017-07-05T16:19:37Z",
     "availableBytes": 3012079616,
     "capacityBytes": 6166740992,
     "usedBytes": 801880663,
     "inodesFree": 320027,
     "inodes": 387072,
     "inodesUsed": 67045
    }
   }
  },
  "pods": [
   {
    "podRef": {
     "name": "kubernetes-dashboard-103235509-q4m9d",
     "namespace": "kube-system",
     "uid": "267b4bf8-5fe6-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T12:01:30Z",
    "containers": [
     {
      "name": "kubernetes-dashboard",
      "startTime": "2017-07-03T12:01:31Z",
      "cpu": {
       "time": "2017-07-05T16:19:35Z",
       "usageNanoCores": 1180606,
       "usageCoreNanoSeconds": 61823932041
      },
      "memory": {
       "time": "2017-07-05T16:19:35Z",
       "usageBytes": 23384064,
       "workingSetBytes": 23384064,
       "rssBytes": 22962176,
       "pageFaults": 10201,
       "majorPageFaults": 35
      },
      "rootfs": {
       "time": "2017-07-05T16:19:35Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 135471104,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 12
      },
      "logs": {
       "time": "2017-07-05T16:19:35Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 36864,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:43Z",
     "rxBytes": 9887383,
     "rxErrors": 0,
     "txBytes": 23368295,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-60j8w"
     }
    ]
   },
   {
    "podRef": {
     "name": "node-exporter-60p8r",
     "namespace": "monitoring",
     "uid": "2e6af934-6005-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T15:35:02Z",
    "containers": [
     {
      "name": "node-exporter",
      "startTime": "2017-07-03T15:35:02Z",
      "cpu": {
       "time": "2017-07-05T16:19:30Z",
       "usageNanoCores": 1185574,
       "usageCoreNanoSeconds": 144826707561
      },
      "memory": {
       "time": "2017-07-05T16:19:30Z",
       "usageBytes": 8609792,
       "workingSetBytes": 8609792,
       "rssBytes": 8179712,
       "pageFaults": 4938,
       "majorPageFaults": 9
      },
      "rootfs": {
       "time": "2017-07-05T16:19:30Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 21422080,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 12
      },
      "logs": {
       "time": "2017-07-05T16:19:30Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 28672,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   },
   {
    "podRef": {
     "name": "nginx-ingress-controller-d6h56",
     "namespace": "kube-system",
     "uid": "ce0ecea5-5ff6-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T13:52:08Z",
    "containers": [
     {
      "name": "nginx-ingress-controller",
      "startTime": "2017-07-03T13:52:08Z",
      "cpu": {
       "time": "2017-07-05T16:19:41Z",
       "usageNanoCores": 3253897,
       "usageCoreNanoSeconds": 423721194278
      },
      "memory": {
       "time": "2017-07-05T16:19:41Z",
       "usageBytes": 79507456,
       "workingSetBytes": 79491072,
       "rssBytes": 75460608,
       "pageFaults": 616490,
       "majorPageFaults": 33
      },
      "rootfs": {
       "time": "2017-07-05T16:19:41Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 130162688,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 29
      },
      "logs": {
       "time": "2017-07-05T16:19:41Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 49152,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-60j8w"
     }
    ]
   },
   {
    "podRef": {
     "name": "kube-state-metrics-deployment-1863931462-7ckb2",
     "namespace": "monitoring",
     "uid": "169b8f97-6185-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-05T13:23:09Z",
    "containers": [
     {
      "name": "kube-state-metrics",
      "startTime": "2017-07-05T13:23:09Z",
      "cpu": {
       "time": "2017-07-05T16:19:29Z",
       "usageNanoCores": 593473,
       "usageCoreNanoSeconds": 7616961025
      },
      "memory": {
       "time": "2017-07-05T16:19:29Z",
       "usageBytes": 11620352,
       "workingSetBytes": 11620352,
       "rssBytes": 11276288,
       "pageFaults": 5246,
       "majorPageFaults": 0
      },
      "rootfs": {
       "time": "2017-07-05T16:19:29Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 45719552,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 13
      },
      "logs": {
       "time": "2017-07-05T16:19:29Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 24576,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:38Z",
     "rxBytes": 7551698,
     "rxErrors": 0,
     "txBytes": 3007631,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   },
   {
    "podRef": {
     "name": "grafana-3205277920-3rv9g",
     "namespace": "monitoring",
     "uid": "5ed85a62-6009-11e7-a494-0050568a7e6c"
    },
    "startTime": "2017-07-03T16:05:01Z",
    "containers": [
     {
      "name": "grafana",
      "startTime": "2017-07-03T16:05:02Z",
      "cpu": {
       "time": "2017-07-05T16:19:32Z",
       "usageNanoCores": 1523897,
       "usageCoreNanoSeconds": 302809923832
      },
      "memory": {
       "time": "2017-07-05T16:19:32Z",
       "usageBytes": 71905280,
       "workingSetBytes": 35860480,
       "rssBytes": 12009472,
       "pageFaults": 3290696,
       "majorPageFaults": 14
      },
      "rootfs": {
       "time": "2017-07-05T16:19:32Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 316682240,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 13
      },
      "logs": {
       "time": "2017-07-05T16:19:32Z",
       "availableBytes": 3012079616,
       "capacityBytes": 6166740992,
       "usedBytes": 196608,
       "inodesFree": 320027,
       "inodes": 387072,
       "inodesUsed": 67045
      },
      "userDefinedMetrics": null
     }
    ],
    "network": {
     "time": "2017-07-05T16:19:36Z",
     "rxBytes": 14553836,
     "rxErrors": 0,
     "txBytes": 174770316,
     "txErrors": 0
    },
    "volume": [
     {
      "time": "2017-07-05T16:14:55Z",
      "availableBytes": 1050886144,
      "capacityBytes": 1050898432,
      "usedBytes": 12288,
      "inodesFree": 256558,
      "inodes": 256567,
      "inodesUsed": 9,
      "name": "default-token-f74v5"
     }
    ]
   }
  ]
 }

@dixudx
Member

dixudx commented Jul 6, 2017

@FarhadF But it works well on my newly created cluster.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
$ curl http://localhost:4194/metrics | grep container_*
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP container_cpu_system_seconds_total Cumulative system cpu time consumed in seconds.
# TYPE container_cpu_system_seconds_total counter
container_cpu_system_seconds_total{id="/"} 302.97
container_cpu_system_seconds_total{id="/docker"} 22.5
container_cpu_system_seconds_total{id="/init.scope"} 0.72
container_cpu_system_seconds_total{id="/kubepods"} 37.44
container_cpu_system_seconds_total{id="/kubepods/besteffort"} 37.47
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod541daf716354cf26f8397227012897da"} 13.89
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod82b0a0bc89364213d292b9240a42d1ab"} 2.46
container_cpu_system_seconds_total{id="/kubepods/besteffort/pod82b0a0bc89364213d292b9240a42d1ab/41ecc652971c6f77055b843a22f8eb09d93a354745e1e175e1b1e7d0f823c152/kube-proxy"} 2.4
container_cpu_system_seconds_total{id="/kubepods/besteffort/podcc6968656fd8366efd6c451ff7e122f4"} 14.61
container_cpu_system_seconds_total{id="/kubepods/besteffort/podf70b33a895a6f7d2a84d34fc5af97783"} 6.11
container_cpu_system_seconds_total{id="/kubepods/burstable"} 0
container_cpu_system_seconds_total{id="/system.slice"} 42.52
container_cpu_system_seconds_total{id="/system.slice/audit-rules.service"} 0
container_cpu_system_seconds_total{id="/system.slice/containerd.service"} 1.1
container_cpu_system_seconds_total{id="/system.slice/coreos-setup-environment.service"} 0
....
....

@dchen1107
Member

This looks like a dup of #47744. @dashpole can you please verify this? Thanks!

@dashpole
Contributor

dashpole commented Jul 6, 2017

On a newly created cluster from head, this particular issue appears to be resolved, and is most likely a dup of #47744.

curl localhost:4194/metrics | grep container_cpu_usage_seconds_total
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed per cpu in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{cpu="cpu00",id="/"} 142.443179425
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods"} 87.230398293
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/besteffort"} 0.141097833
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/besteffort/podb9ffb65e628276cfe6b3ab57640baa55"} 0.141097833
container_cpu_usage_seconds_total{cpu="cpu00",id="/kubepods/burstable"} 86.840747259
...

@unixwitch
Author

I'm not sure this is #47744 because it was still broken for me with 1.7.1-beta.0.3 Kubelet (with 1.7.0 master). That build does have e90c477 in it, which I thought was the fix for #47744.

I can bring up a test cluster to see if this is related to upgrading, but I imagine that's unlikely. Maybe it's affected by command-line options or system configuration? (Running in rkt vs. on the host made no difference for me.)

@unixwitch
Author

New cluster with 1.7.1-beta.0.3 Kubelet:

test48483-master-mgtc ~ # kubectl get --all-namespaces pod -owide | grep test48483-worker-ng6f
2017-07-06 19:52:59.358716 I | proto: duplicate proto type registered: google.protobuf.Any
2017-07-06 19:52:59.358891 I | proto: duplicate proto type registered: google.protobuf.Duration
2017-07-06 19:52:59.358960 I | proto: duplicate proto type registered: google.protobuf.Timestamp
kube-lego       kube-lego-4240885720-wqfv7                      1/1       Running   0          1m        172.29.1.3      test48483-worker-ng6f
kube-system     calico-node-vb5l6                               1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
kube-system     kube-proxy-test48483-worker-ng6f                1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
kube-system     kube-state-metrics-1811189913-0fvmh             1/1       Running   0          1m        172.29.1.4      test48483-worker-ng6f
kube-system     kube-state-metrics-1811189913-5fm84             1/1       Running   0          1m        172.29.1.2      test48483-worker-ng6f
kube-system     node-exporter-jj4jd                             1/1       Running   0          1m        172.31.208.25   test48483-worker-ng6f
test48483-master-mgtc ~ # curl -sS http://test48483-worker-ng6f:10255/metrics|grep container_
# HELP kubelet_running_container_count Number of containers currently running
# TYPE kubelet_running_container_count gauge
kubelet_running_container_count 6
kubelet_runtime_operations{operation_type="container_status"} 6
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.5"} 2337
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.9"} 3912
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.99"} 3912
kubelet_runtime_operations_latency_microseconds_sum{operation_type="container_status"} 19886
kubelet_runtime_operations_latency_microseconds_count{operation_type="container_status"} 6
test48483-master-mgtc ~ # 

@dashpole
Contributor

dashpole commented Jul 6, 2017

@unixwitch I finally realized you are using the wrong port. 10255 is the kubelet's port for prometheus metrics. As you can see, it gives a metric for runtime operation latency. Port 4194 is the cadvisor port, which has container metrics. See if that works.

@unixwitch
Author

@dashpole The problem is that in 1.6 and earlier, port 10255 returned cAdvisor container metrics. The fact that it no longer does is an incompatible change that has broken Prometheus, which uses this port to scrape from: https://github.com/prometheus/prometheus/blob/release-1.7/discovery/kubernetes/node.go#L156

If this was intentionally changed, shouldn't there have been an entry in the release notes?

Does this also mean it's now impossible to scrape container metrics over TLS (which worked before using port 10250)? That seems like a significant regression in functionality.

@smarterclayton
Contributor

This does seem like a regression in behavior.

@dchen1107
Member

dchen1107 commented Jul 7, 2017

@luxas is this caused by your change on cAdvisor availability: kubernetes/release#356?

@luxas
Member

luxas commented Jul 7, 2017

@dchen1107 No, definitely not. That was disabling the public cAdvisor port for kubeadm setups only.

It's reported that custom install scripts were used, and this happened even though cAdvisor was publicly accessible.

@luxas
Member

luxas commented Jul 7, 2017

This seems very kubelet-internal. Also note the error log message attached above.

@unixwitch
Author

I wasn't aware of kubernetes/release#356, but if I understand it right, this means a cluster installed by kubeadm has no way to access cAdvisor metrics from Prometheus at all (without manual configuration by the administrator): they are no longer exposed by Kubelet, and they can't be retrieved from cAdvisor directly because its HTTP server is disabled.

It seems to me that disabling cAdvisor by default is a good idea (metrics should not be exposed to the world without authentication), and that the new behaviour in Kubelet should be reverted so that metrics are once again available behind authentication. That said, it's still not clear to me whether the Kubelet change was intentional, and if so, what the rationale for it was.

@unixwitch
Author

(As an aside, I was planning to disable cAdvisor with --cadvisor-port=0 in our clusters to avoid exposing unauthenticated metrics, but I had to revert that for 1.7.0 because of this change; so even though we don't use kubeadm, this is still a functionality regression for us, even if we can work around it.)

@luxas
Member

luxas commented Jul 7, 2017

I'm still pretty sure cAdvisor is running just fine and pretty much everything still works even if you disable the cAdvisor public port. cAdvisor runs inside of the kubelet and is still accessible at <node-ip>:10250/stats/ IIRC. That endpoint shows you everything cAdvisor would have shown, in an unauthenticated manner.

However, to stay focused, I think that is unrelated to the issue at hand here. Even when cAdvisor is externally accessible, the kubelet won't show these container metrics in its API, right?

Which is indeed a regression from v1.6.

@unixwitch
Author

unixwitch commented Jul 7, 2017

cAdvisor is run inside of the kubelet and still accessible at :10250/stats/

But this outputs JSON, which Prometheus doesn't understand. There is no way to collect the metrics in Prometheus format any more, at least in kubeadm's default configuration. (Edit: unless there's a way to make /stats/ output the metrics in Prometheus format. But I couldn't find any documentation suggesting that is the case.)

I think that that is unrelated to the issue being present here

Well, the two changes are unrelated, yes. But the combination of both together is quite unfortunate for Prometheus users as both existing sources of Prometheus-format cAdvisor metrics have been disabled at the same time.

Even though cAdvisor is externally accessible kubelet won't show these container metrics in its API, right?

Right. The only way to collect the metrics in Prometheus format is via the cAdvisor HTTP server.

@luxas
Member

luxas commented Jul 8, 2017

So the right thing to do now is to investigate what made the kubelet stop reporting cAdvisor container metrics on its own /metrics endpoint in all cases.

Hopefully we can patch this and restore the v1.6 behavior.

@dashpole
Contributor

cc @grobie
Ok, so I have tracked the issue down to google/cadvisor#1460.
Specifically, changing prometheus.MustRegister( to r := prometheus.NewRegistry(); r.MustRegister( caused the metrics to no longer be displayed on the kubelet's port 10250/metrics, and only on port 4194/metrics.
Based on the original issue, I don't think this behavior was intended, although I could be wrong.
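For anyone following the mechanics, here is a minimal, self-contained sketch (not the actual kubelet/cAdvisor code; the metric name and paths are made up for illustration) of why moving to a private registry makes the metrics vanish from a handler that serves the global default registry, which is essentially what happened to the kubelet's /metrics endpoint:

```go
// Minimal sketch, not the real kubelet/cAdvisor wiring: a collector registered
// only with a private registry no longer appears on a handler built from the
// global default registry (standing in here for the kubelet's /metrics).
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical stand-in for a cAdvisor container metric.
	demo := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "container_demo_seconds_total",
		Help: "Placeholder for a cAdvisor container metric.",
	})

	// Pre-1.7 style: global registration; promhttp.Handler() (default registry)
	// would expose the metric.
	//   prometheus.MustRegister(demo)

	// Post google/cadvisor#1460 style: private registry; the metric is only
	// served where this registry is explicitly wired up (cAdvisor's own port).
	r := prometheus.NewRegistry()
	r.MustRegister(demo)

	http.Handle("/metrics", promhttp.Handler())                                      // no container_* metrics here
	http.Handle("/cadvisor/metrics", promhttp.HandlerFor(r, promhttp.HandlerOpts{})) // container_* metrics here
	http.ListenAndServe(":8080", nil)
}
```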

@luxas
Member

luxas commented Jul 11, 2017

If the consensus is the kubelet should always unconditionally export cAdvisor metrics on the kubelet port, it'd be necessary to register the cAdvisor collector on that metrics handler as well.

That has been the case earlier, and is a behavior we must/should continue to have.

I wonder though how someone would be able to disable cAdvisor metrics in kubelets, or is that not desired?

We haven't had a flag so far, so having that reporting always on for now is fine. We might be able to add a flag, but no one has asked for it AFAIK, so for now it makes sense to always enable.
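If such a flag were ever added, the wiring could be as simple as conditionally registering the collector. A minimal sketch, assuming a hypothetical --enable-cadvisor-metrics flag (no such kubelet flag exists today) and a placeholder metric:

```go
// Sketch only: a hypothetical --enable-cadvisor-metrics flag gating whether
// the cAdvisor collector is registered with the kubelet's metrics registry.
package main

import (
	"flag"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	enableCadvisorMetrics := flag.Bool("enable-cadvisor-metrics", true,
		"expose container metrics on the kubelet metrics endpoint (hypothetical flag)")
	flag.Parse()

	kubeletReg := prometheus.NewRegistry()

	// Placeholder standing in for the cAdvisor collector.
	containerDemo := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "container_demo_metric",
		Help: "Placeholder for a cAdvisor container metric.",
	}, func() float64 { return 1 })

	if *enableCadvisorMetrics {
		kubeletReg.MustRegister(containerDemo)
	}

	http.Handle("/metrics", promhttp.HandlerFor(kubeletReg, promhttp.HandlerOpts{}))
	http.ListenAndServe(":10255", nil)
}
```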

@luxas
Member

luxas commented Jul 11, 2017

My understanding from the comments so far is that it is desirable to be able to disable unauthenticated metrics ports but the normal, authenticated kubelet port should unconditionally include these metrics?

Correct

That may change in a future release but it was the case so far and the 1.6->1.7 change was unintentional and unannounced.

Correct, and I think we're planning to fix it in a v1.7 patch release

@brian-brazil

It's my opinion that cAdvisor stats don't belong mixed in with the stats of the kubelet itself. These stats have different audiences (one is the cluster admin, the other is roughly cluster users), and putting them out through the same endpoint means that if, for example, you're a cluster admin, you have to filter out all these uninteresting (and expensive) metrics just to see kubelet health.

@luxas
Member

luxas commented Jul 11, 2017

@brian-brazil Happy to have that discussion in sig-instrumentation, but IMO, it's more important to fix this issue, get things back to normal, and then plan for a possible deprecation and removal (after ~6 months) of the feature when we have a viable alternative.

@grobie
Contributor

grobie commented Jul 11, 2017

I will be working on a fix and will hopefully send a PR tomorrow.

@alindeman
Contributor

@grobie Do you expect to change it back so that :10255/metrics includes cAdvisor metrics? Or will the fix be something different? I ask because this broke prometheus-operator's ability to scrape cAdvisor metrics, and I'm wondering if I should propose a change to prometheus-operator to look for metrics on the cAdvisor port, or just hold out for cAdvisor metrics to come back on port 10255.

@grobie
Contributor

grobie commented Jul 12, 2017

@alindeman I understood the request to bring back cAdvisor metrics on :10255/metrics for now to restore the 1.6 behavior.

I'm still trying to find the best way to restore the old behavior and test the fix, and given the recent events at SoundCloud I'm also quite busy at the moment, but should have a PR ready by tomorrow.

@alindeman
Contributor

@grobie Thanks for working on it ❤️

@smarterclayton
Contributor

smarterclayton commented Jul 18, 2017

We could potentially reintroduce this at a new, cAdvisor-specific host endpoint such as :10250/metrics/cadvisor and also correct some of the consistency issues mentioned in #45053. I agree about the cost profile of the metrics: it's likely you'd want to scrape the kubelet and this endpoint separately.

I have a quick patch that mostly cleanly puts cadvisor registration at the new path. While keeping exact compatibility is desirable, I don't think moving scrapes to a new path violates the looser API guarantees on the metrics endpoints if we can improve the scalability of the collectors at the same time. Unsecured metrics are a bigger problem, especially where we are regressing from securing them with the kubelet security profile to a lower (even if local) level.
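To make the shape of that concrete, here is a rough sketch of the idea (not the actual patch; the registries are left empty and the server is plain HTTP only to keep it self-contained, whereas the real kubelet serves this behind TLS and authentication on port 10250):

```go
// Rough sketch: kubelet metrics and cAdvisor container metrics served on the
// same port but under separate paths, backed by separate registries.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	kubeletReg := prometheus.NewRegistry()  // kubelet's own metrics would be registered here
	cadvisorReg := prometheus.NewRegistry() // cAdvisor's container metrics would be registered here

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(kubeletReg, promhttp.HandlerOpts{}))
	mux.Handle("/metrics/cadvisor", promhttp.HandlerFor(cadvisorReg, promhttp.HandlerOpts{}))

	// Plain HTTP here only to keep the sketch self-contained; the real kubelet
	// wraps this in its existing auth/TLS stack.
	http.ListenAndServe(":10250", mux)
}
```

With that split, a scraper can treat /metrics and /metrics/cadvisor as separate jobs with their own scrape settings.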

@smarterclayton
Contributor

@DirectXMan12 I'm inclined to do the separation, but on the main port. Opinions?

@smarterclayton smarterclayton added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Jul 18, 2017
@fgrzadkowski
Contributor

@kubernetes/sig-instrumentation-bugs @piosz

k8s-github-robot pushed a commit that referenced this issue Jul 19, 2017
Automatic merge from submit-queue

Restore cAdvisor prometheus metrics to the main port

But under a new path - `/metrics/cadvisor`. This ensures a secure port still exists for metrics while getting the benefit of separating out container metrics from the kubelet's metrics as recommended in the linked issue.

Fixes #48483

```release-note-action-required
Restored cAdvisor prometheus metrics to the main port -- a regression that existed in v1.7.0-v1.7.2
cAdvisor metrics can now be scraped from `/metrics/cadvisor` on the kubelet ports.
Note that you have to update your scraping jobs to get kubelet-only metrics from `/metrics` and `container_*` metrics from `/metrics/cadvisor`
```
@grobie
Contributor

grobie commented Jul 19, 2017

Thanks a lot for picking this up @smarterclayton. I got a bit stuck writing an acceptance test for the expected metrics under /metrics. While it's breaking compatibility with 1.6, I think splitting metrics in general makes sense.

@luxas
Member

luxas commented Jul 19, 2017

We should definitely have a conformance test for this now -- feel free to write one @grobie :)

unixwitch added a commit to unixwitch/prometheus that referenced this issue Jul 19, 2017
Kubernetes 1.7+ no longer exposes cAdvisor metrics on the Kubelet
metrics endpoint.  Update the example configuration to scrape cAdvisor
in addition to Kubelet.  The provided configuration works for 1.7.3+
and commented notes are given for 1.7.2 and earlier versions.

Also remove the comment about node (Kubelet) CA not matching the master
CA.  Since the example no longer connects directly to the nodes, it
doesn't matter what CA they're using.

References:

- kubernetes/kubernetes#48483
- kubernetes/kubernetes#49079
juliusv pushed a commit to prometheus/prometheus that referenced this issue Jul 21, 2017
@hanikesn

Sorry to hijack this issue, but there's clearly a problem with the cAdvisor endpoint in 1.7.1: it randomly reports either systemd cgroups or Docker containers, e.g. for container_memory_usage_bytes.

@matthiasr

Please don't hijack issues; it just creates confusion. Once this change is released (presumably with 1.7.3), or if you build from the release branch before then, please confirm whether your issue persists. If it does, it's a new issue; please file it separately. If it doesn't, it was probably related but has already been dealt with.

@zz

zz commented Dec 8, 2017

If you installed Prometheus with Helm, add a kubernetes-cadvisors job to your Prometheus config to fix the missing container_* metrics:

      - job_name: 'kubernetes-cadvisors'

        # Default to scraping over https. If required, just disable this or change to
        # `http`.
        scheme: https

        # This TLS & bearer token file config is used to connect to the actual scrape
        # endpoints for cluster components. This is separate to discovery auth
        # configuration because discovery & scraping are two separate concerns in
        # Prometheus. The discovery auth config is automatic if Prometheus runs inside
        # the cluster. Otherwise, more config options have to be provided within the
        # <kubernetes_sd_config>.
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          # If your node certificates are self-signed or use a different CA to the
          # master CA, then disable certificate verification below. Note that
          # certificate verification is an integral part of a secure infrastructure
          # so this should only be disabled in a controlled environment. You can
          # disable certificate verification by uncommenting the line below.
          #
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
          - role: node

        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}:4194/proxy/metrics
