Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kubelet PVC stale metrics #59170

Merged

Conversation

cofyc
Copy link
Member

@cofyc cofyc commented Feb 1, 2018

What this PR does / why we need it:

Volumes on each node changes, we should not only add PVC metrics into
gauge vector. It's better use a collector to collector metrics from internal
stats.

Currently, if a PV (bound to a PVC testpv) is attached and used by node A, then migrated to node B or just deleted from node A later. testpvc metrics will not disappear from kubelet on node A. After a long running time, kubelet process will keep a lot of stale volume metrics in memory.

For these dynamic metrics, it's better to use a collector to collect metrics from a data source (StatsProvider here), like kube-state-metrics scraping metrics from kube-apiserver.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57686

Special notes for your reviewer:

Release note:

Fix kubelet PVC stale metrics

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 1, 2018
@cofyc
Copy link
Member Author

cofyc commented Feb 1, 2018

/cc @gnufied @wongma7

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 1, 2018
@cofyc cofyc changed the title Fix kubelet PVC metrics using a volume stats collector. Fix kubelet PVC stale metrics Feb 1, 2018
@zhangxiaoyu-zidif
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 1, 2018
@cofyc
Copy link
Member Author

cofyc commented Feb 1, 2018

@zhangxiaoyu-zidif Thanks!

@cofyc cofyc force-pushed the fix_kubelet_volume_metrics branch from 05aab55 to ff8f139 Compare February 1, 2018 07:44

return fmt.Errorf(`metric output does not match expectation; want:
%s

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run 'gofmt' =)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cofyc
Copy link
Member Author

cofyc commented Feb 1, 2018

/sig storage
/assign @gnufied @jingxu97

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Feb 1, 2018
@jingxu97
Copy link
Contributor

jingxu97 commented Feb 1, 2018

@cofyc Thanks for your PR. I am ok with this PR, but will wait for @gnufied to review in more detail since I am very familiar with area.
One question is that besides PVC metrics, do you notice any other metrics have the same problem?

@cofyc
Copy link
Member Author

cofyc commented Feb 2, 2018

@jingxu97 Just checked other metrics, they don't have the same problem.

cofyc added a commit to cofyc/kubernetes that referenced this pull request Feb 2, 2018
Volumes on each node changes, we should not only add PVC metrics into
gauge vector. It's better use a collector to collector metrics from
stats.

Sync from kubernetes#59170 with some
modifications;

- Use SummaryProvider instead of StatsProvider
- Remove test files.
- Generate build files.
@jingxu97
Copy link
Contributor

jingxu97 commented Feb 2, 2018

@cofyc thank you for checking. What metrics did you check and why pvc metric has the problem and others don't? Thanks!

@cofyc
Copy link
Member Author

cofyc commented Feb 2, 2018

hi, @jingxu97

I've checked all other metrics:

PVC metrics have problem, because they use namespace and pvc names as label values. WithLabelValues(labelValues) will create new metric for each new label values and PVC may be migrated to another node or deleted from cluster, but it's metrics will not to be cleared. So, after a long running time, kubelet will keep a lot of stale volume metrics in memory.

Theoretically there are two ways to fix:

  • Clear its metrics when PVC is removed.
  • Use a collector to generate all PVC metrics from stats when collecting.

The second is good and easier way to go for PVC metrics. It also decouples statistics and metrics generating.

}
addGauge := func(desc *prometheus.Desc, pvcRef *stats.PVCReference, v float64, lv ...string) {
lv = append([]string{pvcRef.Namespace, pvcRef.Name}, lv...)
ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, v, lv...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer that we did not use version that causes panics.

Copy link
Member Author

@cofyc cofyc Feb 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Log a warning message if NewConstMetric emits an error.

podStats = []statsapi.PodStats{
{
PodRef: statsapi.PodReference{Name: "test-pod", Namespace: "test-namespace", UID: "UID_test-pod"},
StartTime: metav1.NewTime(time.Now()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a metav1.Now()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@cofyc
Copy link
Member Author

cofyc commented Feb 9, 2018

@jsafrane

That's weird.

curl -s http://localhost:10255/metrics | grep '^kubelet_volume'

Could you check this output on each node instead of prometheus?

If all PVC volumes on node were correctly unmounted and detached, kubelet /metrics endpoint should have no kubelet_volume_xxx metrics.

@jsafrane
Copy link
Member

jsafrane commented Feb 9, 2018

@cofyc, weird, I can't reproduce the issue today. I recompiled and reinstalled everything and it's gone. Sorry for the noise. Tested and LGTM from my side :-)

@gnufied
Copy link
Member

gnufied commented Feb 9, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2018
@rootfs
Copy link
Contributor

rootfs commented Feb 9, 2018

/assign @dchen1107 @derekwaynecarr for approval

continue
}
pvcUniqStr := pvcRef.Namespace + pvcRef.Name
if allPVCs.Has(pvcUniqStr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks as though pvcRef can be used as a map key.

I would prefer to gather everything into a map[stats.PVCReference]stats.VolumeStats then iterate over that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbandy sets.String is implemented by a map. It has no performance issue here and we can generate metrics in one iteration.

// ignore if no PVC reference
continue
}
pvcUniqStr := pvcRef.Namespace + pvcRef.Name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really unique... a/bc and ab/c would collide.

Copy link
Member Author

@cofyc cofyc Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbandy Thanks! I should use / to separate namespace and name. It's the way to generate unique key for objects in client-go cache store. Fixed now: fecff55#diff-ae1cd608f0ca9bcb432eeeca0cf3cde0R106
@gnufied Sorry, please take a look at again.

Volumes on each node changes, we should not only add PVC metrics into
gauge vector. It's better use a collector to collector metrics from
stats.
@cofyc cofyc force-pushed the fix_kubelet_volume_metrics branch from 9502c28 to fecff55 Compare February 11, 2018 15:48
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2018
@jsafrane
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2018
@jsafrane
Copy link
Member

@kubernetes/sig-node-pr-reviews, please approve

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 12, 2018
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Kubelet changes look good

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cofyc, derekwaynecarr, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2018
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit bfdd94c into kubernetes:master Feb 16, 2018
k8s-github-robot pushed a commit that referenced this pull request Feb 24, 2018
…upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #59170: Fix kubelet PVC metrics using a volume stats collector.

Cherry pick of #59170 on release-1.8.

#59170: Fix kubelet PVC metrics using a volume stats collector.
@cofyc cofyc deleted the fix_kubelet_volume_metrics branch March 14, 2018 03:42
k8s-github-robot pushed a commit that referenced this pull request Apr 12, 2018
…upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #59170: Fix kubelet PVC metrics using a volume stats collector.

Cherry pick of #59170 on release-1.9.

#59170: Fix kubelet PVC metrics using a volume stats collector.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus metrics for PVC showing stale volume data