Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add dynamic config metrics #57527

Merged
merged 1 commit into from
May 24, 2018
Merged

Conversation

mtaufen
Copy link
Contributor

@mtaufen mtaufen commented Dec 21, 2017

This PR exports config-releated metrics from the Kubelet.
The Guages for active, assigned, and last-known-good config can be used
to identify config versions and produce aggregate counts across several
nodes. The error-reporting Gauge can be used to determine whether a node
is experiencing a config-related error, and to prodouce an aggregate
count of nodes in an error state.

kubernetes/enhancements#281

The Kubelet now exports metrics that report the assigned (node_config_assigned), last-known-good (node_config_last_known_good), and active (node_config_active) config sources, and a metric indicating whether the node is experiencing a config-related error (node_config_error). The config source metrics always report the value 1, and carry the node_config_name, node_config_uid, node_config_resource_version, and node_config_kubelet_key labels, which identify the config version. The error metric reports 1 if there is an error, 0 otherwise.

@mtaufen mtaufen added area/kubelet area/kubelet-api area/monitoring do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Dec 21, 2017
@mtaufen mtaufen self-assigned this Dec 21, 2017
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 21, 2017
@mtaufen mtaufen added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 21, 2017
@mtaufen mtaufen force-pushed the kc-metric branch 3 times, most recently from e4fec46 to 1c6b5e2 Compare December 24, 2017 23:01
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 24, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 4, 2018
@mtaufen mtaufen changed the title (WIP) node config source metrics (WIP) PoC node config source metrics Jan 19, 2018
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 19, 2018
@mtaufen mtaufen changed the title (WIP) PoC node config source metrics PoC node config source metrics Jan 19, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 19, 2018
Copy link

@crassirostris crassirostris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach SGTM


func SetAssignedNodeConfigSource(val string) {
// TODO(mtaufen): will this eliminate previous metric, or generate a parallel one?
AssignedNodeConfigSource.WithLabelValues(val).Set(1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach, implemented like this, is going to result in memory leak, because old values are not GC'd, as you noted in the comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@loburm @x13n FYI, this is the usecase for string metrics

AssignedNodeConfigSourceKey = "assigned_node_config_source"
// AssignedNodeConfigSourceOkayKey's metric indicates whether the currently assigned config source and the config it contains is valid
AssignedNodeConfigSourceOkayKey = "assigned_node_config_source_okay"
AssignedNodeConfigSourceStatusTrue = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use string values, performance considerations in this case do not outweigh the decreased readability

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, just report a string value in another label key-value pair?

Since prometheus doesn't support string valued metrics:
https://prometheus.io/docs/concepts/metric_types/
prometheus/prometheus#2227

// AssignedNodeConfigSourceKey's metric reports the config source the Kubelet believes is currently assigned to it
AssignedNodeConfigSourceKey = "assigned_node_config_source"
// AssignedNodeConfigSourceOkayKey's metric indicates whether the currently assigned config source and the config it contains is valid
AssignedNodeConfigSourceOkayKey = "assigned_node_config_source_okay"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: IMHO ok is more widely used than okay

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

utillog.Infof("starting controller")

// set metrics when all is said and done
var using string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using is a bad name, is says nothing and spans a huge scope

Also, if you change the logic below later, it may be used uninitialized

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand that global defer at all, why not leave setting metrics local?

@@ -101,7 +122,7 @@ var (
Help: "Interval in microseconds between relisting in PLEG.",
},
)
// Metrics of remote runtime operations.
// Metrics for remote runtime operations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I see a slow convergence towards comments being complete sentences, with full stop at the end and so on.

DevicePluginRegistrationCountKey = "device_plugin_registration_count"
DevicePluginAllocationLatencyKey = "device_plugin_alloc_latency_microseconds"
// Metric keys for node config sources
// AssignedNodeConfigSourceKey's metric reports the config source the Kubelet believes is currently assigned to it
AssignedNodeConfigSourceKey = "assigned_node_config_source"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the assigned_node prefix? It's little big confusing, why not just config_source or dynamic_config_source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three config references you'd want the Kubelet to report:

  1. What it believes its currently assigned source is.
  2. What it believes its last-known-good source is.
  3. Which of the above two it's actually using.

NodeConfigSource is the type of the ConfigSource field on the Node object and is used to point the Kubelet at a config source, hence node_config_source.

Could do:

  1. node_config_source_assigned
  2. node_config_source_last_known_good
  3. node_config_source_using (or some name other than using)

AssignedNodeConfigSourceStatusUnknown = 0
AssignedNodeConfigSourceStatusFalse = -1
// LastKnownGoodNodeConfigSourceKey's metric indicates what the Kubelet is treating as it's last-known-good config source
LastKnownGoodNodeConfigSourceKey = "last_known_good_node_config_source"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to make these hierarchical, that helps, e.g. dynamic_config_last_known_source

@mtaufen
Copy link
Contributor Author

mtaufen commented Jan 23, 2018

Thanks @crassirostris for taking a look at this

@mtaufen
Copy link
Contributor Author

mtaufen commented Jan 23, 2018

/retest

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2018
@mtaufen mtaufen changed the title PoC node config source metrics add dynamic config metrics May 22, 2018
@mtaufen mtaufen removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2018
ActiveConfigKey = "node_config_active"
LastKnownGoodConfigKey = "node_config_last_known_good"
ConfigErrorKey = "node_config_error"
ConfigNameLabelKey = "node_config_name" // this is a fully resolved name, e.g. an API path or a URL, "Name" is short but a little confusing name for this key...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole any naming suggestions for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SelfLink is the URL representing an object in ObjectMeta. Maybe "node_config_self_link"? This should be equivalent to the self link for the configmap, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole and I settled on node_config_source, as it accurately identifies what the corresponding value describes while being reasonably generic ("local", or a SelfLink, or maybe some URL in the future).

@mtaufen
Copy link
Contributor Author

mtaufen commented May 22, 2018

/retest

@mtaufen mtaufen added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 22, 2018
@mtaufen mtaufen added this to the v1.11 milestone May 22, 2018
@mtaufen
Copy link
Contributor Author

mtaufen commented May 22, 2018

/status approved-for-milestone

@mtaufen
Copy link
Contributor Author

mtaufen commented May 22, 2018

/remove-lifecycle stale

@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@crassirostris @dashpole @dchen1107 @mtaufen

Pull Request Labels
  • sig/node: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.
Help

@mtaufen mtaufen force-pushed the kc-metric branch 3 times, most recently from fb81c29 to e951978 Compare May 22, 2018 19:51
@mtaufen
Copy link
Contributor Author

mtaufen commented May 22, 2018

/retest

This PR exports config-releated metrics from the Kubelet.
The Guages for active, assigned, and last-known-good config can be used
to identify config versions and produce aggregate counts across several
nodes. The error-reporting Gauge can be used to determine whether a node
is experiencing a config-related error, and to prodouce an aggregate
count of nodes in an error state.
@dashpole
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2018
@dchen1107
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole, dchen1107, mtaufen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2018
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 64013, 63896, 64139, 57527, 62102). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 731eaec into kubernetes:master May 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/kubelet-api area/monitoring cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants