Add prometheus cluster monitoring addon. #62195

Merged
merged 2 commits into from
Apr 18, 2018

Conversation

serathius
Contributor

@serathius serathius commented Apr 6, 2018

This PR adds a new cluster monitoring addon based on Prometheus.
It adds a Prometheus deployment along with e2e tests.
Additional components will be added iteratively in the future.
The manifests are based on the current Helm chart.
In its current state it is not intended for production use.

cc @piosz @kawych @miekg

Add prometheus cluster monitoring addon to kube-up

/sig instrumentation
/kind feature
/priority important-soon

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 6, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 6, 2018
@serathius serathius force-pushed the prometheus branch 2 times, most recently from 3a42553 to 5d5e7f4 Compare April 6, 2018 14:15
@dims
Member

dims commented Apr 6, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 6, 2018
@@ -2104,14 +2104,17 @@ EOF
prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml"
setup-addon-manifests "addons" "kube-proxy"
fi
if [[ "${ENABLE_CLUSTER_MONITORING:-}" != "none" ]]; then
Contributor

I prefer if [[ "${ENABLE_CLUSTER_MONITORING:-}" == "prometheus" ]]. It makes sense to have fully separated conditions for prometheus and other monitoring systems, because this is the only one that doesn't use heapster. Please add some comments to make this separation clear, e.g. "set up cluster monitoring using prometheus" and "set up cluster monitoring using heapster"

namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: EnsureExists
Contributor

Do we actually intend users to modify this, i.e. other parts than storage request?

@@ -0,0 +1,190 @@
---
apiVersion: v1
kind: ConfigMap
Contributor

Can you include some reference for the format of this config map?

@@ -0,0 +1,190 @@
---
Contributor

nit: please skip unnecessary separator lines like this (all first lines)

@kawych
Contributor

kawych commented Apr 6, 2018

The deployments look fine, I'll take a look at the e2e tests on Monday. Can you split this PR into two commits: deployments and tests?

@@ -0,0 +1,86 @@
---
apiVersion: extensions/v1beta1
kind: Deployment
Contributor

Do you know what Prometheus's CPU/memory usage is, and whether we can rely on the defaults?

Contributor Author

We cannot rely on defaults in kube-system, I will prepare them.

- replacement: kubernetes.default.svc:443
  target_label: __address__
- regex: (.+)
  replacement: /api/v1/nodes/${1}/proxy/metrics
Member

@brancz brancz Apr 9, 2018

I'm not sure this is a good idea to advocate to users. People will look at this and think it is the recommended way to run Prometheus, but in reality it gives the Prometheus pod close to root access to all kubelets, which doesn't seem like a good idea. I would prefer cert- or token-based authN + authZ from the kubelet.

We're discussing how to remove this from the example in the Prometheus repo. tl;dr people are asking for this to stay, as GKE doesn't offer an alternative.

Contributor Author

@serathius serathius Apr 9, 2018

Looks like kube-up has HTTP endpoints enabled for the kubelet. I will use that as a temporary solution and work on authorization in the meantime.

Contributor Author

@brancz Is using unencrypted metric endpoints acceptable for the first version? I plan to support this addon through changes to kubelet metrics. For this PR I wanted to move the current community solution for Prometheus into an addon, and enhance it with e2e tests.
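
For context on what the relabeling rule under review does, here is a minimal sketch in plain Go, not Prometheus code. It assumes the rule's source label holds the node name and that the result is used as the metrics path (neither is visible in the excerpt above); the function name `nodeProxyMetricsPath` is hypothetical. Prometheus anchors relabel regexes at both ends, which the sketch imitates with `^(?:...)$`.

```go
package main

import "regexp"

// nodeProxyMetricsPath is a hypothetical helper that imitates the rule
//   regex: (.+)
//   replacement: /api/v1/nodes/${1}/proxy/metrics
// The whole node name is captured by (.+) and substituted for ${1}.
// Assumption: the input is the node name; the real source label is not
// shown in the excerpt.
func nodeProxyMetricsPath(nodeName string) string {
	// Anchor the pattern the way Prometheus anchors relabel regexes.
	re := regexp.MustCompile(`^(?:(.+))$`)
	// Go's regexp package expands ${1} in the replacement template.
	return re.ReplaceAllString(nodeName, "/api/v1/nodes/${1}/proxy/metrics")
}
```

Combined with the `__address__` replacement above, a node named `node-1` would be scraped through the API server at `/api/v1/nodes/node-1/proxy/metrics`. That requires granting Prometheus access to the `nodes/proxy` resource, which is the privilege the review comment is concerned about.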

@@ -127,6 +127,7 @@ type RCConfig struct {
	ReadinessProbe    *v1.Probe
	DNSPolicy         *v1.DNSPolicy
	PriorityClassName string
	PodAnnotations    map[string]string
Contributor

nit: this is similar to labels, so probably it would fit better after Labels field

return fmt.Sprintf(`sum(QPS{kubernetes_namespace="%s",kubernetes_pod_name=~"%s.*"})`, namespace, podNamePrefix)
}

func retryUntil(predicate func() bool, timeout time.Duration) {
Contributor

Please consider some logical ordering of functions, i.e. move helper methods below test logic.

@kawych
Contributor

kawych commented Apr 12, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2018
@serathius
Contributor Author

Fixed typo in relabeling schema.

@kawych
Contributor

kawych commented Apr 12, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2018
Member

@brancz brancz left a comment

generally looks good, just two suggestions

{}
prometheus.yml: |
  rule_files:
  - /etc/config/rules
Member

why specify rules and alert files if they are then left empty?

Contributor Author

Removed

memory: 10Mi

- name: prometheus-server
image: "prom/prometheus:v2.1.0"
Member

v2.1.0 had a variety of problems, I'd recommend v2.2.1

Contributor Author

Updated

@brancz
Member

brancz commented Apr 12, 2018

What I'm trying to understand is do we exclusively want to use this for the e2e tests of custom metrics or promote this as an official addon? If the latter then I'm not sure I'm comfortable with this. For testing purposes the insecure port is totally fine, but in production environments it's not what I would recommend.

Personally I'd of course like to see this done with the Prometheus Operator 😉 .

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 13, 2018
@kawych
Contributor

kawych commented Apr 16, 2018

@brancz
We want to promote it as an official addon, an alternative to the other monitoring systems implemented by Heapster. We had a discussion about the insecure port; I don't think we came up with a good alternative, but it certainly has to be solved at some point (e.g. do you know how Prometheus Operator handles it?).

This is disabled by default. Can we comment it better to make users aware of the issues? What's your recommendation? I'd prefer to merge this to get e2e tests running.

@brancz
Member

brancz commented Apr 16, 2018

The Prometheus Operator is soon going to implement the TokenRequest API, in order to use tokens for specific audiences. As far as I understand token impersonation is why token auth on kubelets is not enabled by GKE today (RE: #57997). So until TokenRequests are available it won't be possible on GKE. This is somewhat reasonable I guess (although personally I feel the kubelet is higher privileged than Prometheus, so impersonation would not be a security concern, but I don't want to start that discussion here, also I'm happy to be proven wrong, I admit I haven't analyzed the security situation to its fullest).

Eventually I'd prefer to see this Prometheus Operator based as it solves a lot of operational needs of Prometheus (the TokenRequest being only one example, which is unlikely to land in Prometheus itself). Also we maintain a rather exhaustive setup already to perform cluster monitoring, which we have productionized on top of OpenShift and are planning to add support for vanilla Kubernetes as well.

tl;dr I'm ok with this state for now, but I'd prefer that we don't commit to it in the long term, as we already know of its shortcomings, and instead converge on a Prometheus Operator based setup.

(disclaimer I'm one of the maintainers of the Prometheus Operator)

@serathius
Contributor Author

/cc @gmarek @roberthbailey

@kawych
Contributor

kawych commented Apr 17, 2018

/lgtm
@brancz thank you for the explanations. To my knowledge, token auth is going to be enabled (see #58178); as discussed with @serathius, we can move away from the insecure port in a follow-up PR.

@piosz and @serathius should be able to contribute more to the discussion about using Prometheus Operator. I'm not fully aware of the benefits of Prometheus Operator that are already available. @serathius has been investigating this more, e.g. he raised a concern that we may need a wider review of the Prometheus Operator CRDs.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2018
@wojtek-t
Member

I didn't carefully review either the e2e test or the YAML files.
I looked at the glue code, and that looks fine.

/approve no-issue

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, serathius, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2018
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit bb8f58b into kubernetes:master Apr 18, 2018
@brancz
Member

brancz commented Apr 19, 2018

I would like us to discuss in more detail what we want to achieve here. From the sig-instrumentation meeting two weeks ago I was under the impression that all we wanted to do was a very simple setup, purely to validate in e2e tests that Kubernetes SD and other integrations like the custom metrics monitoring pipeline are not totally broken. A fully fledged cluster monitoring addon is another story. I would like us to reconsider this.

@serathius serathius deleted the prometheus branch July 11, 2020 12:05
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
8 participants