-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prometheus cluster monitoring addon. #62195
Conversation
3a42553
to
5d5e7f4
Compare
/ok-to-test |
cluster/gce/gci/configure-helper.sh
Outdated
@@ -2104,14 +2104,17 @@ EOF | |||
prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml" | |||
setup-addon-manifests "addons" "kube-proxy" | |||
fi | |||
if [[ "${ENABLE_CLUSTER_MONITORING:-}" != "none" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer if [[ "${ENABLE_CLUSTER_MONITORING:-}" == "prometheus" ]]
. It makes sense to have fully separated conditions for prometheus and other monitoring systems, because this is the only one that doesn't use heapster. Please add some comments to make this separation clear, e.g. "set up cluster monitoring using prometheus" and "set up cluster monitoring using heapster"
namespace: kube-system | ||
labels: | ||
kubernetes.io/cluster-service: "true" | ||
addonmanager.kubernetes.io/mode: EnsureExists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually intend users to modify this, i.e. other parts than storage request?
@@ -0,0 +1,190 @@ | |||
--- | |||
apiVersion: v1 | |||
kind: ConfigMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you include some reference for the format of this config map?
@@ -0,0 +1,190 @@ | |||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please skip unnecessary separator lines like this (all first lines)
The deployments look fine, I'll take a look at the e2e tests on Monday. Can you split this PR to two commits: deplyments and tests? |
@@ -0,0 +1,86 @@ | |||
--- | |||
apiVersion: extensions/v1beta1 | |||
kind: Deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what is prometheus cpu/memory usage and whether we can rely on defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot rely on defaults in kube-system, I will prepare them.
- replacement: kubernetes.default.svc:443 | ||
target_label: __address__ | ||
- regex: (.+) | ||
replacement: /api/v1/nodes/${1}/proxy/metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a good idea to advocate for to users. People will look at this and think this is the recommended way to run this, but in reality it's giving close to root access to the Prometheus pod to all kubelets, that doesn't seem like a good idea. I would prefer cert or token based authN + authZ from the kubelet.
We're discussing how to remove this from the example in the Prometheus repo. tl;dr people are asking this to stay as GKE doesn't have another possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like kube-up has http endpoints enabled for kubelet. I will use it as temporary solution and work on authorization in meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brancz Is using unencrypted metric endpoints acceptable for first version? I plan to support this addon through changes into kubelet metrics. For this PR I wanted to move current community solution for prometheus into addon, but enhance it with e2e tests.
test/utils/runners.go
Outdated
@@ -127,6 +127,7 @@ type RCConfig struct { | |||
ReadinessProbe *v1.Probe | |||
DNSPolicy *v1.DNSPolicy | |||
PriorityClassName string | |||
PodAnnotations map[string]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is similar to labels, so probably it would fit better after Labels field
return fmt.Sprintf(`sum(QPS{kubernetes_namespace="%s",kubernetes_pod_name=~"%s.*"})`, namespace, podNamePrefix) | ||
} | ||
|
||
func retryUntil(predicate func() bool, timeout time.Duration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider some logical ordering of functions, i.e. move helper methods below test logic.
/lgtm |
Fixed typo in relabeling schema. |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good, just two suggestions
{} | ||
prometheus.yml: | | ||
rule_files: | ||
- /etc/config/rules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why specify rules and alert files if they are then left empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
memory: 10Mi | ||
|
||
- name: prometheus-server | ||
image: "prom/prometheus:v2.1.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v2.1.0 had a variety of problems, I'd recommend v2.2.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
What I'm trying to understand is do we exclusively want to use this for the e2e tests of custom metrics or promote this as an official addon? If the latter then I'm not sure I'm comfortable with this. For testing purposes the insecure port is totally fine, but in production environments it's not what I would recommend. Personally I'd of course like to see this done with the Prometheus Operator 😉 . |
@brancz This is disabled by default. Can we comment it better to make users aware of the issues? What's your recommendation? I'd prefer to merge this to get e2e tests running. |
The Prometheus Operator is soon going to implement the TokenRequest API, in order to use tokens for specific audiences. As far as I understand token impersonation is why token auth on kubelets is not enabled by GKE today (RE: #57997). So until TokenRequests are available it won't be possible on GKE. This is somewhat reasonable I guess (although personally I feel the kubelet is higher privileged than Prometheus, so impersonation would not be a security concern, but I don't want to start that discussion here, also I'm happy to be proven wrong, I admit I haven't analyzed the security situation to its fullest). Eventually I'd prefer to see this Prometheus Operator based as it solves a lot of operational needs of Prometheus (the TokenRequest being only one example, which is unlikely to land in Prometheus itself). Also we maintain a rather exhaustive setup already to perform cluster monitoring, which we have productionized on top of OpenShift and are planning to add support for vanilla Kubernetes as well. tl;dr I'm ok with this state for now, but I'd prefer if we don't commit to this in the long term as we already know of shortcomings of this and converge to a Prometheus Operator based setup. (disclaimer I'm one of the maintainers of the Prometheus Operator) |
/lgtm @piosz and @serathius should be able to contribute more to discussion about using Prometheus Operator. I'm not fully aware of the benefits of Prometheus Operator that are already available, @serathius has been investigating this more, i.e. he raised a concern that we may need some wider review of Prometheus Operator CRDs. |
I didn't carefully review neither e2e test not those yaml. /approve no-issue |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kawych, serathius, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
I would like us to discuss in more detail what we want to achieve here. From the sig-instrumentation meeting two weeks ago it I was under the impression that all we wanted to do is a very simple setup purely to validate in e2e tests that Kubernetes SD and other integrations like the custom metrics monitoring pipeline are not totally broken. A fully fledged cluster monitoring addon is another story. I would like us to reconsider this. |
This PR adds new cluster monitoring addon based on prometheus.
It adds prometheus deployment with e2e tests.
Additional components will be added iterativly in future.
Manifests based on current Helm chart.
At current state it's not intended for production use.
cc @piosz @kawych @miekg
/sig instrumentation
/kind feature
/priority important-soon