Add self anti-affinity to kube-dns pods #57683
Conversation
@@ -92,6 +92,16 @@ spec:
      configMap:
        name: kube-dns
        optional: true
+      affinity:
+        podAntiAffinity:
+          requiredDuringSchedulingIgnoredDuringExecution:
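The hunk is collapsed here; for reference, a required self-anti-affinity rule of this shape typically looks like the sketch below. The label selector and topology key are assumptions based on the conventional kube-dns labels, not copied from the PR.

```yaml
# Sketch only: assumes the standard k8s-app: kube-dns label and per-node spreading.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values: ["kube-dns"]
      # No two matching pods may be scheduled onto the same node (hostname).
      topologyKey: kubernetes.io/hostname
```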
Definitely shouldn't be required. We shouldn't block scheduling if for some reason (resources or otherwise) we can't meet the affinity requirements. Can you swap this for preferredDuringSchedulingIgnoredDuringExecution?
Hmm. I'll try to explain my reasoning behind required. I started looking at this since I ran into a situation where all three replicas of kube-dns were running on the same node. That node failed, taking the cluster with it.
From the reliability POV there is never a reason to run multiple replicas of kube-dns on the same node (running one and having the rest pending is the same as running many on one node, as far as reliability is concerned).
The other POV is probably performance, and for that the underlying node doesn't matter much. For a perfect result we'd need "all replicas cannot be on the same node", but I don't think it's possible to specify that.
I'll let @bowei and @MrHohn weigh in, but neither POV is worth blocking scheduling IMO. The auto-scaler should pick the right number of replicas. Preferring not to schedule them on the same node makes sense. Refusing to schedule them because there is an existing replica already there doesn't make kube-dns more resilient. preferredDuringSchedulingIgnoredDuringExecution will not schedule two kube-dns pods on the same node unless there is no other node available to satisfy the affinity rule.
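For comparison, a minimal sketch of the preferred form being suggested (again assuming the standard k8s-app: kube-dns label; the weight is illustrative, and preferred terms wrap the affinity term in a weighted entry):

```yaml
# Sketch only: weight and selector are assumptions, not taken from the PR.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values: ["kube-dns"]
        # Prefer, but do not require, spreading across nodes.
        topologyKey: kubernetes.io/hostname
```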
In my experience it is quite common for a densely populated cluster to momentarily be in a state where pods can only fit on the "wrong" node. Based on this I'm not sure preferred is a complete fix for this problem.
I would prefer to use this patch, which is from the previous release cycle. Also, there are a lot of comments there about why preferred should be used rather than required. You will also need to make a corresponding change to kubeadm.
@bowei could you please elaborate a bit more on why you prefer preferred over required? I (#57683 (comment)) and others (#57683 (comment), https://blog.openai.com/scaling-kubernetes-to-2500-nodes/#kubemasters) don't see it this way, and I would like to understand the reasoning. The linked PR also doesn't seem to include any details.
/ok-to-test
This was brought up before, but the anti-affinity scheduler logic is not scalable to a large number of nodes at the moment, which means we cannot apply the feature to kube-dns by default. When that is fixed, we can add this.
@MrHohn Yes. The PR is merged and now the head has the optimization for affinity/anti-affinity.
@cblecker From previous discussion I think folks agreed on using preferred.
I would agree with @vainu-arto that required is preferable. We use autoscaling quite heavily and from time to time end up with both kube-dns pods on the same node. What would be the advantage of using preferred?
IMO, BTW, credit for implementing the anti-affinity optimization goes to @misterikkit.
Using "required" will result in a behavior change: if you have less than # nodes < requested replicas, you will have DNS pods that cannot be scheduled. There may be reasons to scale replicas for performance as well as fault tolerance reasons. It would be more prudent to first go to preferred first and see what the user experience is than to jump to required immediately. |
I agree, @bowei. @vainu-arto Can you make this change to the PR? Thanks!
The choice of preferred vs required is basically a choice of which users to prioritize. Do you prioritize the users who do frequent autoscaling / rolling updates (and may end up with DNS stacked on a single node), or do you prioritize users that need more than one kube-dns per node for performance reasons? Is there any sense of which of these groups of users is actually more prominent? My gut tells me group 1 (autoscaling) is far more common than group 2 (performance), and in the use cases where performance is the bottleneck some fine tuning would be needed anyway. From my perspective there's little value in merging this unless it keeps required.
This reverts commit 607c3d6.
How about this: a revert of the revert of the previous attempt by @StevenACoffman to add preferred anti-affinity to kube-dns pods? I still think preferred isn't strong enough for this case, but I'll try to start here in the interest of not letting the perfect be the enemy of the good (or at least better than no anti-affinity at all).
Since autoscaling is a common argument in this thread: pod affinity/anti-affinity is absolutely horrible for Cluster Autoscaler performance, and we recommend our users never use it in large clusters. The MatchInterPodAffinity predicate is ~100 times slower than all other predicates combined, even if no pods actually use pod affinity/anti-affinity (kubernetes/autoscaler#257)*. Because of this we completely disable the predicate if there are no pods using pod affinity/anti-affinity in the cluster. Even a single pod (kube-dns) using it will force us to actually check it and take the associated massive performance hit. This is only a case for …

We need to come up with some long-term solution for pod affinity performance, but that's a separate discussion. I'll create a separate issue for that.

====

*) This was measured before the performance improvements done in #54189, but I doubt they will help us all that much. The fundamental issue is that, contrary to the scheduler, CA simulates different clusters all the time, so it cannot precompute and maintain information about the anti-affinity of already running pods (more precisely: either the satisfiesExistingPodsAntiAffinity method kills us because we don't have predicateMeta, or we have to calculate predicateMeta for different simulated clusters all the time, with the same result). After a quick glance at recent changes I don't see anything that would address this problem.
/retest
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, vainu-arto

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing /approve in a comment.
/retest
Review the full test history for this PR. Silence the bot with an
5 similar comments
@vainu-arto: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Automatic merge from submit-queue (batch tested with PRs 57683, 59116, 58728, 59140, 58976). If you want to cherry-pick this change to another branch, please follow the instructions here.
Sadly, this struck our large-cluster scalability again - #54164 (comment)
…affinity

Automatic merge from submit-queue (batch tested with PRs 59158, 38320, 59059, 55516, 59357). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Add self anti-affinity to kube-dns pods"

Reverts #57683
Fixes #54164

/cc @wojtek-t
cc @bsalamat @misterikkit @bowei @MrHohn
Otherwise the "no single point of failure" setting doesn't actually work (a single node failure can still take down the entire cluster).
Fixes #40063