[Federation] Implement leader election for controller-manager #46090
Conversation
/assign @nikhiljindal
Sorry, I haven't had a chance to go through the design yet. If there is anyone else who feels comfortable reviewing this, please go ahead.
		return
	} else if !send {
		glog.V(5).Infof("Skipping cluster: %s for namespace: %s reason: cluster selectors do not match: %-v %-v", cluster.Name, namespace, cluster.ObjectMeta.Labels, desiredNamespace.ObjectMeta.Annotations[federationapi.FederationClusterSelectorAnnotation])
		continue
This causes you to skip over the switch statement below, and in particular the "case found && !send" so you do not end up deleting the namespace in an underlying cluster when you should.
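For illustration, a minimal sketch of the control flow being discussed, assuming the loop decides per cluster based on a (found, send) pair; the names and cases below are assumptions for illustration, not the actual federation namespace controller code:

```go
package main

import "fmt"

// syncAction shows why an early `continue` on !send is wrong: the
// (found && !send) branch is the one that deletes the namespace from an
// underlying cluster, and it is never reached if we skip the switch.
func syncAction(found, send bool) string {
	switch {
	case found && send:
		return "update namespace in cluster"
	case found && !send:
		return "delete namespace from cluster" // skipped if we `continue` before the switch
	case !found && send:
		return "create namespace in cluster"
	default:
		return "nothing to do"
	}
}

func main() {
	// Selector no longer matches a cluster that already has the namespace:
	// the namespace should be deleted there, not skipped.
	fmt.Println(syncAction(true, false))
}
```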
oops, a mistake. will fix. Thanks for catching this.
Force-pushed from 14a9426 to e55d757
@@ -54,6 +58,13 @@ import (
	"k8s.io/client-go/discovery"
)

const (
	apiserverWaitTimeout = 30 * 2 * time.Second
nit: Why not just 60 seconds, or one minute?
I kind of forgot why I did it like that. Will change.
@@ -54,6 +58,13 @@ import (
	"k8s.io/client-go/discovery"
)

const (
	apiserverWaitTimeout   = 30 * 2 * time.Second
	apiserverRetryInterval = 1 * time.Second
This seems very short, and requires exponential backoff. There are standard libraries to do the right thing here.
We should be initializing the controller-manager as early as possible. The controller-manager depends on the API server, and both are started independently, so either one may start functioning before the other. If we introduce exponential backoff, I am worried it may induce unnecessary delay in initializing the controller-manager. Also, this is not on the critical path; we are just in the initialization stage.
I would increase apiserverRetryInterval to 2 seconds and apiserverWaitTimeout to 2 minutes. Is this fine?
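For reference, a minimal sketch of what the suggested values look like with the standard polling helper from k8s.io/apimachinery; this illustrates the suggestion rather than the code in this PR, and waitForAPIServer is a hypothetical helper name:

```go
package startup

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

const (
	apiserverRetryInterval = 2 * time.Second
	apiserverWaitTimeout   = 2 * time.Minute
)

// waitForAPIServer polls the discovery endpoint until the API server answers
// or the timeout expires. wait.PollImmediate handles the retry loop, so no
// hand-rolled sleep or backoff is needed.
func waitForAPIServer(dc discovery.DiscoveryInterface) error {
	return wait.PollImmediate(apiserverRetryInterval, apiserverWaitTimeout, func() (bool, error) {
		if _, err := dc.ServerVersion(); err != nil {
			return false, nil // not ready yet, keep retrying until the timeout
		}
		return true, nil
	})
}
```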
ok
	apiserverWaitTimeout   = 30 * 2 * time.Second
	apiserverRetryInterval = 1 * time.Second

	federationOnlyNamespace = "federation"
This needs to be configurable, and should be an explicitly reserved name. As it stands right now, if the user has a namespace called federation in any of their underlying clusters, these will be automatically deleted (because what you're saying here is that you only want it in the federation, not in any of the underlying clusters).
Agree with your comment; I will make it configurable. Also, what should the default name for such a namespace be? Is the name federation fine, or should I change it to something like federation-only?
Yes, I think federation-only is fine as a default.
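A sketch of what the configurable option could look like, using spf13/pflag as elsewhere in Kubernetes; the flag name, struct, and default below are assumptions for illustration, not the exact option added by the PR:

```go
package options

import "github.com/spf13/pflag"

// Options holds the hypothetical federation-only namespace setting discussed
// above; the real controller-manager options struct has many more fields.
type Options struct {
	FederationOnlyNamespace string
}

// AddFlags wires the setting to a command-line flag with "federation-only"
// as the reserved default name.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	fs.StringVar(&o.FederationOnlyNamespace, "federation-only-namespace", "federation-only",
		"Namespace reserved for federation control-plane objects; it exists only in the federation API server and is never propagated to member clusters.")
}
```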
		glog.Fatalf("Invalid API configuration: %v", err)
	}

	if err := createFederationNamespace(federationClientset, federationOnlyNamespace); err != nil {
You should only do this if the user has requested an HA federation control plane, otherwise you have an unnecessary namespace hanging around, for no reason.
Agree with the concern, but even in the case of a non-HA federation control plane it serves one purpose: waiting for the API server and establishing that it is working fine. In the current scenario, without this, the federation-controller-manager dies and another pod is brought up. This is documented in issue #44490.
Now creating the federation-only namespace only in the HA scenario. PTAL.
ok, thanks.
@@ -236,3 +256,33 @@ func hasRequiredResources(serverResources []*metav1.APIResourceList, requiredRes
	}
	return true
}

func createFederationNamespace(clientset *federationclientset.Clientset, namespace string) error {
This should be called "ensureFederationNamespace" because that's what it does. If the namespace already exists, it does nothing, otherwise it tries to create it.
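A sketch of the renamed, idempotent helper; it is written here against a plain client-go clientset and today's API (context plus CreateOptions) rather than the federation clientset used by the PR:

```go
package startup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureFederationNamespace creates the namespace if it is missing and
// treats "already exists" as success, which is what the name promises.
func ensureFederationNamespace(ctx context.Context, client kubernetes.Interface, namespace string) error {
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: namespace}}
	_, err := client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // someone else created it first; that is fine
	}
	return err
}
```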
Agree, will change
Thanks.
	if !s.LeaderElection.LeaderElect {
		run(nil)
		panic("unreachable")
I don't understand why this is needed. It's already in run()?
Agree, it's overkill. Will remove it.
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() {
			glog.Fatalf("leaderelection lost")
This is normal, expected behaviour, right? In that case we should not be logging a fatal error here.
So you die if you lose the leader election here? Are you relying on Kubernetes to automatically restart this pod to try again? This seems to introduce potential delays, and it also seems brittle. What if Kubernetes decides to stop restarting this pod because it has failed too often? And what if, because this pod is dead, Kubernetes decides to schedule another pod on this node and uses up all the resources? In summary, I don't understand the design for how leader election is supposed to work. Can you point me to the design doc?
If the instance which is the leader loses the leader election here, it dies and restarts; presumably it has hit some issue that prevented it from renewing its leadership lease. Restarting should also be fine, since there are standby instances which can acquire the lease as soon as it becomes available.
@quinton-hoole, there was no design doc for leader election in the controller-manager, as it was thought to be trivial; there are already multiple examples of how to implement it, for example the one in kube-controller-manager.
However, there is a Federation control plane HA design which we need to revisit in v1.8 and get community acceptance on: https://docs.google.com/document/d/18kB5W401WBhEjUOOXQnBb-nx3bFXK8pOCHbMbdXuBMg/edit?usp=sharing
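For context, a minimal sketch of the standard controller-manager leader-election pattern referred to above, using today's client-go leaderelection package with a Lease lock; the PR predates this lock type, so the lock name, namespace, and durations are illustrative assumptions, not the merged code:

```go
package startup

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks until this replica wins the election, runs
// the controllers, and exits the process if the lease is lost; a standby
// replica (restarted by its ReplicaSet if needed) then takes over.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, run func(context.Context)) {
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "federation-controller-manager", Namespace: "federation-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Dying on lost leadership is intentional: the pod is
				// restarted and rejoins the election as a standby.
				klog.Fatalf("leader election lost")
			},
		},
	})
}
```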
OK, that makes sense. If the kubelet doesn't restart the pod, then the ReplicaSet controller will, perhaps on a different node. Sorry, I was being dumb.
@shashidharatd Is the plan still to merge this, or can it be closed?
Force-pushed from e55d757 to 47f9745
@quinton-hoole, @marun, I have reworked and addressed the comments. Can you PTAL?
/lgtm
/retest
Looks like an incompatible change is in the submit-queue and compilation is failing in batch mode, so I am removing the lgtm label to allow other PRs to merge; I will rebase and fix the issue later. /lgtm cancel
/retest
@krzyzacy, pull-kubernetes-verify keeps failing on a build timeout of 60 minutes. As far as I can see, this job is now configured with a timeout of 75 minutes. Any idea why it is still timing out at 60 minutes?
/test pull-kubernetes-verify
ah
Refreshed the config manually, will drop another retest.
/test pull-kubernetes-verify
Force-pushed from 82e84f8 to e6b54b6
/test pull-kubernetes-verify
@irfanurrehman, can you please lgtm this PR? I just rebased after the previous lgtm from @quinton-hoole for removing the flag in
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: irfanurrehman, quinton-hoole, shashidharatd
Associated issue: 44631
The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
What this PR does / why we need it:
Ref: #44283
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #44490
Special notes for your reviewer:
This PR also fixes issue #44490, which is about a delay in initializing the controller-manager due to unavailability of the API server.
Release note:
/cc @kubernetes/sig-federation-pr-reviews