
[FEATURE] instance-manager compatibility with Cluster Autoscaler #2203

Closed

Description

@drewhemm

Is your feature request related to a problem? Please describe.
When using Cluster Autoscaler, nodes are not removed from the cluster if they have bare pods; an annotation can be added to pods to let CA know it is safe to evict them:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"

A node that has no other workloads running should be a candidate for removal from the cluster, but alas:

I0121 16:44:11.071945       1 cluster.go:93] Fast evaluation: ip-10-xxx-xx-xxx.eu-west-1.compute.internal for removal
I0121 16:44:11.071966       1 cluster.go:107] Fast evaluation: node ip-10-xxx-xx-xxx.eu-west-1.compute.internal cannot be removed: longhorn-system/instance-manager-e-1c19a568 is not replicated

Describe the solution you'd like
The option to specify annotations to be added to instance-manager pods would solve this issue. When installing Longhorn via Helm, there is the option to provide annotations for the ingress in the values.yaml - similar functionality would be desirable.
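For example, something along these lines in values.yaml (a sketch only; the instanceManager.podAnnotations key is hypothetical and does not exist in the chart today, it simply mirrors the existing ingress annotations option):

# Hypothetical values.yaml snippet - the key name is illustrative, not an existing chart option
instanceManager:
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"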

Describe alternatives you've considered
A mutating webhook might be an option, but that is a lot of work.

Additional context
Without this functionality, it is not possible to use Longhorn in conjunction with Cluster Autoscaler, as CA will be blocked from removing any nodes from the cluster.

Activity

drewhemm (Author) commented on Jan 21, 2021
I tried to find the source code for where the pod spec is defined, but could not find it.

joshimoo (Contributor) commented on Jan 21, 2021

If you want to dynamically remove nodes from the cluster, you should use dedicated storage nodes; otherwise you risk removing the last replica of your data.

https://longhorn.io/docs/1.1.0/volumes-and-nodes/storage-tags/
https://longhorn.io/docs/1.1.0/advanced-resources/default-disk-and-node-config/
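For example, replicas can be pinned to tagged storage nodes with a StorageClass along these lines (a sketch; the tag name "storage" and the class name are placeholders, see the storage-tags doc above for how tags are applied to nodes and disks):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-storage-nodes   # placeholder name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  nodeSelector: "storage"        # only nodes carrying this Longhorn tag will hold replicas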

We implement a PodDisruptionBudget to prevent eviction of the last healthy replica, precisely to stop a user from doing that.
We cannot add the annotation above, since it is not safe to delete a node unless we know other nodes hold a healthy replica of the data.
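For reference, the PDB Longhorn creates for each instance-manager pod looks roughly like this (a sketch; the exact label set comes from GetInstanceManagerLabels, and the name and label values here are illustrative):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: instance-manager-e-xxxxxxxx   # one PDB per instance-manager pod
  namespace: longhorn-system
spec:
  minAvailable: 1                     # the single matching pod can never be voluntarily evicted
  selector:
    matchLabels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-type: engine
      longhorn.io/node: ip-10-xxx-xx-xxx.eu-west-1.compute.internal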

self-assigned this on Jan 21, 2021
drewhemm (Author) commented on Jan 21, 2021

Thanks for the quick reply. The problem is not the nodes being used to host the replicas, it's all the other nodes in the cluster - every one of them gets a Longhorn pod, which I assume is there to enable them to mount volumes for the pods they are hosting. The instance-manager-e-* pods are the problem, not the instance-manager-r-* ones.

My 'longhorn' nodes are static, but I want every other node to be able to dynamically come and go when needed.

munjalpatel commented on Jan 22, 2021

I agree with @drewhemm. I am facing the exact same problem in my AKS cluster since I introduced Longhorn a week ago.

After spending an entire day troubleshooting why the autoscaler does not mark a node for deletion when it is running nothing but instance-manager-e-* and instance-manager-r-* pods, we were able to pinpoint the root cause to these two Longhorn pods from the following autoscaler logs:

longhorn-system/instance-manager-r-f5b7da98 and  longhorn-system/instance-manager-e-698e6cf0 are not replicated


drewhemm (Author) commented on Jan 22, 2021

I have come across a project called kube-annotate, which may provide a short-term solution. It uses a mutating admission controller to perform rule-based annotation of pods; I should be able to enforce annotation of the instance-manager-e-* pods with that. I would prefer it if this functionality were available without the use of a third-party app, though.

Is there a particular reason bare pods are used instead of a DaemonSet?

munjalpatel commented on Jan 26, 2021

Folks, I would really appreciate some help here. We really want (need) to use Longhorn.

However, this issue is significant since it completely prevents the autoscaler from scaling down.
We have a large cluster where proper functioning of the autoscaler is vital for cost optimization and has a significant material impact on the business.

I am fairly sure that as more folks start utilizing the amazing powers of Longhorn, they will very likely run into this issue as well.

joshimoo (Contributor) commented on Jan 27, 2021

@munjalpatel please ensure you use always-available dedicated storage nodes if you plan on dynamically adding/removing nodes from your cluster. See here for how to set this up: #2203 (comment)

For the moment you can use the workaround posted by @drewhemm here: #2203 (comment)

When allowing the autoscaler to remove nodes, please do not remove instance-manager-r-* pods, since that can potentially lead to data loss; that is why it's important to use always-available dedicated storage nodes for this type of setup.

drewhemm (Author) commented on Jan 27, 2021

Hi @munjalpatel, I am working on a fork of kube-annotate, which we will use to work around this issue:

https://github.com/footprintmediaits/kube-annotate/tree/refactor-k8s-1.18

Will post back here once it is ready for use. There were some issues in the original repo, plus the code was outdated and not working with more recent versions of k8s.

drewhemm (Author) commented on Jan 27, 2021

Well this is highly frustrating! We have got kube-annotate working, so that it adds the necessary annotation to the instance-manager-e pods:

$ k describe pod instance-manager-e-086cec4d -n longhorn-system 
Name:                 instance-manager-e-086cec4d
Namespace:            longhorn-system
Priority:             1000000000
Priority Class Name:  longhorn-node-critical
Node:                 ip-10-xxx-xx-xxx.eu-west-1.compute.internal/10.xxx.xx.xxx
Start Time:           Wed, 27 Jan 2021 14:30:36 +0000
Labels:               longhorn.io/component=instance-manager
                      longhorn.io/instance-manager-image=imi-a7211c27
                      longhorn.io/instance-manager-type=engine
                      longhorn.io/node=ip-10-xxx-xx-xxx.eu-west-1.compute.internal
Annotations:          cluster-autoscaler.kubernetes.io/safe-to-evict: true

...but now we hit the pod disruption budget - there is one created for each pod:

I0127 15:03:29.833597       1 cluster.go:107] Fast evaluation: node ip-10-xxx-xx-xxx.eu-west-1.compute.internal cannot be removed: not enough pod disruption budget to move longhorn-system/instance-manager-e-4d60be69
$ k get pdb -n longhorn-system
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-086cec4d   1               N/A               0                     31m
instance-manager-e-28bdfdc4   1               N/A               0                     31m
instance-manager-e-3a2f58a6   1               N/A               0                     8m50s
instance-manager-e-4d60be69   1               N/A               0                     8m41s
instance-manager-e-517b42e1   1               N/A               0                     31m
instance-manager-e-58ee7223   1               N/A               0                     31m
instance-manager-e-59f3cdbf   1               N/A               0                     31m
instance-manager-e-5bbb6c47   1               N/A               0                     31m
instance-manager-e-6c38fcd9   1               N/A               0                     31m
instance-manager-e-7bc17525   1               N/A               0                     30m
instance-manager-e-9f3d2f20   1               N/A               0                     31m

I must ask again: why are these deployed as bare pods instead of using a DaemonSet?

drewhemm (Author) commented on Jan 27, 2021

Okay, I managed to resolve this. I modified the generateInstanceManagerPDBManifest function to set the minAvailable for the instance-manager-e pods to 0:

func (imc *InstanceManagerController) generateInstanceManagerPDBManifest(im *longhorn.InstanceManager) *policyv1beta1.PodDisruptionBudget {
+       var minAvailable int32
+       switch im.Spec.Type {
+       case types.InstanceManagerTypeEngine:
+               minAvailable = 0
+       case types.InstanceManagerTypeReplica:
+               minAvailable = 1
+       }
+
        return &policyv1beta1.PodDisruptionBudget{
                ObjectMeta: metav1.ObjectMeta{
                        Name:      imc.getPDBName(im),
@@ -636,7 +644,7 @@ func (imc *InstanceManagerController) generateInstanceManagerPDBManifest(im *lon
                        Selector: &metav1.LabelSelector{
                                MatchLabels: types.GetInstanceManagerLabels(imc.controllerID, im.Spec.Image, im.Spec.Type),
                        },
-                       MinAvailable: &intstr.IntOrString{IntVal: 1},
+                       MinAvailable: &intstr.IntOrString{IntVal: minAvailable},
                },
        }
 }
$ k get pdb -n longhorn-system                                                                                                   
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-1da759a2   0               N/A               1                     7m20s
instance-manager-e-1f738b53   0               N/A               1                     7m15s
instance-manager-e-3bff3057   0               N/A               1                     7m10s
instance-manager-e-3fc08fa7   0               N/A               1                     7m19s
instance-manager-e-4e360a47   0               N/A               1                     7m11s
instance-manager-e-58203680   0               N/A               1                     7m7s
instance-manager-e-64701ec6   0               N/A               1                     7m17s
instance-manager-e-8dca7df2   0               N/A               1                     7m18s
instance-manager-e-9d3d5270   0               N/A               1                     7m20s
instance-manager-e-9f04dedc   0               N/A               1                     7m22s
instance-manager-e-a1db56ad   0               N/A               1                     7m17s
instance-manager-e-ca1649ac   0               N/A               1                     7m1s
instance-manager-e-cba916da   0               N/A               1                     7m19s
instance-manager-e-cea9a16f   0               N/A               1                     7m20s
instance-manager-e-dcea3dd2   0               N/A               1                     7m21s
instance-manager-e-e733cb65   0               N/A               1                     7m24s
instance-manager-e-e7406741   0               N/A               1                     7m21s
instance-manager-e-fc672c06   0               N/A               1                     7m21s
instance-manager-r-22bbd375   1               N/A               0                     7m1s
instance-manager-r-68329745   1               N/A               0                     7m7s
instance-manager-r-9a9ff2dd   1               N/A               0                     7m12s

In conjunction with kube-annotate, this allows CA to remove the nodes hosting these pods when no other workloads require the capacity:

I0127 22:30:29.675899       1 cluster.go:93] Fast evaluation: ip-10-xxx-xx-xxx.eu-west-1.compute.internal for removal
I0127 22:30:29.676532       1 cluster.go:124] Fast evaluation: node ip-10-xxx-xx-xxx.eu-west-1.compute.internal may be removed

Will fork and raise a PR for this tomorrow. In the meantime, it is possible to make that change, build the longhorn-manager image and use it instead of the published one.

drewhemm (Author) commented on Jan 27, 2021

PR: longhorn/longhorn-manager#823

I also found where the annotations are defined for the instance-manager pods:

https://github.com/longhorn/longhorn-manager/blob/f77e4949cf16f72adaf885fc1c89cf1eeb6a1788/controller/instance_manager_controller.go#L818

Annotations:     map[string]string{types.GetLonghornLabelKey(types.LastAppliedTolerationAnnotationKeySuffix): string(tolerationsByte)},

Would it be acceptable to add the CA annotation here for compatibility, rather than using kube-annotate? Only for the instance-manager-e pods, naturally.
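Something along these lines, perhaps (a rough sketch, not a tested patch; the type check reuses the same constants as the PDB change above):

// Sketch only: build the pod annotations as today, then add the CA annotation
// for engine-type instance managers. Replica managers are left untouched so
// they stay protected from eviction.
annotations := map[string]string{
	types.GetLonghornLabelKey(types.LastAppliedTolerationAnnotationKeySuffix): string(tolerationsByte),
}
if im.Spec.Type == types.InstanceManagerTypeEngine {
	// Engine pods hold no volume data, so letting CA evict them should be safe.
	annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "true"
}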

75 remaining items


Metadata

Labels

  • area/cloud: Public cloud support related
  • area/install-uninstall-upgrade: Install, Uninstall or Upgrade related
  • component/longhorn-manager: Longhorn manager (control plane)
  • experimental: Alpha
  • highlight: Important feature/issue to highlight
  • kind/feature: Feature request, new feature
  • priority/0: Must be implemented or fixed in this release (managed by PO)
