Description
Is your feature request related to a problem? Please describe.
When using Cluster Autoscaler, nodes are not removed from the cluster if they have bare pods; an annotation can be added to pods to let CA know it is safe to evict them:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
A node that has no other workloads running should be a candidate for removal from the cluster, but alas:
I0121 16:44:11.071945 1 cluster.go:93] Fast evaluation: ip-10-xxx-xx-xxx.eu-west-1.compute.internal for removal
I0121 16:44:11.071966 1 cluster.go:107] Fast evaluation: node ip-10-xxx-xx-xxx.eu-west-1.compute.internal cannot be removed: longhorn-system/instance-manager-e-1c19a568 is not replicated
Describe the solution you'd like
The option to specify annotations to be added to the instance-manager pods would solve this issue. When installing Longhorn via Helm, there is the option to provide annotations for the ingress in values.yaml; similar functionality for the instance-manager pods would be desirable.
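As a rough sketch of what such a chart option could look like (the instanceManager.annotations key below is hypothetical, not an existing Longhorn chart value):

```yaml
# hypothetical values.yaml snippet
instanceManager:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```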
Describe alternatives you've considered
A mutating webhook might be an option, but that is a lot of work.
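For a sense of why this is a lot of work: below is a sketch of just the webhook registration, assuming a self-hosted mutating webhook service (all names are placeholders); on top of this you would still need to run the webhook server itself and manage its TLS certificates.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-annotator                     # placeholder name
webhooks:
  - name: annotate.pods.example.com       # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore                 # do not block pod creation if the webhook is down
    clientConfig:
      service:
        name: pod-annotator               # a service you would have to deploy yourself
        namespace: longhorn-system
        path: /mutate
      caBundle: ""                        # base64-encoded CA cert for the webhook's serving certificate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]
```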
Additional context
Without this functionality, it is not possible to use Longhorn in conjunction with Cluster Autoscaler, as CA will be blocked from removing any nodes from the cluster.
Activity
drewhemm commented on Jan 21, 2021
I tried to find the source code for where the pod spec is defined, but could not find it.
joshimoo commented on Jan 21, 2021
If you want to dynamically remove nodes from the cluster, you should use dedicated storage nodes; otherwise you risk removing the last replica of your data.
https://longhorn.io/docs/1.1.0/volumes-and-nodes/storage-tags/
https://longhorn.io/docs/1.1.0/advanced-resources/default-disk-and-node-config/
We implement a PodDisruptionBudget to prevent eviction of the last healthy replica, precisely to stop a user from doing that.
We cannot add the above annotation, since it is not safe to delete nodes unless we know there are other nodes with a healthy replica of the data.
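For context, such a PDB looks roughly like the sketch below: with minAvailable: 1 and a selector matching a single replica instance manager pod, any voluntary eviction of that pod is refused (the name and label key here are illustrative, not necessarily the exact ones Longhorn generates).

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: instance-manager-r-1c19a568       # illustrative name
  namespace: longhorn-system
spec:
  minAvailable: 1                          # the pod may never be voluntarily evicted
  selector:
    matchLabels:
      longhorn.io/instance-manager-name: instance-manager-r-1c19a568   # illustrative label
```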
drewhemm commented on Jan 21, 2021
Thanks for the quick reply. The problem is not the nodes being used to host the replicas, it's all the other nodes in the cluster - every one gets a Longhorn pod, which I assume is to enable them to mount the volumes for the pods they are hosting. The instance-manager-e-* pods are the problem, not the instance-manager-r-* ones. My 'longhorn' nodes are static, but I want every other node to be able to dynamically come and go when needed.
munjalpatel commented on Jan 22, 2021
I agree with @drewhemm. I am facing the same exact problem in AKS since I introduced Longhorn to the cluster a week ago.
After spending an entire day troubleshooting the question "why does the autoscaler not mark for deletion a node that has nothing but instance-manager-e-* and instance-manager-r-* pods?", we were able to pinpoint the root cause to these two Longhorn pods from the autoscaler logs.
drewhemm commented on Jan 22, 2021
I have come across a project called kube-annotate, which may provide a short-term solution. It uses a mutating admission controller to perform rule-based annotation of pods; I should be able to enforce annotation of the instance-manager-e-* pods with that. I would prefer if this functionality were available without the use of a third-party app, though.
Is there a particular reason bare pods are used instead of a DaemonSet?
munjalpatel commented on Jan 26, 2021
Folks, I would really appreciate some help here. We really want (need) to use Longhorn. However, this issue is significant since it completely prevents the autoscaler from scaling down.
We have a large cluster where proper functioning of the autoscaler is vital for cost optimization and has a significant material impact on the business.
I am fairly sure that as more folks start utilizing the amazing powers of Longhorn, they will very likely run into this issue as well.
joshimoo commented on Jan 27, 2021
@munjalpatel please ensure you use always-available dedicated storage nodes if you plan on dynamically adding/removing nodes from your cluster. See #2203 (comment) for how to set this up.
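As a sketch of that setup, assuming the dedicated storage nodes have been given a Longhorn node tag such as "storage" (the tag and class name below are examples), a StorageClass can pin replicas to those nodes via the nodeSelector parameter:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-dedicated          # example name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
  nodeSelector: "storage"           # only place replicas on nodes tagged "storage"
```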
For the moment you can use the workaround posted by @drewhemm here: #2203 (comment)
To allow the autoscaler to remove the nodes, please do not remove the instance-manager-r-* pods, since that will potentially lead to data loss; therefore it is important to use always-available dedicated storage nodes for this type of setup.
drewhemm commented on Jan 27, 2021
Hi @munjalpatel, I am working on a fork of kube-annotate, which we will use to work around this issue:
https://github.com/footprintmediaits/kube-annotate/tree/refactor-k8s-1.18
Will post back here once it is ready for use. There were some issues in the original repo, plus the code was outdated and not working with more recent versions of k8s.
drewhemm commented on Jan 27, 2021
Well, this is highly frustrating! We have got kube-annotate working, so that it adds the necessary annotation to the instance-manager-e pods... but now we hit the pod disruption budget - there is one created for each pod.
I must ask again: why are these deployed as bare pods instead of using a DaemonSet?
drewhemm commented on Jan 27, 2021
Okay, I managed to resolve this. I modified the generateInstanceManagerPDBManifest function to set the minAvailable for the instance-manager-e pods to 0. In conjunction with kube-annotate, this allows CA to remove the nodes hosting these pods when no other workloads require the capacity.
Will fork and raise a PR for this tomorrow. In the meantime, it is possible to make that change, build the instance-manager image and use it instead of the published one.
drewhemm commented on Jan 27, 2021
PR: longhorn/longhorn-manager#823
I also found where the annotations are defined for the instance-manager pods:
https://github.com/longhorn/longhorn-manager/blob/f77e4949cf16f72adaf885fc1c89cf1eeb6a1788/controller/instance_manager_controller.go#L818
Would it be acceptable to add the CA annotation here for compatibility, rather than using kube-annotate? Only for the instance-manager-e pods, naturally.