[IMPROVEMENT] labeling nodes hosting replicas to enable podAffinity #9741

dberardo-com · 2024-10-31T09:08:16Z

dberardo-com
Oct 31, 2024

i am using a volume with 2 replicas on a 3 node cluster, with best-effort data locality startegy and read write once policy.

there is one pod using the volume, which needs fast read and write to disk so that's why the "best-effort" locality.

this pod is currently "free-to-move" among all cluster's node, but because the underlying volume has a lot of data, it would be desirable the have the pod constrained to just the 2 nodes hosting the volume replicas, otherwise if the pod starts on a node that does not have a replica, a huge data transfer process needs to start.

i thought of using podAffinity on the Pod definition for this, but how can i target the nodes hosting the replicas, is there any label i can use?

DamiaSan · 2024-10-31T09:34:26Z

DamiaSan
Oct 31, 2024
Collaborator

Have a look here https://longhorn.io/docs/1.7.2/advanced-resources/deploy/node-selector/

2 replies

dberardo-com Nov 2, 2024
Author

Thanks for the reply, but this is not solving my use case.

So the issue here is not assigning replicas to nodes, but assigning pods (preference) to nodes that have replicas on them, to avoid rebuilding the replicas on other nodes if not necessary.

Please let me know if I have not 'mmade myself clear with my explanation above and I will rephrase the description.

dberardo-com Nov 30, 2024
Author

I hope that makes my point more clear? Or should I describe it Better?

DamiaSan · 2024-12-02T07:32:45Z

DamiaSan
Dec 2, 2024
Collaborator

cc @ChanYiLin

0 replies

ChanYiLin · 2024-12-05T10:14:39Z

ChanYiLin
Dec 5, 2024
Collaborator

@dberardo-com
Lets say you have 3 nodes
How about

Set the tag, for example, special on the 2 Nodes in Longhorn
Create the volume with the "Node Tags": "special", so the replicas will be only created on these 2 Nodes
Then use kubectl to set the label to the nodes with "tag"="special"
Create the pod with the nodeSelector: {"tag": "ssd"}}

The concept here is to

force the replicas to some set of the node
force the pods to same set of the node

We can only help on the first one,
but for second one it is outside the scope of Longhorn, above is just one possible way.
You can search how to force pod to be created on some nodes.

8 replies

dberardo-com Dec 6, 2024
Author

They only provide node-affinity/taint mechanism for you to place the pods on the nodes you want

these are techniques i know of, but i am not sure how k8s handles pod <-> volume stickiness contraint (affinity?) for regular volumes, think hostPath, or local provisioner in k3s, etc. I was hoping there was some advanced hidden mechanism that could be used to force pods on the same nodes as a volume

Again, Longhorn cant control users pods and it shouldn't.

i understand, but cloud providers charge big money for moving data around between data nodes, so this would make quite an impact on budgets (and performance too)

Adding labels to the node and adding node-selector or node-affinity to your pods is the most standard solution to this and is encouraged by the K8s official docs.

i see ... is there any pod in the longhorn-system namespace that one could make use of in order to create a podAffinity for the workloads ? Is there any pod, like the instanceManager for example that can be used to trigger the desired mechanism? I have not checked what happens if a node goes down whose volume was attached to the workload. is the instanceManager spawning on a new node BEFORE the workload is scheduled or does it happen AFTER ? if it happens before, then perhaps one could use the instanceMAnager as the affinity source

ChanYiLin Dec 6, 2024
Collaborator

In short, "k8s handles pod <-> volume stickiness contraint (affinity?)" doesn't exist in K8s

Pod is pod, managed by K8s and you
Volume is volume, managed by the third party, e.g. Longhorn, Ceph, GPC, AWS,...

As far as I know, hostPath only allows you to mount the path in the host to the pod on the same node. But it doesn't include forcing your pod to the node where you are going to provide the host path.
You still need to use node-selector to force the pod to the node you want, am I right?

For local-path-provisioner, it also don't include the user pods scheduling.
It only allows you to pre-defined the path in the host that you want it to be a hostPath
Then when the pod using that PVC is scheduled to one of the node, it will mount the pre-defined hostPath.
For example, if you have this config in the local-path-provisioner

  config.json: |-
        {
                "nodePathMap":[
                {
                        "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
                        "paths":["/opt/local-path-provisioner"]
                },
                {
                        "node":"node1",
                        "paths":["/node1", "/data1"]
                },
                {
                        "node":"node2",
                        "paths":["/node2", "data2"]
                }
                ]
        }

If one node is listed on the nodePathMap, the specified paths in paths will be used for provisioning.
If one node is listed but with paths set to [], the provisioner will refuse to provision on this node.
If more than one path was specified, the path would be chosen randomly when provisioning.

If your data is on node2 "/node2", you still need to specify the pod to be scheduled to the node2, so it can access the data in the path "/node2" on node2.

ChanYiLin Dec 6, 2024
Collaborator

Instance-Manager is like a daemon-set which will be created on every nodes to manage the replica/engine processes on it.
When the node goes down, the instance-manager on that node won't be created on another node since it is 1:1 mapping.
In this case, replica on that node(instance-manager) goes down, the volume still works because it use network protocal and can access another replica data on another node(instance-manager).
If the node also contains YOUR pod, then your pod will be scheduled to another node BY K8S (if yours is a deployment) and Longhorn will remount the volume to the new recreated pod and connect to other replicas.
That is the fundamental difference between using hostPath or local-path-provisioner. The whole IO system built on top of the network protocol allows the next-level resilience.

ChanYiLin Dec 6, 2024
Collaborator

Again, volume scheduling is managed by Longhorn.
But pod scheduling is managed by K8s and You.

ChanYiLin Dec 6, 2024
Collaborator

Moreover, volume is just a concept
For Longhorn, it is an iscsi initiator and target between engine and replicas on different node.
The IO from your pod goes through the engine and is dispatched to all the replicas and even wait until all replicas ack to ensure data consistency.
You cant say you want you pod to be as close as the volume to shorten the data path as long as you have extra replica on other nodes.

For OpenStack, the volume is a logical volume on different node as your VM, and it also uses the iscsi to dispatch the IO.
For Ceph, the volume is a on the RADOS layer developed by themselves which is a distributed storage system, your volume is divided by multiple blocks and stored by sharding on different nodes in the cluster. There is of course no pod-volume affinity in this case.

So, if IO performance is really a key metric for your application.
I would suggest you should looks more into the storage solution you are going to use, see how the IO works.
And do the real benchmark to see which one you can tolerate.

Resilience, replication and fault-tolerance is a tradeoff of the IO performance because you need to add more layer into the path.

dberardo-com · 2024-12-06T12:41:30Z

dberardo-com
Dec 6, 2024
Author

Instance-Manager is like a daemon-set which will be created on every nodes to manage the replica/engine processes on it.
When the node goes down, the instance-manager on that node won't be created on another node since it is 1:1 mapping.

I see ... so there is no pod in longhorn-system which is specific to a volume, meaning if i have 200 volumes there would be 200 pods of that specific kind. so no, we can't use such kind of pod as a target of pod affinity then.

Does perhaps longhorn annotate CRDs or change CRD statuses when replicas are scheduled ? I am not sure if this could be one hint in the direction of a custom solution.

You cant say you want you pod to be as close as the volume to shorten the data path as long as you have extra replica on other nodes.

well in general yes, but i am being very pragmatic here, since i know the volume is scheduled by longhorn then i know i can benefit from data-locality if the replica exists on the same node as the pod ... and i want to make use of this fact.

Anyway, i thank you for your explanation, i just wanted to make sure that my use case was not possible to achieve with the current features of longhorn and you confirmed it. I understand there is the need of custom solution to achieve what i am looking for, but i am not sure which part of k8s i should be addressing for it.

My initial bet would be using the descheduler (https://github.com/kubernetes-sigs/descheduler) or perhaps a custom controller. But since i want this solution to be generic and not related to any CRD perhaps controllers aren't the way to go. Is there any other k8s component i can make use of? is perhaps the admission hook something to look into ? any suggestion here?

this contribution by one github user goes in this direction: #5486 (reply in thread)

21 replies

ghost Dec 11, 2024

@dberardo-com If you can wait a little while I can build something for you

we should possibly make the node-labeling behavior optional for every volume, as when one has lot of volumes and replicas this might end up filling up nodes with lot of labels, which might not be a best practice

There's no need to modify or label any resources. This can be done on the fly with the mutation webhook

ghost Jan 3, 2025

@dberardo-com Did you still need this? I found an even simpler way of doing this with new additions to Kubernetes since 1.29 and custom scheduler

dberardo-com Jan 3, 2025
Author

Sounds great, sure, I still wish to test some solutions within the next weeks so this could be helpful

ghost Jan 4, 2025

New method is pretty much:

On pod definition name a custom scheduler schedulerName: pvScheduler.
That scheduler needs read access to PV bindings.
It can tell what PVC correlates to which PV because of the pod definition and information in the PV longhorn creates

It will leave the pod in pending state until the PVs are created. When they are finish it knows which nodes they were bound to. The scheduler then files a binding resource to bind pod to one of the nodes with PV.

Everytime the pod needs to be rescheduled it will use this scheduler. It will be able to do this stateless

dberardo-com Jan 9, 2025
Author

i see, i am not familiar with this concept either. do we need to create a custom scheduler ? is this a global or a namespaced resource? and how would the definition look like ?

and other then that: would then suffice to add the schedulerName attribute to all deployments/statefulSets requiring and thats it ? no need to modify also storageclasses or PVC ?

Also: why is k8s v 1.29 needed for that ? which feature is the new one ?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[IMPROVEMENT] labeling nodes hosting replicas to enable podAffinity #9741

{{title}}

Replies: 4 comments 31 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

[IMPROVEMENT] labeling nodes hosting replicas to enable podAffinity #9741

dberardo-com Oct 31, 2024

Replies: 4 comments · 31 replies

DamiaSan Oct 31, 2024 Collaborator

dberardo-com Nov 2, 2024 Author

dberardo-com Nov 30, 2024 Author

DamiaSan Dec 2, 2024 Collaborator

ChanYiLin Dec 5, 2024 Collaborator

dberardo-com Dec 6, 2024 Author

ChanYiLin Dec 6, 2024 Collaborator

ChanYiLin Dec 6, 2024 Collaborator

ChanYiLin Dec 6, 2024 Collaborator

ChanYiLin Dec 6, 2024 Collaborator

dberardo-com Dec 6, 2024 Author

ghost Dec 11, 2024

ghost Jan 3, 2025

dberardo-com Jan 3, 2025 Author

ghost Jan 4, 2025

dberardo-com Jan 9, 2025 Author

dberardo-com
Oct 31, 2024

Replies: 4 comments 31 replies

DamiaSan
Oct 31, 2024
Collaborator

dberardo-com Nov 2, 2024
Author

dberardo-com Nov 30, 2024
Author

DamiaSan
Dec 2, 2024
Collaborator

ChanYiLin
Dec 5, 2024
Collaborator

dberardo-com Dec 6, 2024
Author

ChanYiLin Dec 6, 2024
Collaborator

ChanYiLin Dec 6, 2024
Collaborator

ChanYiLin Dec 6, 2024
Collaborator

ChanYiLin Dec 6, 2024
Collaborator

dberardo-com
Dec 6, 2024
Author

dberardo-com Jan 3, 2025
Author

dberardo-com Jan 9, 2025
Author