enhancement: Add LEP 'support Data Locality option to keep a replica local to the engine'

longhorn#1045

Signed-off-by: Phan Le <phan.le@rancher.com>
PhanLe1010 authored and yasker committed Sep 8, 2020
1 parent 0fe8cd6 commit 93c0d6a
enhancements/20200819-keep-a-local-replica-to-engine.md (new file, 244 additions)
# Data Locality - Option To Keep A Local Replica On The Same Node As The Engine

## Summary

A Longhorn volume can be backed by replicas on some nodes in the cluster and accessed by a pod running on any node in the cluster.
In the current implementation of Longhorn, the pod which uses a Longhorn volume could be on a node that doesn't contain any replica of the volume.
In some cases, it is desirable to have a local replica on the same node as the consuming pod.
In this document, we refer to the property of having a local replica as having `data locality`.

This enhancement gives users the option to have a local replica on the same node as the engine, which means on the same node as the consuming pod.

### Related Issues

https://github.com/longhorn/longhorn/issues/1045

## Motivation

### Goals

Provide users with an option that makes Longhorn try to migrate a replica to the same node as the consuming pod.

### Non-goals

Another approach to achieve data locality is to influence Kubernetes' scheduling decisions so that pods get scheduled onto nodes which contain the volume's replicas.
However, this is not a goal of this LEP. See https://github.com/longhorn/longhorn/issues/1045 for more discussion of this approach.

## Proposal

We give users two options for the data locality setting: `disabled` and `best-effort`.
In `disabled` mode, Longhorn doesn't do anything: there may or may not be a local replica of the volume on the same node as the consuming pod.
In `best-effort` mode, if a volume is attached to a node that has no replica, the Volume Controller will start rebuilding a replica on that node after the volume is attached.
Once the rebuilding process is done, it will remove one of the other replicas to keep the replica count as specified.

### User Stories

Sometimes, having `data locality` is critical.
For example, when the network is bad or the node is temporarily disconnected, having a local replica keeps the consuming pod running.
Another case is that sometimes the application workload can do replication itself (e.g. a database) and wants each pod to have a volume with a single replica.
Without the `data locality` feature, multiple replicas may end up on the same node, which defeats the replication intention of the workload. See more in [Story 2](#story-2).

In the current implementation of Longhorn, users cannot ensure that a pod will have a local replica.
After this enhancement is implemented, users can choose between `disabled` (the default setting) and `best-effort`.

#### Story 1

A user has three hyper-converged nodes and the default settings with `default-replica-count: 2`.
They want to ensure that a pod always runs with at least one local replica, which would reduce the amount of network traffic needed to keep the data in sync.
There does not appear to be an obvious way to schedule the pod using affinities.

#### Story 2

A user runs a database application that can do replication itself.
The database app creates multiple pods and each pod uses a Longhorn volume with `replica-count = 1`.
The database application knows how to schedule pods into different nodes so that they achieve HA.
The problem is that replicas of multiple volumes could land on the same node, which defeats the HA capability.
With the `data locality` feature we can ensure that each replica is on the same node as its consuming pod, and therefore the replicas end up on different nodes.

### User Experience In Detail

* Users create a new volume using Longhorn UI with `dataLocality` set to `best-effort`.
* If users attach the volume to a node which doesn't contain any replica, they will see Longhorn migrate a local replica to that node.
* Users create a StorageClass with `dataLocality: best-effort` set.
* Users launch a StatefulSet with the StorageClass.
* Users will find that there is always a replica on the node where the pod resides.
* Users update `dataLocality` to `disabled`, detach the volume, and attach it to a node which doesn't have any replica.
* Users will see that Longhorn does not create a local replica on the new node.

### API changes

There are 2 API changes:
1. When creating a new volume, the body of the request sent to `/v1/volumes` has a new field `dataLocality` set to either `disabled` or `best-effort`.
1. Implement a new API for users to update the `dataLocality` setting of an individual volume.
The new API could be `/v1/volumes/<VOLUME_NAME>?action=updateDataLocality`. This API expects the request's body to have the form `{dataLocality:<DATA_LOCALITY_MODE>}`.
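
For illustration, the two request bodies could look like the following Go structs (a minimal sketch; the struct names and the fields other than `dataLocality` are assumptions, not the actual Longhorn API types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VolumeCreateInput sketches the body sent to /v1/volumes; only the fields
// relevant to this proposal are shown (hypothetical struct and field set).
type VolumeCreateInput struct {
	Name             string `json:"name"`
	Size             string `json:"size"`
	NumberOfReplicas int    `json:"numberOfReplicas"`
	DataLocality     string `json:"dataLocality"` // "disabled" or "best-effort"
}

// UpdateDataLocalityInput sketches the body expected by
// /v1/volumes/<VOLUME_NAME>?action=updateDataLocality.
type UpdateDataLocalityInput struct {
	DataLocality string `json:"dataLocality"`
}

func main() {
	create, _ := json.Marshal(VolumeCreateInput{
		Name: "testvol", Size: "10Gi", NumberOfReplicas: 2, DataLocality: "best-effort",
	})
	update, _ := json.Marshal(UpdateDataLocalityInput{DataLocality: "disabled"})
	fmt.Println(string(create)) // body for POST /v1/volumes
	fmt.Println(string(update)) // body for ?action=updateDataLocality
}
```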

## Design

### Implementation Overview
There are 2 modes for `dataLocality`:
1. `disabled` is the default mode.
In this mode, Longhorn doesn't do anything: there may or may not be a local replica of the volume on the same node as the consuming pod.
1. `best-effort` mode instructs Longhorn to try to keep a local replica on the same node as the consuming pod.
If Longhorn cannot keep the local replica (due to not having enough disk space, incompatible disk tags, etc.), Longhorn does not stop the volume.
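
For reference, the two modes could be modeled as string constants along these lines (a sketch with assumed identifier names; the actual constants in the Longhorn codebase may differ):

```go
// Package types is a hypothetical home for this sketch.
package types

// DataLocality expresses whether Longhorn should try to keep a replica
// on the same node as the engine (and therefore the consuming pod).
type DataLocality string

const (
	// DataLocalityDisabled: Longhorn takes no action; a local replica
	// may or may not exist.
	DataLocalityDisabled DataLocality = "disabled"
	// DataLocalityBestEffort: Longhorn tries to rebuild a replica on the
	// engine's node, but never stops the volume if it cannot.
	DataLocalityBestEffort DataLocality = "best-effort"
)
```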

There are 3 settings the user can change for `data locality`:

1. The global default setting in the Longhorn UI settings.
The global setting should only function as a default value, like the replica count.
It doesn't change any existing volume's setting.
1. Specify the `dataLocality` mode for an individual volume upon creation using the UI.
1. Specify the `dataLocality` mode as a parameter on a StorageClass.

Implementation steps:

1. Add a global setting `DefaultDataLocality`
1. Add the new field `DataLocality` to `VolumeSpec`
1. Modify the volume creation API so that it extracts, verifies, and sets the `dataLocality` mode for the new volume.
If the volume creation request doesn't have a `dataLocality` field in its body, we use the `DefaultDataLocality` for the new volume.
1. Modify the `CreateVolume` function inside the CSI package so that it extracts, verifies, and sets the `dataLocality` mode for the new volume.
This makes sure that Kubernetes can use CSI to create a Longhorn volume with a specified `dataLocality` mode.
1. Inside the `volume controller`'s sync logic, we add a new function, `ReconcileLocalReplica` (a simplified sketch of its flow follows this list).
1. When a volume enters the `volume controller`'s sync logic, `ReconcileLocalReplica` checks the `dataLocality` mode of the volume.
If the `dataLocality` is `disabled`, it does nothing and returns.
1. If the `dataLocality` is `best-effort`, `ReconcileLocalReplica` checks whether there is a local replica on the same node as the volume.
1. If there is no local replica, we create an in-memory replica struct.
We don't create the replica in the datastore using `createReplica()` directly because we may need to delete the new replica if `ScheduleReplicaToNode` fails.
This prevents the UI from repeatedly showing the new replica being created and deleted.
Then we try to schedule the replica struct onto the consuming pod's node.
If the scheduling fails, we don't do anything; the replica struct will be collected by Go's garbage collector.
If the scheduling succeeds, we save the replica struct to the data store, which triggers replica rebuilding on the consuming pod's node.
1. If a local replica already exists on the consuming pod's node, we check whether there are more healthy replicas than specified in the volume's spec.
If there are, we remove a replica from one of the other nodes.
We prefer to delete replicas on the same disk, then replicas on the same node, then replicas in the same zone.
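
Below is a rough, self-contained sketch of the `ReconcileLocalReplica` flow described in the steps above. `Volume`, `Replica`, and `scheduleReplicaToNode` are simplified stand-ins for the real Longhorn controller types and scheduler, so this illustrates the logic rather than the actual implementation:

```go
package main

import "fmt"

// Minimal stand-in types for the sketch; the real Longhorn types
// (volume CRD, replica CRD, scheduler, datastore) are much richer.
type Volume struct {
	DataLocality     string
	NumberOfReplicas int
	CurrentNodeID    string
}

type Replica struct {
	Name    string
	NodeID  string
	Healthy bool
}

// reconcileLocalReplica sketches the flow described in the list above.
func reconcileLocalReplica(v *Volume, replicas []*Replica) []*Replica {
	if v.DataLocality != "best-effort" {
		// "disabled" (or empty, for pre-upgrade volumes): do nothing.
		return replicas
	}

	var local *Replica
	for _, r := range replicas {
		if r.NodeID == v.CurrentNodeID {
			local = r
			break
		}
	}

	if local == nil {
		// Build the replica in memory first; only persist it if scheduling
		// succeeds, so the UI never shows create/delete churn.
		candidate := &Replica{Name: "local-candidate", NodeID: v.CurrentNodeID}
		if !scheduleReplicaToNode(candidate, v.CurrentNodeID) {
			// Best effort only: drop the struct and leave the volume as-is.
			return replicas
		}
		// Persisting the replica triggers rebuilding on the consuming pod's node.
		return append(replicas, candidate)
	}

	// A local replica exists; if there are now more healthy replicas than
	// requested, remove one remote replica (the real code prefers replicas
	// sharing a disk, then a node, then a zone).
	healthy := 0
	for _, r := range replicas {
		if r.Healthy {
			healthy++
		}
	}
	if healthy > v.NumberOfReplicas {
		for i, r := range replicas {
			if r.NodeID != v.CurrentNodeID && r.Healthy {
				return append(replicas[:i], replicas[i+1:]...)
			}
		}
	}
	return replicas
}

// scheduleReplicaToNode is a placeholder for the real scheduler check
// (disk space, disk tags, node tags, ...).
func scheduleReplicaToNode(r *Replica, nodeID string) bool { return true }

func main() {
	v := &Volume{DataLocality: "best-effort", NumberOfReplicas: 2, CurrentNodeID: "node-1"}
	replicas := []*Replica{
		{Name: "r-a", NodeID: "node-2", Healthy: true},
		{Name: "r-b", NodeID: "node-3", Healthy: true},
	}
	for _, r := range reconcileLocalReplica(v, replicas) {
		fmt.Println(r.Name, "on", r.NodeID)
	}
}
```

As in the description above, the extra remote replica is only removed once the local replica has become healthy, which happens on a later reconciliation pass.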

UI modification:
1. On volume creation, add an input field for `dataLocality`
1. On volume detail page:
* On the right volume info panel, add a `<div>` to display `selectedVolume.dataLocality`
* On the right volume panel, in the Health row, add an icon for the data locality status.
Specifically, if `dataLocality=best-effort` but there is no local replica, display a warning icon,
similar to the replica node redundancy warning [here](https://github.com/longhorn/longhorn-ui/blob/0a52c1f0bef172d8ececdf4e1e953bfe78c86f29/src/routes/volume/detail/VolumeInfo.js#L47)
* In the volume's actions dropdown, add a new action to update `dataLocality`
1. In the Rancher UI, add a `dataLocality` parameter when creating a storage class using the Longhorn provisioner.

### Test plan

#### Manual Test Plan
1. Create a cluster of 9 worker nodes and install Longhorn.
Having more nodes helps us to be more confident because the chance of randomly scheduling a replica onto the same node as the engine is small.

##### Test volume creation with `dataLocality` set to `best-effort`:

1. Create volume `testvol` with `Number of Replicas = 2` and `dataLocality` set to `best-effort`.
1. Attach `testvol` to a node that doesn't contain any replica.
1. Verify that Longhorn schedules a local replica to the same node as the consuming pod.
After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 2.

##### Test volume creation with `dataLocality` set to `disabled`:

1. Create another volume, `testvol2`, with `Number of Replicas = 2` and `dataLocality` set to `disabled`.
1. Attach `testvol2` to a node that doesn't contain any replica.
1. Verify that Longhorn doesn't move any replica.

##### Test volume creation with `dataLocality` unspecified and the `DefaultDataLocality` setting as `disabled`:
1. Leave the `DefaultDataLocality` setting as `disabled` in the Longhorn UI.
1. Create another volume, `testvol3`, with `Number of Replicas = 2` and `dataLocality` left empty.
1. Attach `testvol3` to a node that doesn't contain any replica.
1. Verify that the `dataLocality` of `testvol3` is `disabled` and that Longhorn doesn't move any replica.

##### Test volume creation with `dataLocality` unspecified and the `DefaultDataLocality` setting as `best-effort`:
1. Set the `DefaultDataLocality` setting to `best-effort` in the Longhorn UI.
1. Create another volume, `testvol4`, with `Number of Replicas = 2` and `dataLocality` left empty.
1. Attach `testvol4` to a node that doesn't contain any replica.
1. Verify that the `dataLocality` of `testvol4` is `best-effort`.
1. Verify that Longhorn schedules a local replica to the same node as the consuming pod.
After the local replica finishes rebuilding,
Longhorn removes a replica on another node to keep the number of replicas at 2.

##### Test `updateDataLocality` from `disabled` to `best-effort`:
1. Change `dataLocality` to `best-effort` for `testvol2`.
1. Verify that Longhorn schedules a local replica to the same node as the consuming pod.
After the local replica finishes rebuilding,
Longhorn removes a replica on another node to keep the number of replicas at 2.

##### Test `updateDataLocality` from `best-effort` to `disabled`:
1. Change `dataLocality` to `disabled` for `testvol2`.
1. Go to the Longhorn UI and increase the `number of replicas` to 3. Wait until the new replica finishes rebuilding.
1. Delete the local replica on the same node as the consuming pod.
1. Verify that Longhorn doesn't move any replica.

##### Test volume creation using a storage class with the `dataLocality` parameter set to `disabled`:
1. Create the `disabled-longhorn` storage class from this yaml file:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: disabled-longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  dataLocality: "disabled"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
```
1. Create a deployment of 1 pod using a PVC dynamically created by the `disabled-longhorn` storage class.
1. The consuming pod is likely scheduled onto a different node than the replica.
If this happens, verify that Longhorn doesn't move any replica.

##### Test volume creation using a storage class with the `dataLocality` parameter set to `best-effort`:
1. Create the `best-effort-longhorn` storage class from this yaml file:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: best-effort-longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  dataLocality: "best-effort"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
```
1. Create a shell deployment of 1 pod using the PVC dynamically created by the `best-effort-longhorn` storage class.
1. The consuming pod is likely scheduled onto a different node than the replica.
1. If this happens, verify that Longhorn schedules a local replica to the same node as the consuming pod.
After the local replica finishes rebuilding, Longhorn removes a replica on another node to keep the number of replicas at 1.
1. Verify that the volume CRD has `dataLocality` set to `best-effort`.

##### Test volume creation using a storage class with the `dataLocality` parameter unspecified:
1. Create the `unspecified-longhorn` storage class from this yaml file:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: unspecified-longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
```
1. Create a shell deployment of 1 pod using a PVC dynamically created by the `unspecified-longhorn` storage class.
1. The consuming pod is likely scheduled onto a different node than the replica.
1. If this happens, depending on the `DefaultDataLocality` setting in the Longhorn UI, verify that Longhorn does or doesn't migrate a local replica to the same node as the consuming pod.

##### Tests for the volumes created in old versions:

1. Volumes created in old Longhorn versions don't have the `dataLocality` field.
1. We treat those volumes the same as if `dataLocality` were set to `disabled`.
1. Verify that Longhorn doesn't migrate replicas for those volumes.

### Upgrade strategy

No special upgrade strategy is required.

We are adding a new field, `dataLocality`, to the volume CRD's spec.
We then use this field to check whether we need to migrate a replica to the same node as the consuming pod.
When users upgrade Longhorn to this new version, it is possible that some volumes don't have this field.
This is not a problem because we only migrate a replica when `dataLocality` is `best-effort`,
so an empty `dataLocality` field is fine.
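
As a minimal illustration (a hypothetical helper, not the actual upgrade code), the check can simply treat anything other than `best-effort`, including an empty string, as `disabled`:

```go
package main

import "fmt"

// needsLocalReplica is a hypothetical helper: only an explicit "best-effort"
// triggers replica migration, so volumes created before the upgrade
// (empty dataLocality) behave exactly like "disabled".
func needsLocalReplica(dataLocality string) bool {
	return dataLocality == "best-effort"
}

func main() {
	fmt.Println(needsLocalReplica(""))            // false: pre-upgrade volume
	fmt.Println(needsLocalReplica("disabled"))    // false
	fmt.Println(needsLocalReplica("best-effort")) // true
}
```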
