Merge pull request kubevirt#4663 from brybacki/document-local-storage-vm
Document new kubevirt handling of WaitForFirstConsumer DataVolumes
kubevirt-bot authored Jan 8, 2021
2 parents 597accc + 35decbc commit 21d4617
Showing 1 changed file with 51 additions and 0 deletions.
51 changes: 51 additions & 0 deletions docs/localstorage-disks.md
@@ -0,0 +1,51 @@
# Local Storage Placement for VM Disks

This document describes the special handling of `DataVolumes` in the `WaitForFirstConsumer` state (this state is available from [CDI v1.21.0](https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.21.0)).

## Use-case

When a `Virtual Machine` has one or more `DataVolume` disks, bind each Local Storage `PVC` to a `PV` on the same `node` where the `VMI` is going to be scheduled.

## The problem

Virtual Machines can have DataVolume disks that are based on Local Storage PVs. Local Storage PVs are bound to a specific node.
Because a DataVolume prepares storage with an image before the VMI consumes it,
the VMI can end up Unschedulable if it cannot be scheduled to the node that the local storage PV was previously pinned to.
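
To illustrate why a Local Storage PV is tied to one node, here is a minimal sketch of such a PV. The names `local-pv-node1`, `local-storage` and `node1`, the path and the capacity are assumptions chosen for illustration; the relevant part is the `nodeAffinity` block, which restricts consumers of the volume to a single node.

```yaml
# Illustrative local PersistentVolume; all names, the path and the size are hypothetical.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1
  # Pods using this PV can only run on the node selected here.
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1
```

Once a PVC is bound to this PV, any pod that mounts the PVC, including the virtlauncher pod, has to land on `node1`.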

When a VM with a DataVolumeTemplate is defined, a DataVolume is created from the template and the `CDI` creates a worker Pod to import/upload/clone data to the PVC (specified in the template).
To run a VMI, Kubevirt creates a virtlauncher pod with all the VMI requirements. Kubernetes uses the virtlauncher pod requirements to schedule it on a specific node.
The worker Pod might have different constraints than the Kubevirt VM. When the VM is scheduled on a different node than the PVC, the PVC becomes unusable for that VM.
This is especially problematic for a VM whose DataVolumeTemplates define many disks managed by CDI.

## The solution

The solution is to leverage the Kubernetes pod scheduler to bind the PVC to a PV on the correct node.
With a StorageClass whose `volumeBindingMode` is set to `WaitForFirstConsumer`, the binding and provisioning of the PV are delayed until a Pod using the PersistentVolumeClaim is created.
Kubevirt can schedule a special ephemeral pod that becomes the first consumer of the PersistentVolumeClaim.
Its only purpose is to be scheduled to a node capable of running the VM and, by using the VM's PVCs, to trigger Kubernetes to provision and bind the PVs on that same node.
After the PVCs are bound, the `CDI` can do its work and Kubevirt can start the actual VM.
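
A StorageClass with delayed binding might look like the following sketch. The name `local-storage` is an assumption; `kubernetes.io/no-provisioner` is the provisioner Kubernetes documents for statically created local volumes, and a different provisioner may apply in your cluster.

```yaml
# Illustrative StorageClass; the name is hypothetical.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
# Local volumes created by hand have no dynamic provisioner.
provisioner: kubernetes.io/no-provisioner
# Delay PV binding until a Pod that uses the PVC is scheduled.
volumeBindingMode: WaitForFirstConsumer
```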

## Implementation

### Flow

1. A StorageClass with `volumeBindingMode=WaitForFirstConsumer` is created
2. User creates the VM with DataVolumeTemplate containing
3. `Kubevirt` creates the DataVolume
4. The `CDI` sees that the new DV has an unbound PVC whose storage class has `volumeBindingMode=WaitForFirstConsumer`, sets the phase of the DV to `WaitForFirstConsumer` and waits for the PVC to be bound by some external action
5. `Kubevirt` sees the DV in phase `WaitForFirstConsumer`, so it creates an ephemeral pod (basically a virtlauncher pod
without a VM payload and with the `kubevirt.io/ephemeral-provisioning` annotation) that is only used to force PV provisioning
6. Kubernetes schedules the ephemeral pod on a node that meets all the VM requirements. The pod requires
the same PVC as the VM, so Kubernetes has to provision and bind the PV to the PVC on the correct node before the pod can be started
7. `CDI` sees that the PVC is Bound, changes the DV status to `ImportScheduled` (or the clone/upload equivalent), and tries to start worker pods
8. `Kubevirt` sees that the DV status is `ImportScheduled`, so it can terminate the ephemeral provisioning pod
9. `CDI` does the import and marks the DV as `Succeeded`
10. `Kubevirt` creates the virtlauncher pod to start the VM

This flow differs from the standard scenario (import/upload/clone on storage with Immediate binding) in steps 4, 5, 6 and 8.
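
As a rough illustration of the kind of VM that enters this flow, the sketch below defines a VM with one DataVolumeTemplate backed by a `WaitForFirstConsumer` storage class. All names, the image URL, the sizes and the storage class `local-storage` are assumptions, and the `apiVersion` depends on the installed Kubevirt release (older releases use `kubevirt.io/v1alpha3`).

```yaml
# Illustrative VirtualMachine; names, URL and sizes are hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-local-disk
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        # The VMI disk refers to the DataVolume created from the template below.
        - name: rootdisk
          dataVolume:
            name: vm-local-disk-dv
  dataVolumeTemplates:
    - metadata:
        name: vm-local-disk-dv
      spec:
        source:
          http:
            url: https://example.com/images/disk.qcow2
        pvc:
          # Storage class assumed to use volumeBindingMode: WaitForFirstConsumer.
          storageClassName: local-storage
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
```

With a manifest like this, the DataVolume is expected to report the `WaitForFirstConsumer` phase until the ephemeral pod has been scheduled and the PVC is bound.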

Note:
The `WaitForFirstConsumer` state for DataVolumes is available in CDI from v1.21.0 and is toggled by the `HonorWaitForFirstConsumer` feature gate.
When the `HonorWaitForFirstConsumer` feature gate is enabled, the `CDI` does not start any worker pods when the PVC's StorageClass binding mode is `WaitForFirstConsumer`. Instead, the `CDI` puts the DataVolume in the new `WaitForFirstConsumer` state.
See the [CDI docs](https://github.com/kubevirt/containerized-data-importer/blob/master/doc/waitforfirstconsumer-storage-handling.md) for more details.
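
As a sketch of how the feature gate might be enabled, the fragment below sets it on the CDI custom resource. This assumes a CDI release that reads feature gates from `spec.config.featureGates` on a `CDI` resource named `cdi`; the location of the setting has changed between CDI releases, so check the CDI docs linked above for the version you run.

```yaml
# Illustrative only; the resource name and field location are assumptions
# that depend on the installed CDI release.
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  config:
    featureGates:
      - HonorWaitForFirstConsumer
```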
