Detailed Design for Volume Mount/Unmount Redesign #21931

Closed
saad-ali opened this issue Feb 24, 2016 · 11 comments
Labels
kind/design, sig/storage

Comments

@saad-ali (Member) commented Feb 24, 2016

Objective

Background

In the existing Kubernetes design, the kubelet is responsible for determining which volumes to attach/mount to, and detach/unmount from, the node it is running on.

The loop in kubelet that is responsible for attaching and mounting volumes (the pod creation loop) is separate from, and completely independent of (it runs on a separate thread), the loop that is responsible for unmounting and detaching volumes (the orphaned volumes loop). This leads to race conditions between the asynchronous pod creation and orphaned volumes loops.

Although there is some logic in the GCE PD, AWS, and Cinder plugins to make sure that the actual attach/detach operations don’t interrupt each other, there is no guarantee as to the order of the operations themselves. For example, when a pod is created, rapidly deleted, and recreated, kubelet attaches and mounts the pod's volume; then, if the second attach operation is triggered before the detach operation (which often happens), kubelet executes the second attach operation successfully (since the disk is already attached), and the pending detach operation then detaches an in-use disk (which appears as data loss to the user).

To mask this behavior, kubelet currently fails attach operations if the disk is already attached. This allows the second attach operation to fail, and the subsequent detach operation to succeed; further retries of the attach operation then succeed.

Although this workaround masks a nasty bug (apparent data loss to the end user), it results in other unwanted behavior (bugs):

  1. When a kubelet is restarted, pods with volumes that are already attached will fail to start, because kubelet's attempts to attach an already-attached volume continuously fail.
  2. When two or more pods specify the same volume (allowed for some plugins in certain cases), the second pod will fail to start, because the volume attach call for the second pod continuously fails since the volume is already attached (by the first pod).

The Volume Attach/Detach Controller design (#20262) plans to move the attach/detach logic from kubelet to the master; however, the kubelet will still be responsible for mounting/unmounting (and, for backwards-compatibility reasons, attach/detach in some cases), so these issues must be addressed.

Solution Overview

Introduce a new asynchronous loop, called the volume manager loop, in kubelet that handles attach/detach and mount/unmount in a serialized manner. The existing orphaned volumes loop will be removed, and the logic for unmounting/detaching volumes will be moved to the volume manager loop. Similarly, the logic for determining which volumes to mount/attach will be moved from the pod creation loop to the volume manager loop. The pod creation loop will simply poll the new volume manager until its volumes are ready for use (attached and mounted).
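As a rough, non-authoritative illustration of that polling, the sketch below shows how the pod creation loop could block on the volume manager instead of mounting volumes itself; the volumeManager interface, the VolumesReady method, and the timing values are assumptions of this sketch, not part of the design.

```go
package kubeletsketch

import (
	"fmt"
	"time"
)

// volumeManager is a stand-in for the new volume manager; VolumesReady would
// report whether all volumes for the given pod are attached and mounted.
type volumeManager interface {
	VolumesReady(podUID string) bool
}

// waitForPodVolumes polls the volume manager until the pod's volumes are
// ready for use, or the timeout expires.
func waitForPodVolumes(vm volumeManager, podUID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if vm.VolumesReady(podUID) {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for volumes of pod %s to attach/mount", podUID)
}
```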

Detailed Design

The volume manager will maintain an in-memory cache containing a list of volumes that are required by the node (i.e. volumes that are referenced by pods scheduled to the node that the kubelet is running on). Each of these volumes will, in addition to the volume name, specify if the volume is mounted in read-only mode, and list the pods referencing the volume. This cache defines the state of the world according to the volume manager. This cache must be thread-safe.
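A minimal Go sketch of such a thread-safe cache follows; the type and field names are illustrative, not the actual kubelet implementation.

```go
package volumemanager

import "sync"

// volumeState records one volume required by the node: whether it was
// attached/mounted read-only and which pods reference it.
type volumeState struct {
	ReadOnly bool
	Pods     map[string]struct{} // keyed by pod UID
}

// cache is the volume manager's in-memory "state of the world".
// All access goes through the mutex to keep it thread-safe.
type cache struct {
	mu      sync.Mutex
	volumes map[string]*volumeState // keyed by unique volume name (plugin + volume)
}

// AddVolume records that a volume has been attached and mounted to the node.
func (c *cache) AddVolume(uniqueName string, readOnly bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.volumes[uniqueName]; !ok {
		c.volumes[uniqueName] = &volumeState{ReadOnly: readOnly, Pods: map[string]struct{}{}}
	}
}

// AddPod records that a pod references an already-tracked volume.
func (c *cache) AddPod(uniqueName, podUID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.volumes[uniqueName]; ok {
		v.Pods[podUID] = struct{}{}
	}
}
```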

On initial startup, the volume manager will read the /var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/ and /var/lib/kubelet/pods/{podID}/volumes/ directories to figure out which volumes were attached and mounted to the node before it went down and pre-populate the in-memory cache.
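A rough sketch of how that pre-population step might enumerate the relevant directories is shown below; the function names are illustrative and error handling is trimmed, but the paths follow the layout described above.

```go
package volumemanager

import "path/filepath"

// discoverExistingMounts lists the per-volume mount directories that survived
// a kubelet restart (/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{vol}).
func discoverExistingMounts(kubeletDir string) ([]string, error) {
	pattern := filepath.Join(kubeletDir, "plugins", "kubernetes.io", "*", "mounts", "*")
	return filepath.Glob(pattern)
}

// discoverPodVolumeDirs lists pod-specific volume directories
// (/var/lib/kubelet/pods/{podID}/volumes/{plugin}/{volumeName}).
func discoverPodVolumeDirs(kubeletDir string) ([]string, error) {
	pattern := filepath.Join(kubeletDir, "pods", "*", "volumes", "*", "*")
	return filepath.Glob(pattern)
}
```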

Primary Control Loop

The volume manager will have a loop that does the following (a simplified Go sketch follows the list):

  • Fetch a copy of all pod and mirror pod objects from kubelet via pod manager.
  • Acquire a lock on the in-memory cache.
  • Search for new pods by iterating through the fetched pods and, for each pod with a PodPhase of Pending, check the volumes it defines or references (dereferencing any PersistentVolumeClaims to fetch the associated PersistentVolume objects). For each of these volumes:
    • If the volume is not already in the in-memory cache (indicating a new volume has been discovered), then:
      • Trigger “attach volume and mount device” logic (detailed in section below) to attach the volume to the node and mount it to the main mount location.
      • A matching volume must also match the access mode. For example, if a volume was attached in read-only mode, a new pod referencing the same volume in read-write mode will be treated as referencing a separate volume (this enables the scenario mentioned in the objectives above).
    • If the volume is already tracked in the in-memory cache (indicating it is already attached and mounted to the main mount location), then:
      • Trigger “bind mount” logic to mount the volume to pod specific mount location.
  • Search for terminated/deleted pods by looping through all cached pods (i.e. volume->pods) and trigger “unmount bind mount” logic for the volume(s) defined for that pod, if:
    • The cached pod is not present in the list of fetched pods (indicating the pod object was deleted from the API server or rescheduled).
    • The cached pod is present in the list of fetched pods, but the PodPhase is Succeeded or Failed.
  • Loop through all cached volumes and trigger “unmount device and detach volume” logic (detailed below) for any volumes that are no longer needed (i.e. they exist in the in-memory cache but have no pods listed under them, indicating no running pods are using the volume).
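The following is a deliberately simplified Go sketch of one iteration of that loop. Every type and helper here is an illustrative stand-in for the real kubelet objects (pod manager, in-memory cache, operation triggers), and the trigger functions stand in for the worker threads described in the sections below.

```go
package volumemanager

type podPhase string

const (
	podPending   podPhase = "Pending"
	podSucceeded podPhase = "Succeeded"
	podFailed    podPhase = "Failed"
)

type pod struct {
	UID     string
	Phase   podPhase
	Volumes []volume // PersistentVolumeClaims already dereferenced to volumes
}

type volume struct {
	UniqueName string // plugin name + plugin-specific volume name
	ReadOnly   bool   // in the real design, RO vs RW also affects whether volumes "match"
}

type volumeManager struct {
	// cache maps unique volume name -> set of pod UIDs using that volume.
	cache map[string]map[string]bool

	// Operation triggers, detailed in the sections below; in the real design
	// each spawns a worker thread rather than running inline.
	attachAndMount   func(v volume)
	bindMount        func(v volume, podUID string)
	unmountBindMount func(volName, podUID string)
	unmountAndDetach func(volName string)
}

func (vm *volumeManager) reconcile(fetched []pod) {
	fetchedByUID := map[string]pod{}
	for _, p := range fetched {
		fetchedByUID[p.UID] = p
	}

	// New pods: attach/mount newly discovered volumes, bind mount known ones.
	for _, p := range fetched {
		if p.Phase != podPending {
			continue
		}
		for _, v := range p.Volumes {
			if _, known := vm.cache[v.UniqueName]; !known {
				vm.attachAndMount(v) // adds v to the cache on success
			} else {
				vm.bindMount(v, p.UID) // adds p under v in the cache on success
			}
		}
	}

	// Terminated or deleted pods: unmount their bind mounts.
	for volName, podUIDs := range vm.cache {
		for uid := range podUIDs {
			p, present := fetchedByUID[uid]
			if !present || p.Phase == podSucceeded || p.Phase == podFailed {
				vm.unmountBindMount(volName, uid) // removes uid from the cache on success
			}
		}
	}

	// Volumes with no remaining pods: unmount the device and detach.
	for volName, podUIDs := range vm.cache {
		if len(podUIDs) == 0 {
			vm.unmountAndDetach(volName) // removes volName from the cache on success
		}
	}
}
```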

Attach, detach, mount, and unmount operations can take a long time to complete, so the primary volume manager loop should not block on these operations. Instead the primary loop should spawn new threads for these operations. The number of threads that can be spawned will be capped (possibly using a thread pool) and once the cap is hit, subsequent requests will have to wait for a thread to become available.

To prevent multiple attach/detach or mount/unmount operations on the same volume, the main thread will maintain a table mapping volumes to currently active operations.

The volume name used as the key for this table, and as the volume name in the in-memory cache, will be a unique name that includes the plugin name and the unique name the plugin uses to identify the volume, not the volume name specified in the pod spec (because the same volume can be specified under two different pod definitions with different names).
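The "data structure for managing go routines by name" referenced in the commits at the end of this thread serves exactly this purpose; a minimal sketch of such a table (illustrative, not the actual implementation) might look like the following.

```go
package volumemanager

import (
	"fmt"
	"sync"
)

// pendingOperations prevents more than one attach/detach/mount/unmount
// operation from running concurrently for the same volume.
type pendingOperations struct {
	mu     sync.Mutex
	active map[string]bool // keyed by unique volume name
}

func newPendingOperations() *pendingOperations {
	return &pendingOperations{active: map[string]bool{}}
}

// Run starts op in a new goroutine unless an operation for volumeName is
// already in flight, in which case it returns an error and the caller
// (the volume manager loop) retries on a later iteration.
func (p *pendingOperations) Run(volumeName string, op func()) error {
	p.mu.Lock()
	if p.active[volumeName] {
		p.mu.Unlock()
		return fmt.Errorf("operation for volume %q already pending", volumeName)
	}
	p.active[volumeName] = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.active, volumeName)
			p.mu.Unlock()
		}()
		op()
	}()
	return nil
}
```

Each operation section below would then begin by calling something like Run(uniqueVolumeName, ...) and aborting if it returns an error, which corresponds to the "acquire operation lock / abort if a pending operation exists" steps they describe.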

Attach Volume and Mount Device

  • Spawn a new thread for operation.
    • Abort if there are no more threads available, i.e. there are too many pending operations in-flight (the volume manager loop will retry, if needed).
  • Acquire operation lock for volume so that no other attach/detach/mount/unmount operations can be started for the specified volume.
    • Abort if there is already a pending operation for the specified volume (the volume manager loop will retry, if needed).
  • If attach logic is configured on (default behavior), and the volume type implements the Attacher interface:
    • Execute the volume-specific logic to attach the specified volume to the specified node.
      • If there is an error indicating the volume is already attached to the specified node, assume attachment was successful (this will be the responsibility of the plugins).
      • For all other errors, log the error, and terminate the thread (the volume manager loop will retry as needed).
  • If attach logic is configured off, make a call to the API server to fetch the VolumeStatus object under the PodStatus for the volume and check that it indicates safeToMount.
    • If volume does not become attached within a reasonable amount of time, log an error, and terminate the thread (the volume manager loop will retry as needed).
  • Execute volume-specific logic to verify that volume is attached.
    • If volume does not become attached within a reasonable amount of time, log an error, and terminate the thread (the volume manager loop will retry as needed).
  • Mount volume to main mount location:
    • Execute the volume-specific logic to mount the volume to /var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/.
    • Acquire a lock on the in-memory cache (block until lock is acquired).
    • Add the volume to the in-memory cache, to indicate the volume was successfully attached and mounted to the main mount location, and set the read-only flag indicating whether it was attached in read-only or read-write mode.
    • Release the lock on the in-memory cache.
  • Release operation lock for volume.

Bind Mount to Pod Specific Location

  • Spawn a new thread for operation.
    • Abort if there are no more threads available, i.e. there are too many pending operations in-flight (the volume manager loop will retry, if needed).
  • Acquire operation lock for volume so that no other attach/detach/mount/unmount operations can be started for the volume.
    • Abort if there is already a pending operation for the specified volume (the volume manager loop will retry, if needed).
  • Verify that the main mount location exists (/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/).
    • If it does not, log an error, and terminate the thread (the volume manager loop will retry as needed).
  • Bind mount the main mount location to the pod-specific mount location /var/lib/kubelet/pods/{podID}/volumes/{sanitizedPluginName}/{podSpecVolumeName}/ (a sketch of this step follows the list).
  • Once mounting completes successfully:
    • Acquire a lock on the in-memory cache (block until lock is acquired).
    • Add the pod to the in-memory cache under the volume, to indicate the volume was successfully bind mounted to the pod-specific location.
    • Release the lock on the in-memory cache.
  • Release operation lock for volume.
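On Linux, the bind mount step above amounts to a `mount --bind` of the main mount location onto the pod-specific directory. A minimal sketch follows; the function name is illustrative and error handling is trimmed, and running it requires root on a Linux host.

```go
package volumemanager

import (
	"os"
	"syscall"
)

// bindMountForPod bind mounts the volume's main mount location into the
// pod-specific directory, the equivalent of `mount --bind <src> <dst>`.
func bindMountForPod(mainMountDir, podVolumeDir string) error {
	if err := os.MkdirAll(podVolumeDir, 0750); err != nil {
		return err
	}
	return syscall.Mount(mainMountDir, podVolumeDir, "", syscall.MS_BIND, "")
}
```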

Unmount Bind Mount from Pod Specific Location

  • Spawn a new thread for operation.
    • Abort if there are no more threads available, i.e. there are too many pending operations in-flight (the volume manager loop will retry, if needed).
  • Acquire operation lock for volume so that no other attach/detach/mount/unmount operations can be started for the volume.
    • Abort if there is already a pending operation for the specified volume (the volume manager loop will retry, if needed).
  • Unmount the bind mount at the pod-specific mount location (leaving the main mount location intact).
  • Once unmounting completes successfully:
    • Acquire a lock on the in-memory cache (block until lock is acquired).
    • Remove the pod from the in-memory cache under the volume, to indicate the bind mount was successfully unmounted.
    • Release the lock on the in-memory cache.
  • Release operation lock for volume.

Unmount Device and Detach Volume

  • Spawn a new thread for operation.
    • Abort if there are no more threads available, i.e. there are too many pending operations in-flight (the volume manager loop will retry, if needed).
  • Acquire operation lock for volume so that no other attach/detach/mount/unmount operations can be started for the volume.
    • Abort if there is already a pending operation for the specified volume (the volume manager loop will retry, if needed).
  • Unmount device:
    • Execute the volume-specific logic to unmount the volume from /var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/.
  • If attach logic is configured on (default behavior), and the volume type implements the Detacher interface:
    • Execute the volume-specific logic to detach the specified volume from the specified node.
      • If there is an error indicating the volume does not exist or is not attached to the specified node, assume detachment was successful (this will be the responsibility of the plugin code).
      • For all other errors, log the error, and terminate the thread (the volume manager loop will retry as needed).
  • If attach logic is configured off, make a call to the API server to set the VolumeStatus object under the PodStatus for the volume to indicate that it is safeToDetach.
  • Execute volume-specific logic to verify that volume is detached.
    • If volume does not become detached within a reasonable amount of time, log an error, and terminate the thread (the volume manager loop will retry as needed).
  • Once the volume detaches successfully:
    • Acquire a lock on the in-memory cache (block until lock is acquired).
    • Remove the volume from the in-memory cache, to indicate the volume was successfully unmounted and detached.
    • Release the lock on the in-memory cache.
    • Delete the /var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/ directory.
  • Release operation lock for volume.

Updated February 22, 2016: Fix order of execution of unmount/detach

saad-ali added the kind/design, sig/storage, and team/cluster labels on Feb 24, 2016
@jsafrane (Member) commented:

The proposal looks fine to me, only one thing scares me: you must not detach a volume before unmounting it. There may be some unwritten pages and detaching the volume could corrupt the filesystem or application data on it.

Changing the order of the operations will have some impact on your design; maybe you can unmount the volume first, then detach, and only after that delete the kubernetes.io/{plugin}/mounts/{uniqueVolumeName} directory.

@ghost commented Feb 25, 2016

@saad-ali SGTM... I can start working on this unless you are already working on it

@saad-ali (Member, Author) commented:

The proposal looks fine to me, only one thing scares me: you must not detach a volume before unmounting it.

@jsafrane Absolutely, that is one of the intentions of the design. I brain farted when I wrote that section. Fixed. Thanks for the keen eye.

@saad-ali (Member, Author) commented:

@saad-ali SGTM... I can start working on this unless you are already working on it

Sami, go for it. I'll work on #20262 in parallel. There will be overlap between the two. We can coordinate over Slack. If you can carve out smaller PRs, that would be awesome. Feel free to schedule a VC if you want to discuss anything in depth.

@ghost commented Feb 26, 2016

Sounds good!... I'll try to slice out thinner PRs and we'll coordinate

@ghost commented Mar 17, 2016

@saad-ali

When a volume is attached to the node in read-only mode, and the pod referencing it is deleted and another pod referencing the volume is quickly created in read-write mode, then the volume should be detached and reattached in the correct (read-write) mode.

I have been thinking about this. We discussed yesterday that if we do not use the cache and rely just on the directory structure, we would need to add information to the path about read-write vs read-only attach modes to solve the above issue. I am thinking now that the problem, at least as described above, would be fixed by the serialization of mount/unmount/attach/detach operations. That is, if a pod has a volume mounted as read-only and is deleted, then the detach operation would have to complete before the attach operation for a new pod starts. If a new pod is scheduled but the old pod has not been deleted yet, then the master will not allow it because of a disk conflict. WDYT?

This leads to the question of whether it is possible for the MountManager (as we will write it) to see the created pod before the deleted pod?

@thockin (Member) commented Apr 6, 2016

Just did a full read through. I think this might benefit from the same pseudocode treatment as the binder controller.

Two or more pods scheduled to the same node with the same volume should never
fail (as long as it is allowed by the volume plugin’s AccessModes policy).

Why is this not simply "Two or more pods with the same volume should never fail (as long as it is allowed by the volume plugin’s AccessModes policy)"? Same node or not should have no bearing.

Edit: I read the use case later - is there an issue open? Link?

One or more pods referencing different partitions of the same volume should
not fail

This should be P2 at best - I am not convinced we should really handle this any
more.

On initial startup, the volume manager will read the
/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/ and
/var/lib/kubelet/pods/{podID}/volumes/ directories to figure out which
volumes were attached and mounted to the node before it went down and
pre-populate the in-memory cache.

a) .../plugins/kubernetes.io/{plugin}/mounts/ is overly-specific. You really
mean ../plugins/{plugin}/mounts/ where {plugin} is a two-part path. But in
hindsight that should probably have been escaped. Maybe we can fix that?

b) The intention was that anything under ../plugins/{plugin} is private to
that plugin. You can't assume that a mounts dir exists or what it means.

The volume name used as the key for this table and in the volume name
in-memory cache will be a unique name that includes the plugin name and the
unique name the plugin uses to identify the volume

What does that mean? Is there a place where the volume plugin can report a
globally unique name for a volume? Or are you synthesizing that through a
pod UID + volume name?

Execute the volume-specific logic to mount the volume to
/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/

The path is really GetPluginDir({plugin})/{private} unless we need to
standardize the structure of that further.

@ghost commented Apr 6, 2016

Just to provide an update on some things @saad-ali and I had agreed on (I should have updated the doc earlier :/)

  • We will try to do this without caching any additional information. Basically just use the directory structure and the pod list to decide what to do (mount, unmount etc)
  • The above amounts to keeping all the logic currently in the kubelet but calling it from a manager which essentially serializes it.
  • The worker thread which launches the pod polls the manager until that pod's volumes are ready (as specified in the doc)
  • Once we have achieved correctness with the above we can optimize by parallelizing the manager and only run operations on the same volume in series. To do that we'll need a unique key for each volume (actual storage asset that is). We'll need to extend the plugin interface to provide that because only the plugin would be able to tell that two volumes point to the same device underneath

@saad-ali (Member, Author) commented:

Summarizing offline discussions:

is there an issue open? Link?

Same issue as the previous item, updated and added.

This should be P2 at best - I am not convinced we should really handle this any more.

The key words here are "as long as it doesn’t violate the volume plugin’s AccessModes policy", which means we won't have to do anything special for it. Basically, a partition will be treated the same as a volume. For example, if pod A and pod B reference two different partitions on the same volume and the pods are scheduled to different nodes, the volume will be attached to both nodes only if its AccessModes allow it. Basically we are doing nothing to override the volume access policy.

{plugin} is a two-part path. But in hindsight that should probably have been escaped. Maybe we can fix that?

Will look into it. But unlikely to do it, because backwards compatibility will be painful.

The intention was that anything under ../plugins/{plugin} is private to that plugin. You can't assume that a mounts dir exists or what it means

We can add a new method to the volume plugin to return the mount path for that plugin.

What does that mean? Is there a place where the volume plugin can report a globally unique name for a volume? Or are you synthesizing that through a pod UID + volume name?

Unique name will be {plugin name}/{volume name} so something like kubernetes.io~gce/volume1. The idea here is that we should be able to uniquely identify a disk even if it is referenced under different volume mount/claim names and if two different plugins use the same volume name they shouldn't collide.
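In other words, something along these lines; this helper is purely illustrative, not the actual plugin interface.

```go
package volumemanager

// getUniqueVolumeName combines the plugin name with the plugin-specific
// volume name, e.g. "kubernetes.io~gce" + "/" + "volume1".
func getUniqueVolumeName(pluginName, volumeName string) string {
	return pluginName + "/" + volumeName
}
```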

The path is really GetPluginDir({plugin})/{private} unless we need to standardize the structure of that further.

As long as we're using the method mentioned above (new method to the volume plugin) to get the new mounts directory, we should be able to control its contents (the plugin can decide where it wants it, we'll decide what goes inside it).

@saad-ali (Member, Author) commented:

I am thinking now that the problem, at least as described above, would be fixed by the serialization of mount/unmount/attach/detach operations. That is, if a pod has a volume mounted as read-only and is deleted, then the detach operation would have to complete before the attach operation for a new pod starts. If a new pod is scheduled but the old pod has not been deleted yet, then the master will not allow it because of a disk conflict. WDYT?

Only if your in-memory cache identifies the RO and RW requests for the volume as two different things (and if it does, you'll want to persist that to disk to handle crashes). If you don't differentiate between the two modes, consider the rapid delete/recreate scenario: volume X is mounted RW, its pod is deleted, and a new pod is immediately created referencing X as RO. If the logic does not differentiate between the two, it just sees a new pod referencing a volume that is already attached, so there is nothing to do; and there is no need to trigger detach because, even though the original pod is gone, a new pod references the "same volume", so detach is skipped.
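One purely illustrative way to make the cache distinguish the two modes is to fold the access mode into the cache key; the helper below is an assumption of this sketch, not part of the design.

```go
package volumemanager

// cacheKey folds the access mode into the key so that a read-only and a
// read-write use of the same underlying volume are tracked as two entries
// (and a mode change therefore forces a detach/reattach).
func cacheKey(uniqueVolumeName string, readOnly bool) string {
	if readOnly {
		return uniqueVolumeName + "/ro"
	}
	return uniqueVolumeName + "/rw"
}
```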

k8s-github-robot pushed a commit that referenced this issue May 9, 2016
Automatic merge from submit-queue

Add data structure for managing go routines by name

This PR introduces a data structure for managing go routines by name. It prevents the creation of new go routines if an existing go routine with the same name exists. This will enable parallelization of the designs in #20262 and #21931 with sufficient protection to prevent starting multiple operations on the same volume.
k8s-github-robot pushed a commit that referenced this issue Jun 15, 2016
Automatic merge from submit-queue

Kubelet Volume Attach/Detach/Mount/Unmount Redesign

This PR redesigns the Volume Attach/Detach/Mount/Unmount in Kubelet as proposed in #21931

```release-note
A new volume manager was introduced in kubelet that synchronizes volume mount/unmount (and attach/detach, if attach/detach controller is not enabled).

This eliminates the race conditions between the pod creation loop and the orphaned volumes loops. It also removes the unmount/detach from the `syncPod()` path so volume clean up never blocks the `syncPod` loop.
```
@saad-ali (Member, Author) commented Jun 19, 2016

Closed with #26801 which will be part of v1.3.
