Detailed Design for Volume Mount/Unmount Redesign #21931
Comments
The proposal looks fine to me, only one thing scares me: you must not detach a volume before unmounting it. There may be unwritten pages, and detaching the volume could corrupt the filesystem or application data on it. Changing the order of the operations will have some impact on your design; maybe you can unmount the volume first, then detach, and only after that delete.
@saad-ali SGTM... I can start working on this unless you are already working on it.
@jsafrane Absolutely, that is one of the intentions of the design. I brain farted when I wrote that section. Fixed. Thanks for the keen eye.
Sami, go for it. I'll work on #20262 in parallel. There will be overlap between the two. We can coordinate over Slack. If you can carve out smaller PRs, that would be awesome. Feel free to schedule a VC if you want to discuss anything in depth.
Sounds good!... I'll try to slice out thinner PRs and we'll coordinate |
I have been thinking about this. We discussed yesterday that if we do not use the cache and rely just on the directory structure, we would need to add information to the path about read-write vs read-only attach modes to solve the above issue. I am thinking now that the problem, at least as described above, would be fixed by the serialization of mount/unmount/attach/detach operations. That is, if a pod has a volume mounted as read-only and is deleted, then the detach operation would have to complete before the attach operation for a new pod starts. If a new pod is scheduled but the old pod has not been deleted yet, then the master will not allow it because of a disk conflict. WDYT? This leads to the question of whether it is possible for the MountManager (as we will write it) to see the created pod before the deleted pod.
Just did a full read through. I think this might benefit from the same pseudocode treatment as the binder controller.
Why is this not simply "Two or more pods with the same volume should never fail Edit: I read the use case later - is there an issue open? Link?
This should be P2 at best - I am not convinced we should really handle this any
a) b) The intention was that anything under
What does that mean? Is there a place where the volume plugin can report a
The path is really
Just to provide an update on some things @saad-ali and I had agreed on (I should have updated the doc earlier :/)
Summarizing offline discussions:
Same issue as the previous item, updated and added.
The key words here are "as long as it doesn’t violate volume plugin’s AccessModes policy". Which means we won't have to do anything special for it. Basically a partition will be treated the same as a volume. For example, if pod A and pod B reference two different partitions on the same volume, if the pods are scheduled to different nodes, only if the AccessModes of the underlying volume allow it will it be attached to both nodes. Basically we are doing nothing to override the volume access policy.
Will look into it. But unlikely to do it, because backwards compatibility will be painful.
We can add a new method to the volume plugin to return the mount path for that plugin.
Unique name will be {plugin name}/{volume name} so something like
As long as we're using the method mentioned above (new method to the volume plugin) to get the new mounts directory, we should be able to control its contents (the plugin can decide where it wants it, we'll decide what goes inside it).
Only if your in-memory cache identifies the RO and RW requests for the volume as two different things (which, if it does, you'll want to persist to disk to handle crashes). If you don't differentiate between the two modes, consider the rapid delete/recreate scenario: volume X is mounted RW, gets deleted, and is immediately recreated as RO. If the logic does not differentiate between the two, it just sees a new pod referencing a volume that is already attached: nothing to do here. And no need to trigger detach, because even though the original volume is gone, there is a new pod referencing the "same volume", so we'll skip detach for now.
Automatic merge from submit-queue Add data structure for managing go routines by name This PR introduces a data structure for managing go routines by name. It prevents the creation of new go routines if an existing go routine with the same name exists. This will enable parallelization of the designs in #20262 and #21931 with sufficient protection to prevent starting multiple operations on the same volume.
Automatic merge from submit-queue

Kubelet Volume Attach/Detach/Mount/Unmount Redesign

This PR redesigns the Volume Attach/Detach/Mount/Unmount in Kubelet as proposed in #21931

```release-note
A new volume manager was introduced in kubelet that synchronizes volume mount/unmount (and attach/detach, if attach/detach controller is not enabled). This eliminates the race conditions between the pod creation loop and the orphaned volumes loops. It also removes the unmount/detach from the `syncPod()` path so volume clean up never blocks the `syncPod` loop.
```
Closed with #26801, which will be part of v1.3.
(cherry picked from commit e0ff14b)
Automatic merge from submit-queue

Kubelet Volume Attach/Detach/Mount/Unmount Redesign

This PR redesigns the Volume Attach/Detach/Mount/Unmount in Kubelet as proposed in kubernetes/kubernetes#21931

```release-note
A new volume manager was introduced in kubelet that synchronizes volume mount/unmount (and attach/detach, if attach/detach controller is not enabled). This eliminates the race conditions between the pod creation loop and the orphaned volumes loops. It also removes the unmount/detach from the `syncPod()` path so volume clean up never blocks the `syncPod` loop.
```
Objective
Background
In the existing Kubernetes design the kubelet is responsible for determining what volumes to attach/mount and detach/unmount from the node it is running on.
The loop in Kubelet that is responsible for attaching and mounting volumes (the pod creation loop) runs separately from, and completely independently of (on a separate thread), the loop that is responsible for unmounting and detaching volumes (the orphaned volumes loop). This leads to race conditions between the asynchronous pod creation and orphaned volumes loops.
Although there is some logic in the GCE PD, AWS, and Cinder plugins to make sure that the actual attach/detach operations don't interrupt each other, there is no guarantee as to the order of the operations themselves. For example, when a pod is created and then rapidly deleted and recreated, kubelet attaches the volume and mounts it for the pod; if the second attach operation is triggered before the detach operation (which often happens), kubelet will complete the second attach operation successfully (since the disk is already attached), and the pending detach operation will then detach a disk that is still in use (which appears as data loss to the user).
To mask this behavior, kubelet currently fails attach operations if the disk is already attached. This allows the second attach operation to fail, and the subsequent detach operation to succeed; further retries of the attach operation then succeed.
Although this workaround masks a nasty bug (apparent data loss to the end user), it results in other unwanted behavior (bugs):
The Volume Attach/Detach Controller design (#20262) plans to move the attach/detach logic from kubelet to the master. However, the kubelet will still be responsible for mounting/unmounting (and, for backwards-compatibility reasons, attach/detach in some cases), so these issues must be addressed.
Solution Overview
Introduce a new asynchronous loop, called the volume manager loop, in kubelet that handles attach/detach and mount/unmount in a serialized manner. The existing orphaned volumes loop will be removed, and the logic for unmounting/detaching volumes will be moved to the volume manager loop. Similarly, the logic for determining which volumes to mount/attach will be moved from the pod creation loop to the volume manager loop. The pod creation loop will simply poll the new volume manager until its volumes are ready for use (attached and mounted).
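To make the division of labor concrete, here is a minimal Go sketch of how the pod creation loop could poll the volume manager instead of performing attach/mount itself. The `VolumeManager` interface, `VolumesReady` method, and `waitForVolumes` helper are illustrative assumptions, not the actual kubelet API.

```go
package volumemanager

import (
	"errors"
	"time"
)

// VolumeManager is a hypothetical view of the new component: the pod
// creation loop only asks whether a pod's volumes are attached and mounted.
type VolumeManager interface {
	VolumesReady(podUID string) bool
}

// waitForVolumes polls the volume manager until the pod's volumes are ready
// or the timeout expires, instead of mounting anything itself.
func waitForVolumes(vm VolumeManager, podUID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if vm.VolumesReady(podUID) {
			return nil
		}
		time.Sleep(500 * time.Millisecond) // poll interval is illustrative
	}
	return errors.New("timed out waiting for volumes to attach/mount")
}
```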
Detailed Design
The volume manager will maintain an in-memory cache containing a list of volumes that are required by the node (i.e. volumes that are referenced by pods scheduled to the node that the kubelet is running on). Each of these volumes will, in addition to the volume name, specify if the volume is mounted in read-only mode, and list the pods referencing the volume. This cache defines the state of the world according to the volume manager. This cache must be thread-safe.
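A minimal sketch of what such a thread-safe cache could look like in Go. The type and field names are assumptions for illustration; the read-only flag is tracked per the discussion above about distinguishing RO and RW requests.

```go
package volumemanager

import "sync"

// desiredVolume is one entry in the volume manager's in-memory cache: a
// volume required by this node, whether it is requested read-only, and the
// pods referencing it. All names are illustrative, not the real kubelet types.
type desiredVolume struct {
	uniqueVolumeName string          // e.g. "{plugin name}/{volume name}"
	readOnly         bool
	referencingPods  map[string]bool // set of pod UIDs
}

// desiredStateCache is the thread-safe "state of the world" according to
// the volume manager.
type desiredStateCache struct {
	mu      sync.RWMutex
	volumes map[string]*desiredVolume // keyed by unique volume name
}

func newDesiredStateCache() *desiredStateCache {
	return &desiredStateCache{volumes: map[string]*desiredVolume{}}
}

// addPod records that podUID requires the given volume.
func (c *desiredStateCache) addPod(uniqueVolumeName, podUID string, readOnly bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.volumes[uniqueVolumeName]
	if !ok {
		v = &desiredVolume{
			uniqueVolumeName: uniqueVolumeName,
			readOnly:         readOnly,
			referencingPods:  map[string]bool{},
		}
		c.volumes[uniqueVolumeName] = v
	}
	v.referencingPods[podUID] = true
}
```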
On initial startup, the volume manager will read the `/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/` and `/var/lib/kubelet/pods/{podID}/volumes/` directories to figure out which volumes were attached and mounted to the node before it went down, and pre-populate the in-memory cache.
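As a rough illustration of that startup step, the sketch below globs both directory layouts to recover existing mounts. The helper name and the simplified error handling are assumptions; real code would also parse the plugin and volume names out of each path and handle unique volume names that span multiple path segments.

```go
package volumemanager

import "path/filepath"

// scanExistingMounts lists the per-plugin mounts directories and the
// per-pod volumes directories so the in-memory cache can be pre-populated
// after a kubelet restart. Paths follow the layout described above.
func scanExistingMounts(kubeletDir string) ([]string, error) {
	var found []string
	patterns := []string{
		filepath.Join(kubeletDir, "plugins/kubernetes.io/*/mounts/*"),
		filepath.Join(kubeletDir, "pods/*/volumes/*/*"),
	}
	for _, p := range patterns {
		matches, err := filepath.Glob(p)
		if err != nil {
			return nil, err
		}
		found = append(found, matches...)
	}
	return found, nil
}
```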
Primary Control Loop
The volume manager will have a loop that does the following: for each pod whose `PodPhase` is `Pending`, check the volumes it defines or references (dereferencing any `PersistentVolumeClaims` to fetch the associated `PersistentVolume` objects). For each of these volumes, add it and the referencing pod to the in-memory cache; a volume stays in the cache until every pod referencing it has a `PodPhase` of `Succeeded` or `Failed`.

Attach, detach, mount, and unmount operations can take a long time to complete, so the primary volume manager loop should not block on these operations. Instead, the primary loop should spawn new threads for these operations. The number of threads that can be spawned will be capped (possibly using a thread pool), and once the cap is hit, subsequent requests will have to wait for a thread to become available.
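One way to express the capped-threads idea in Go is a buffered channel used as a semaphore; the cap value and the names below are illustrative, not the actual implementation.

```go
package volumemanager

// The primary loop must not block on slow attach/detach/mount/unmount
// calls, so it hands them to goroutines whose number is capped.
const maxConcurrentOperations = 10 // illustrative cap

type operationRunner struct {
	sem chan struct{}
}

func newOperationRunner() *operationRunner {
	return &operationRunner{sem: make(chan struct{}, maxConcurrentOperations)}
}

// run blocks until a slot is free, then executes op on its own goroutine.
func (r *operationRunner) run(op func()) {
	r.sem <- struct{}{} // acquire a slot (waits if the cap is reached)
	go func() {
		defer func() { <-r.sem }() // release the slot when done
		op()
	}()
}
```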
To prevent multiple attach/detach or mount/unmount operations on the same volume, the main thread will maintain a table mapping volumes to currently active operations.
The volume name used as the key for this table and in the in-memory cache will be a unique name that includes the plugin name and the unique name the plugin uses to identify the volume, not the volume name specified in the pod spec (because the same volume can be specified under two different pod definitions with different names).
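A simplified sketch of that table, keyed by the unique volume name. It echoes the goroutine-by-name structure mentioned in the merge comment above, but the type and method names here are assumptions.

```go
package volumemanager

import (
	"fmt"
	"sync"
)

// activeOperations maps a unique volume name ("{plugin name}/{volume name}")
// to an in-flight attach/detach/mount/unmount operation, so a second
// operation on the same volume is rejected until the first completes.
type activeOperations struct {
	mu      sync.Mutex
	pending map[string]string // uniqueVolumeName -> operation name
}

func (a *activeOperations) start(uniqueVolumeName, operation string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if existing, ok := a.pending[uniqueVolumeName]; ok {
		return fmt.Errorf("operation %q already in progress for %s", existing, uniqueVolumeName)
	}
	if a.pending == nil {
		a.pending = map[string]string{}
	}
	a.pending[uniqueVolumeName] = operation
	return nil
}

func (a *activeOperations) finish(uniqueVolumeName string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.pending, uniqueVolumeName)
}
```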
Attach Volume and Mount Device
If the volume plugin implements the `Attacher` interface: attach the volume, and once the `VolumeStatus` object under the `PodStatus` for the volume indicates that it is `safeToMount`, mount the device to `/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/`.
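For context, a sketch of what an `Attacher` interface along these lines might look like; only the interface name appears in the proposal, so the method set below is an assumption for illustration.

```go
package volumemanager

// Attacher is a sketch of an attach-capable plugin interface for this
// design. The method names and signatures are illustrative only.
type Attacher interface {
	// Attach attaches the volume to the given node and returns the device path.
	Attach(uniqueVolumeName, nodeName string) (devicePath string, err error)
	// MountDevice mounts the attached device to the global mount path
	// /var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/.
	MountDevice(uniqueVolumeName, devicePath, globalMountPath string) error
}
```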
Bind Mount to Pod Specific Location
Bind mount the volume's global mount directory (or the attach directory `/var/lib/kubelet/plugins/kubernetes.io/{plugin}/attached/{uniqueVolumeName}/`) to the pod-specific path `/var/lib/kubelet/pods/{podID}/volumes/{sanatizedPluginName}/{podSpecVolumeName}/`.
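The bind mount itself could be performed roughly as sketched below using a Linux bind mount; the helper name and parameters are assumptions, and concerns like read-only remounts and SELinux labels are omitted.

```go
package volumemanager

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// bindMountForPod bind mounts the volume's global mount directory into the
// pod-specific location, so each pod sees the same underlying device mount.
func bindMountForPod(globalMountPath, kubeletDir, podUID, sanitizedPluginName, podSpecVolumeName string) error {
	podPath := filepath.Join(kubeletDir, "pods", podUID, "volumes", sanitizedPluginName, podSpecVolumeName)
	if err := os.MkdirAll(podPath, 0750); err != nil {
		return err
	}
	// MS_BIND makes the pod path another view of the global mount point.
	return unix.Mount(globalMountPath, podPath, "", unix.MS_BIND, "")
}
```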
Unmount Bind Mount from Pod Specific Location
Unmount Device and Detach Volume
Unmount the device from `/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/`, then update the `VolumeStatus` object under the `PodStatus` for the volume to indicate that it is `safeToDetach`. Once the volume has been detached, delete the `/var/lib/kubelet/plugins/kubernetes.io/{plugin}/mounts/{uniqueVolumeName}/` directory.
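A sketch of that ordering (unmount, then signal safe-to-detach, then clean up after detach). The function and its two callback parameters are placeholders for illustration, not real kubelet APIs.

```go
package volumemanager

import (
	"os"

	"golang.org/x/sys/unix"
)

// unmountDeviceAndCleanup unmounts the global mount point first, only then
// signals that the volume is safe to detach, and removes the directory once
// the volume is actually detached. markSafeToDetach and waitForDetach stand
// in for the status-update and detach steps.
func unmountDeviceAndCleanup(globalMountPath string,
	markSafeToDetach func() error, waitForDetach func() error) error {
	if err := unix.Unmount(globalMountPath, 0); err != nil {
		return err
	}
	if err := markSafeToDetach(); err != nil {
		return err
	}
	if err := waitForDetach(); err != nil {
		return err
	}
	return os.Remove(globalMountPath) // delete the now-empty mounts/{uniqueVolumeName}/ dir
}
```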
Updated February 22, 2016: Fix order of execution of unmount/detach