Make volume manager more robust across restarts #27653
Comments
/cc @kubernetes/sig-node @dchen1107 Item (2) will lead to irremovable pod directories (see #27651 for an example), and these directories will accumulate over time. kubelet would also try to clean them up every 2s and fail, wasting cycles and spamming the log.
Only if kubelet is restarted and a pod with volumes mounted is deleted during that period. I agree it must be fixed, but I don't think it's a v1.3 blocker; I think it can wait for v1.3.1. @matchstick for triage.
Almost every pod has a default secret volume, so this applies to most pods. The difference between this issue and many other issues is that it's not transient and haunts kubelet forever even after restarts, and would only get worse as time goes by.
Could this be documented as a known issue in the release notes if the decision is to table this for v1.3.1? Thanks!
/cc @freehan, this is why your kubelet log is full of error messages :)
Spoke with @matchstick. We're going to try to get it in for v1.3. If not, we'll punt it to post-1.3.
FWIW, this is very easy to reproduce using local-up-cluster.sh: after local-up, create a pod, nuke the cluster, local-up again, and then we will see messages like
@jingxu97 is working on implementing a cleanup routine on startup.
A very simple script to reproduce this issue in an e2e cluster:
#!/bin/bash
NODES=( $(gcloud compute instances list $ZONE | awk '{print $1}' | grep e2e-test- | grep minion) )
KUBECTL="./cluster/kubectl.sh"
RC="
apiVersion: v1
kind: ReplicationController
metadata:
name: foo
labels:
k8s-app: foo
spec:
replicas: 90
selector:
k8s-app: foo
template:
metadata:
labels:
k8s-app: foo
spec:
containers:
- image: gcr.io/google_containers/pause:2.0
name: bar
"
# Create N pods.
echo "${RC}" | ${KUBECTL} create -f -
# Sleep a while and assume the pods will be running by then.
sleep 60
# Forcefully delete the pods.
${KUBECTL} delete rc --grace-period=0 foo
# Kill kubelet on each node.
# NOTE: The issue reproduces more reliably if killing kubelet
# overlaps with the RC deletion command.
for NODE in "${NODES[@]}"; do
echo "Killing kubelet on node ${NODE}"
gcloud compute ssh ${NODE} --command "sudo pkill kubelet"
done
# Give kubelet time to start up.
sleep 120
# Grep the error messages from the logs.
for NODE in "${NODES[@]}"; do
echo "Checking error messages on node ${NODE}"
gcloud compute ssh ${NODE} --command "grep \"Failed to remove orphaned pod\" /var/log/kubelet.log"
done
The design for this should also handle
Status on this?
Punted to post-1.3. Hopefully part of a 1.3.x release.
OK, we need to make sure we don't lose track of this one, but it is not blocking the 1.3 release. Raising to P0 and removing from the 1.3 milestone.
This is an issue marked v1.3 and kind/documentation. Can you get it fixed up this week?
Automatic merge from submit-queue Add volume reconstruct/cleanup logic in kubelet volume manager Currently kubelet volume management works on the concept of desired and actual worlds of state. The volume manager periodically compares the two worlds and performs volume mount/unmount and/or attach/detach operations. When kubelet restarts, the caches of those two worlds are gone. Although the desired world can be recovered through the apiserver, the actual world cannot, which may leave some volumes impossible to clean up if their information has been deleted from the apiserver. This change adds reconstruction of the actual world by reading the pod directories from disk. The reconstructed volume information is added to both the desired world and the actual world if it cannot be found in either. The rest of the logic is the same as before: the desired world populator may clean up a volume entry if it is no longer in the apiserver, and the volume manager then invokes unmount to clean it up. Fixes #27653
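To make the reconstruction step above concrete, here is a minimal, standalone Go sketch that scans the kubelet pod directories and rebuilds a list of volumes present on disk. It assumes the usual on-disk layout /var/lib/kubelet/pods/<podUID>/volumes/<pluginName>/<volumeName>; the type and function names are illustrative and are not the actual volume manager code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reconstructedVolume is an illustrative record of a volume directory found on disk.
type reconstructedVolume struct {
	PodUID     string
	PluginName string
	VolumeName string
	Path       string
}

// reconstructVolumes scans <kubeletDir>/pods/<podUID>/volumes/<pluginName>/<volumeName>
// and returns one record per volume directory it finds, approximating the
// "rebuild the actual world from disk" step described in the merge message above.
func reconstructVolumes(kubeletDir string) ([]reconstructedVolume, error) {
	pattern := filepath.Join(kubeletDir, "pods", "*", "volumes", "*", "*")
	matches, err := filepath.Glob(pattern)
	if err != nil {
		return nil, err
	}
	var vols []reconstructedVolume
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil || !info.IsDir() {
			continue // skip files and anything we cannot stat
		}
		vols = append(vols, reconstructedVolume{
			PodUID:     filepath.Base(filepath.Dir(filepath.Dir(filepath.Dir(path)))),
			PluginName: filepath.Base(filepath.Dir(path)),
			VolumeName: filepath.Base(path),
			Path:       path,
		})
	}
	return vols, nil
}

func main() {
	vols, err := reconstructVolumes("/var/lib/kubelet")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, v := range vols {
		fmt.Printf("pod %s: plugin %s, volume %s (%s)\n", v.PodUID, v.PluginName, v.VolumeName, v.Path)
	}
}

In the real fix, the reconstructed entries feed both the desired and actual worlds so the existing reconciliation loop can unmount anything no longer wanted by the apiserver.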
Automatic merge from submit-queue (batch tested with PRs 41139, 41186, 38882, 37698, 42034) Make kubelet never delete files on mounted filesystems With bug #27653, kubelet could remove mounted volumes and delete user data. The bug itself is fixed; however, our trust in kubelet is significantly lower. Let's add an extra version of RemoveAll that does not cross mount boundaries (rm -rf --one-file-system). It calls lstat(path) three times for each removed directory: once in RemoveAllOneFilesystem and twice in IsLikelyNotMountPoint. However, this way it is platform-independent, and the directory being removed by kubelet should be almost empty.
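As a rough illustration of the rm -rf --one-file-system behavior described above, here is a minimal Linux/Unix Go sketch that refuses to delete anything living on a different filesystem than the directory being removed. It compares st_dev values directly rather than calling kubelet's IsLikelyNotMountPoint, so it is an approximation, not the actual RemoveAllOneFilesystem implementation; the path in main is hypothetical.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// deviceID returns the st_dev of path without following symlinks (Linux/Unix only).
func deviceID(path string) (uint64, error) {
	var st syscall.Stat_t
	if err := syscall.Lstat(path, &st); err != nil {
		return 0, err
	}
	return uint64(st.Dev), nil
}

// removeAllOneFilesystem removes path recursively but stops with an error if it
// encounters an entry on a different filesystem, so a volume that is still
// mounted under path is never deleted.
func removeAllOneFilesystem(path string) error {
	rootDev, err := deviceID(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	entries, err := os.ReadDir(path)
	if err != nil {
		return err
	}
	for _, e := range entries {
		child := filepath.Join(path, e.Name())
		dev, err := deviceID(child)
		if err != nil {
			return err
		}
		if dev != rootDev {
			return fmt.Errorf("refusing to delete %s: it is on another filesystem (likely a mount point)", child)
		}
		if e.IsDir() {
			if err := removeAllOneFilesystem(child); err != nil {
				return err
			}
			continue
		}
		if err := os.Remove(child); err != nil {
			return err
		}
	}
	return os.Remove(path)
}

func main() {
	// Hypothetical orphaned pod directory, used only for illustration.
	if err := removeAllOneFilesystem("/var/lib/kubelet/pods/orphaned-pod-uid"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}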
When kubelet is restarted, volume manager loses its internal state. It rebuilds this state and calls attach/mount on volumes that may already be attached/mounted.
This issue is to track two items:
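For context on the re-mount concern described above, a mount step is commonly made safe to repeat by first checking whether the target directory is already a mount point. The sketch below is a minimal, Linux-only approximation that parses /proc/self/mountinfo; it is not kubelet's IsLikelyNotMountPoint, and the target path is hypothetical.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isMountPoint reports whether target is listed as a mount point in
// /proc/self/mountinfo (Linux only; escaped paths are not handled in this sketch).
func isMountPoint(target string) (bool, error) {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	clean := filepath.Clean(target)
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		// Field 5 (index 4) of a mountinfo line is the mount point.
		if len(fields) > 4 && fields[4] == clean {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical volume target path, used only for illustration.
	target := "/var/lib/kubelet/pods/some-pod-uid/volumes/kubernetes.io~secret/default-token"
	mounted, err := isMountPoint(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if mounted {
		fmt.Println("already mounted: skip the mount call")
		return
	}
	fmt.Println("not mounted: safe to (re)mount")
}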