Forcefully detach the volumes on pod deletion if kubelet is unavailable #67419
Conversation
/retest
/assign
This implementation only checks for the grace period upon pod update, so it won't trigger pod deletion from the DSW at the right time.
Talked to @saad-ali offline; his proposal is to delete the pod from the DSW as soon as deletionTimestamp is set, and configure the detach timeout that waits for volume unmount to be 6min + the pod's deletionGracePeriod. If multiple pods on the same node have the volume mounted, go for the longest wait time to be safe, i.e. choose the latest pod deletionTimestamp + deletionGracePeriod to begin the 6min countdown.
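A minimal Go sketch of that wait-time calculation, assuming hypothetical names (`detachDeadline`, `maxWaitForUnmount`) rather than the actual attach/detach controller code:

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// detachDeadline is a hypothetical helper: for every pod on the node that still
// mounts the volume, it takes deletionTimestamp + deletionGracePeriodSeconds and
// returns the latest of those instants plus the unmount wait (~6 minutes), i.e.
// the earliest moment at which a force detach could safely start.
func detachDeadline(pods []*v1.Pod, maxWaitForUnmount time.Duration) time.Time {
	var latest time.Time
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			continue // pod is not being deleted; it does not contribute a deadline
		}
		end := pod.DeletionTimestamp.Time
		if pod.DeletionGracePeriodSeconds != nil {
			end = end.Add(time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second)
		}
		if end.After(latest) {
			latest = end
		}
	}
	return latest.Add(maxWaitForUnmount)
}
```

Used this way, the 6min countdown only starts once the most generous grace period among the pods on the node has expired.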
@@ -155,9 +156,30 @@ func DetermineVolumeAction(pod *v1.Pod, desiredStateOfWorld cache.DesiredStateOf
// should be detached or not
return keepTerminatedPodVolume
}
if isPodNeededToBeRemoved(pod) {
This will remove all pods after the grace period, not just terminated pods.
Maybe we should have two different versions of the IsPodTerminated call: one that checks container status and one that doesn't (for this method).
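A rough sketch of what the two variants could look like; the names (`isPodTerminated`, `isPodDeleted`, `notRunning`) and exact conditions here are assumptions for illustration, not the code in this PR:

```go
package main

import v1 "k8s.io/api/core/v1"

// isPodTerminated mirrors the existing style of check: the pod has finished, or
// it is being deleted and none of its containers are still running.
func isPodTerminated(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded ||
		(pod.DeletionTimestamp != nil && notRunning(pod.Status.ContainerStatuses))
}

// isPodDeleted drops the container-status condition, so it can still fire when
// the kubelet is unreachable and container statuses never get updated.
func isPodDeleted(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodFailed || pod.Status.Phase == v1.PodSucceeded ||
		pod.DeletionTimestamp != nil
}

// notRunning reports whether none of the given containers are running.
func notRunning(statuses []v1.ContainerStatus) bool {
	for _, status := range statuses {
		if status.State.Terminated == nil && status.State.Waiting == nil {
			return false
		}
	}
	return true
}
```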
The container status checking is done in IsPodTerminated, which is also called in DetermineVolumeAction.
Should we still add container status checking after the grace period?
Is there a case where DeletionTimestamp will be set for pods that are not being terminated? Again, I think we should verify with sig-node and confirm at what point we should consider a pod as "deleted" and hence proceed with detaching the volumes it was using. cc @sjenning @derekwaynecarr
I think this may not be enough to fix the bug: while this function will cause removal of the pod from the DSWP, I think the next function that adds pods will add it right back, and the volume will not get detached.
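To illustrate the concern (purely a sketch; the function and variable names below are made up, not the DSWP code touched by this PR): removing a deleted pod only helps if the add path applies the same predicate, otherwise the next sync puts it back.

```go
package main

import v1 "k8s.io/api/core/v1"

// syncPods sketches one populator pass over a toy "desired state" map. Without
// the podDeleted guard in the first branch, any pod removed there would simply
// be re-added by the add path on the following sync and its volume never detached.
func syncPods(pods []*v1.Pod, dsw map[string]*v1.Pod, podDeleted func(*v1.Pod) bool) {
	for _, pod := range pods {
		key := string(pod.UID)
		if podDeleted(pod) {
			delete(dsw, key) // removal path: drop the pod from the desired state
			continue         // ...and make sure the add path below skips it too
		}
		dsw[key] = pod // add path: re-adds every pod not filtered out above
	}
}
```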
Not really.
cc @sjenning, likely this is also related to the bug we saw on AWS where volumes from shutdown nodes are not being detached, because pods are left in the "Running" phase but have a deletionTimestamp set.
@@ -135,3 +135,117 @@ func TestFindAndAddActivePods_FindAndRemoveDeletedPods(t *testing.T) {
	}
}

func TestFindAndRemoveDeletedPodsInFailedNodes(t *testing.T) {
We should try to get some e2e tests for this. IIRC, this code path broke before too (but for a different reason).
@NickrenREN ACK. It might be helpful to also document somewhere all the timing parameters involved, including deletionGracePeriod, the DSWP sync loop sleep period, and later the detach timeout for the actual detach.
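As a starting point, a hedged sketch of the timing knobs involved; the names and values below are assumptions for illustration, not the controller's actual configuration:

```go
package main

import "time"

// Rough inventory of the timing parameters that interact here. The worst-case
// time until a force detach is approximately:
//   latest(deletionTimestamp + deletionGracePeriod) + dswpSyncPeriod + maxWaitForUnmount
const (
	// How often the desired state of world populator re-lists pods.
	dswpSyncPeriod = 1 * time.Minute

	// How long the attach/detach controller waits for the kubelet to confirm
	// the unmount before force-detaching (the "6min" discussed above).
	maxWaitForUnmount = 6 * time.Minute
)
```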
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: NickrenREN. If they are not already assigned, you can assign the PR to them by writing /assign in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Some related work on the node shutdown side: #66213
/retest
/close
@NickrenREN: Closing this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Forcefully detach the volumes on pod deletion if kubelet is unavailable
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65392
Special notes for your reviewer:
Release note:
/sig storage
/kind bug
/assign @verult