-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle force detach #67847
Handle force detach #67847
Conversation
/sig storage |
This fixes a regression caused by #66835 |
/approve |
/lgtm |
/cc @shay-berman |
@verult: GitHub didn't allow me to request PR reviews from the following users: shay-berman. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
ac31aef
to
aca2645
Compare
@verult captured the failing case with integration tests now. Also manaually tested the PR on AWS. It should work. PTAL. |
// we can detect the terminated pods(volumes) as well as pods(volumes) which are not terminated but in unavailable node here. | ||
// if pod.DeletionTimestamp is set, and after pod.DeletionGracePeriodSeconds, the pod is still not deleted, we assume that | ||
// the pod is in broken node, and we should trigger detach for pod volumes. | ||
// BTW: if the node is unavailable, we will trigger pod deletion operation after 5 minutes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 5min param is configurable with the pod-eviction-timeout controller manager flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
Name: "fakepod", | ||
ContainerID: "abcdee", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the container State need to be set to Running, to make it more explicit?
And it might be more realistic if ContainerStatus is added in the initial pod object, before deletionTimestamp is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, adding ContainerStatus in the initial pod object seems better.
// here we do not need to wait for 5 minutes, set deletion timestamp and grace period directly | ||
pod.DeletionTimestamp = &metav1.Time{Time: time.Now()} | ||
dgp := int64(1) | ||
pod.DeletionGracePeriodSeconds = &dgp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test start up a live master? If so could it overwrite DeletionGracePeriod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can set the TerminationGracePeriodSeconds
value to 1 second too.
@yastij some overlap yes but essentially these are 2 different things. Your feature enables skipping of volume mount verification if node is known to have "shutdown" taint whereas this feature makes sure that volumes from pods in "unknown" but "deleted" state are detached. |
SGTM |
aca2645
to
7699bd8
Compare
/test pull-kubernetes-kubemark-e2e-gce-big |
/lgtm Thanks @NickrenREN and @gnufied ! |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: childsb, gnufied, jsafrane, verult If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @saad-ali for approval |
@gnufied: GitHub didn't allow me to assign the following users: for, approval. Note that only kubernetes members and repo collaborators can be assigned. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM one question
// deleted pod that should be gone and volumes in it should be removed from dswp if they are not deleted after DeletionGracePeriod seconds | ||
// this can happen if a node is lost, and the kubelet is never able to confirm deletion. | ||
func isPodNeededToBeRemoved(pod *v1.Pod) bool { | ||
if pod.DeletionTimestamp != nil && pod.DeletionGracePeriodSeconds != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about:
podStatus.Phase == v1.PodFailed || podStatus.Phase == v1.PodSucceeded
If pod errors out or completes but hasn't been deleted yet? IsPodTerminated(...)
currently handles that as a termination state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pod/Volumes in termination state should be detached.
If the assumption above is right, it is ok here because both IsPodTerminated(...)
and isPodNeededToBeRemoved
are called in DetermineVolumeAction
.
Correct me please if i am wrong, @saad-ali
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saad-ali yeah IsPodTerminated
handles those as terminated state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, true, in those cases the third statement isn't checked.
please review my comment -> #65392 (comment) and let me know what do you think. |
Based on ongoing discussion in linked github issue. /hold |
/uncc |
/test all Tests are more than 96 hours old. Re-running tests. |
/close |
@gnufied: Closing this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@gnufied what is the mitigation of this closed PR? |
Fix volume detach from shutdown nodes.
This is a variant of #67419
Fixes #65392