Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle force detach #67847

Closed
wants to merge 3 commits into from
Closed

Conversation

gnufied
Copy link
Member

@gnufied gnufied commented Aug 24, 2018

Fix volume detach from shutdown nodes.

This is a variant of #67419

Fixes #65392

Detach volumes from shutdown nodes

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 24, 2018
@k8s-ci-robot k8s-ci-robot requested review from fejta and jingxu97 August 24, 2018 19:30
@gnufied
Copy link
Member Author

gnufied commented Aug 24, 2018

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 24, 2018
@childsb
Copy link
Contributor

childsb commented Aug 24, 2018

This fixes a regression caused by #66835

@childsb
Copy link
Contributor

childsb commented Aug 24, 2018

/approve

@jsafrane
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2018
@verult
Copy link
Contributor

verult commented Aug 27, 2018

/cc @shay-berman

@k8s-ci-robot
Copy link
Contributor

@verult: GitHub didn't allow me to request PR reviews from the following users: shay-berman.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @shay-berman

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gnufied gnufied force-pushed the handle-force-detach branch from ac31aef to aca2645 Compare August 27, 2018 21:35
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 27, 2018
@gnufied
Copy link
Member Author

gnufied commented Aug 27, 2018

@verult captured the failing case with integration tests now. Also manaually tested the PR on AWS. It should work. PTAL.

// we can detect the terminated pods(volumes) as well as pods(volumes) which are not terminated but in unavailable node here.
// if pod.DeletionTimestamp is set, and after pod.DeletionGracePeriodSeconds, the pod is still not deleted, we assume that
// the pod is in broken node, and we should trigger detach for pod volumes.
// BTW: if the node is unavailable, we will trigger pod deletion operation after 5 minutes,
Copy link
Contributor

@verult verult Aug 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5min param is configurable with the pod-eviction-timeout controller manager flag.

@NickrenREN

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @verult,
updated on my branch #67419, please rebase again @gnufied

{
Name: "fakepod",
ContainerID: "abcdee",
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the container State need to be set to Running, to make it more explicit?

And it might be more realistic if ContainerStatus is added in the initial pod object, before deletionTimestamp is set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, adding ContainerStatus in the initial pod object seems better.

// here we do not need to wait for 5 minutes, set deletion timestamp and grace period directly
pod.DeletionTimestamp = &metav1.Time{Time: time.Now()}
dgp := int64(1)
pod.DeletionGracePeriodSeconds = &dgp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test start up a live master? If so could it overwrite DeletionGracePeriod?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can set the TerminationGracePeriodSeconds value to 1 second too.

@yastij
Copy link
Member

yastij commented Aug 28, 2018

@gnufied - does this overlap with #66213 ?

@gnufied
Copy link
Member Author

gnufied commented Aug 28, 2018

@yastij some overlap yes but essentially these are 2 different things. Your feature enables skipping of volume mount verification if node is known to have "shutdown" taint whereas this feature makes sure that volumes from pods in "unknown" but "deleted" state are detached.

@yastij
Copy link
Member

yastij commented Aug 28, 2018

SGTM

@gnufied gnufied force-pushed the handle-force-detach branch from aca2645 to 7699bd8 Compare August 28, 2018 16:33
@gnufied
Copy link
Member Author

gnufied commented Aug 28, 2018

/test pull-kubernetes-kubemark-e2e-gce-big

@verult
Copy link
Contributor

verult commented Aug 28, 2018

/lgtm

Thanks @NickrenREN and @gnufied !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: childsb, gnufied, jsafrane, verult
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

If they are not already assigned, you can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gnufied
Copy link
Member Author

gnufied commented Aug 28, 2018

/assign @saad-ali for approval

@k8s-ci-robot
Copy link
Contributor

@gnufied: GitHub didn't allow me to assign the following users: for, approval.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @saad-ali for approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM one question

// deleted pod that should be gone and volumes in it should be removed from dswp if they are not deleted after DeletionGracePeriod seconds
// this can happen if a node is lost, and the kubelet is never able to confirm deletion.
func isPodNeededToBeRemoved(pod *v1.Pod) bool {
if pod.DeletionTimestamp != nil && pod.DeletionGracePeriodSeconds != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about:

podStatus.Phase == v1.PodFailed || podStatus.Phase == v1.PodSucceeded

If pod errors out or completes but hasn't been deleted yet? IsPodTerminated(...) currently handles that as a termination state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pod/Volumes in termination state should be detached.
If the assumption above is right, it is ok here because both IsPodTerminated(...) and isPodNeededToBeRemoved are called in DetermineVolumeAction.
Correct me please if i am wrong, @saad-ali

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saad-ali yeah IsPodTerminated handles those as terminated state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, true, in those cases the third statement isn't checked.

@shay-berman
Copy link

please review my comment -> #65392 (comment) and let me know what do you think.

@gnufied
Copy link
Member Author

gnufied commented Aug 31, 2018

Based on ongoing discussion in linked github issue.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 31, 2018
@fejta
Copy link
Contributor

fejta commented Sep 10, 2018

/uncc

@k8s-ci-robot k8s-ci-robot removed the request for review from fejta September 10, 2018 13:39
@k8s-github-robot
Copy link

/test all

Tests are more than 96 hours old. Re-running tests.

@gnufied
Copy link
Member Author

gnufied commented Sep 12, 2018

/close

@k8s-ci-robot
Copy link
Contributor

@gnufied: Closing this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shay-berman
Copy link

@gnufied what is the mitigation of this closed PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.