Add activeDeadlineSeconds to kubeadm upgrade-health-check job #127333
Conversation
Welcome @yuyabee!
Hi @yuyabee. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
thanks for the PR, this seems mostly OK, but I need to double-check a couple of aspects to make sure we are not introducing more complexity...
/cc @carlory
```diff
@@ -141,6 +141,7 @@ func createJob(client clientset.Interface, cfg *kubeadmapi.ClusterConfiguration)
 	Spec: batchv1.JobSpec{
 		BackoffLimit:            ptr.To[int32](0),
 		TTLSecondsAfterFinished: ptr.To[int32](int32(timeout.Seconds()) + 5), // Make sure it's more than 'timeout'.
+		ActiveDeadlineSeconds:   ptr.To[int64](int64(timeout.Seconds())),
```
just to confirm:
kubernetes/staging/src/k8s.io/apiserver/pkg/apis/example/v1/types.go (lines 127 to 131 in a7702cb):

```go
	// Optional duration in seconds the pod may be active on the node relative to
	// StartTime before the system will actively try to mark it failed and kill associated containers.
	// Value must be a positive integer.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty" protobuf:"varint,5,opt,name=activeDeadlineSeconds"`
```
this means that if there is an error such as "image cannot be pulled" the job will fail?
that would be the goal.
we don't want the job to succeed because of the Deadline field here. Ideally it should always fail if this timeout is reached.
Another question: should it be a value between timeout and timeout + 5? Is it OK to match the value to the timeout of the poll?
> this means that if there is an error such as "image cannot be pulled" the job will fail?
Yes, and it should always fail if this timeout is reached.
For more details: once `ActiveDeadlineSeconds` elapses after the start timestamp, the pod fails and the job gets the following condition, which marks it for garbage collection thanks to `TTLSecondsAfterFinished`. For example, if the image is not available:
```yaml
status:
  conditions:
  - lastProbeTime: xxx + 20 + something
    lastTransitionTime: xxx + 20
    message: Job was active longer than specified deadline
    reason: DeadlineExceeded
    status: "True"
    type: Failed
  failed: 1
  ready: 0
  startTime: xxx
  terminating: 0
  uncountedTerminatedPods: {}
```
Notice the lastTransitionTime is 20 seconds after the startTime with activeDeadlineSeconds set to 20.
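For illustration, here is a minimal Go sketch (not actual kubeadm code; the helper name is hypothetical) of how a client could detect this specific failure by matching the `Failed` condition and the `DeadlineExceeded` reason shown above:

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isDeadlineExceeded reports whether the Job failed specifically because
// activeDeadlineSeconds elapsed, matching the status shown above.
// Hypothetical helper for illustration only.
func isDeadlineExceeded(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed &&
			c.Status == corev1.ConditionTrue &&
			c.Reason == "DeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	_ = isDeadlineExceeded(&batchv1.Job{})
}
```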
In a success case, the `Complete` condition type will make kubeadm mark this preflight check as passed:
```yaml
status:
  completionTime: xxx + a few sec
  conditions:
  - lastProbeTime: xxx + a few sec
    lastTransitionTime: xxx + a few sec
    status: "True"
    type: Complete
  ready: 0
  startTime: xxx
  succeeded: 1
  terminating: 0
  uncountedTerminatedPods: {}
```
> Another question: should it be a value between timeout and timeout + 5? Is it OK to match the value to the timeout of the poll?
Let me make this change quickly (and also consolidate the two values into one shared variable).
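As a rough sketch of that consolidation (function and variable names here are hypothetical, not the actual kubeadm code), one shared timeout value would drive both fields:

```go
package main

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// healthCheckJobSpec illustrates one shared timeout driving both fields:
// the job is forcibly failed after 'timeout', and the TTL is kept slightly
// longer so the terminal job can still be observed before garbage collection.
// Hypothetical helper, not the actual kubeadm implementation.
func healthCheckJobSpec(timeout time.Duration) batchv1.JobSpec {
	return batchv1.JobSpec{
		// No retries: a single pod failure fails the whole check.
		BackoffLimit: ptr.To[int32](0),
		// Make sure it's more than 'timeout' so the finished job
		// survives long enough to be inspected before GC removes it.
		TTLSecondsAfterFinished: ptr.To[int32](int32(timeout.Seconds()) + 5),
		// Fail the job once 'timeout' elapses, even if the pod is stuck
		// in a non-terminal state such as ErrImagePull.
		ActiveDeadlineSeconds: ptr.To[int64](int64(timeout.Seconds())),
	}
}

func main() {
	_ = healthCheckJobSpec(15 * time.Second) // matches the CreateJob poll timeout
}
```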
/release-note-edit
^ is that an accurate release note?
/ok-to-test
With kubernetes#122079, kubeadm now relies on `ttlSecondsAfterFinished` to clean up the `upgrade-health-check` job once its pod reaches a terminal state. However, there is a case where the pod never reaches a terminal state, so the job never registers one either, and hence no garbage collection happens. For example, if the pause image is not present, `ErrImagePull` will make the pod keep retrying the pull, the pod will never reach a terminal state on its own, and the job will wait indefinitely for a terminal state that never comes. So we need to set `activeDeadlineSeconds` to prevent the job from waiting forever. Without this, users invoking `kubeadm upgrade plan` need to clean up the job outside of kubeadm even if they ignore the preflight result, because the job keeps running when the result is configured to be ignored via the `--ignore-preflight-errors=CreateJob` flag. Since the timeout for the polling in the `CreateJob` step in kubeadm is 15 seconds, we should set `activeDeadlineSeconds` to the same value.
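To make the timing relationship concrete, here is a hedged sketch of a CreateJob-style poll (names and structure are assumptions, not the real kubeadm code): the client-side poll and the server-side `activeDeadlineSeconds` share the same 15-second budget, so both give up together instead of leaving an orphaned job behind:

```go
package main

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForHealthCheckJob polls until the job reaches a terminal condition or
// 'timeout' expires. With activeDeadlineSeconds set to the same value, the
// job also fails server-side at that point, so nothing is left running.
func waitForHealthCheckJob(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			job, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // transient API errors: keep polling
			}
			for _, c := range job.Status.Conditions {
				if c.Status != corev1.ConditionTrue {
					continue
				}
				switch c.Type {
				case batchv1.JobComplete:
					return true, nil // preflight check passed
				case batchv1.JobFailed:
					return false, fmt.Errorf("health check job failed: %s", c.Message)
				}
			}
			return false, nil
		})
}

func main() {
	_ = waitForHealthCheckJob // sketch only; needs a configured clientset to run
}
```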
(force-pushed from dd3317a to db66416)
The release note looks good to me, thank you!
/lgtm
LGTM label has been added. Git tree hash: 0713f08ae41bb61a69b0417501fba3e4395d22ea
thanks
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: neolit123, yuyabee. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
thanks
…333-upstream-release-1.30 Automated cherry pick of #127333: Add activeDeadlineSeconds to kubeadm upgrade-health-check job
…333-upstream-release-1.31 Automated cherry pick of #127333: Add activeDeadlineSeconds to kubeadm upgrade-health-check job
With #122079, kubeadm now relies on `ttlSecondsAfterFinished` to clean up `upgrade-health-check` once its pod reaches a terminal state. However, there is a case where the pod won't reach a terminal state and the job will not register a terminal state, hence no garbage collection.

For example, if the pause image is not present, `ErrImagePull` will make the pod keep retrying to pull the image, and the pod will never reach a terminal state on its own. The job will then continue to wait for the pod to reach a terminal state, which will not happen.

So we need to set `activeDeadlineSeconds` to prevent the job from waiting forever for the pod to reach a terminal state.

Without this, users invoking `kubeadm upgrade plan` need to clean up the job outside of kubeadm even if they ignore the preflight result, because the job still runs when the result is configured to be ignored via the `--ignore-preflight-errors=CreateJob` flag.

Since the timeout for the polling in the `CreateJob` step in kubeadm is 15 seconds, we should set `activeDeadlineSeconds` to the same timeout.

What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: