Stop StartupProbe explicitly when successThreshold is reached #121206

mochizuki875 · 2023-10-13T02:55:52Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

StartupProbe runs while a container is starting and, if it succeeds, updates the container status to Started=true.
Once the container status becomes Started=true, StartupProbe is stopped and the Readiness/Liveness probes are executed.
The probe and the kubelet main loop that updates the container status run asynchronously: the probe goroutine requests the container status update from the main loop via a channel.
Because of this asynchrony, StartupProbe may be executed more times than successThreshold. (#117153)

In this PR, StartupProbe is stopped explicitly when successThreshold is reached.
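
The race described above can be illustrated with a small, deterministic sketch. It is a simplified model under stated assumptions, not kubelet code: `countProbeRuns`, `holdOnSuccess`, and `syncDelay` are hypothetical names, the probe always succeeds, and the main loop is simulated by draining a channel after `syncDelay` ticks.

```go
package main

import "fmt"

// countProbeRuns is a hypothetical model of the race, not kubelet code.
// The probe loop publishes "started" over a channel, but the flag it reads
// back is only refreshed when the simulated main loop drains that channel,
// which happens after syncDelay ticks.
func countProbeRuns(holdOnSuccess bool, syncDelay int) int {
	updates := make(chan bool, syncDelay+1) // status updates waiting for the main loop
	started := false                        // container status as the probe loop sees it
	onHold := false
	runs := 0

	for tick := 0; tick <= syncDelay; tick++ {
		if tick == syncDelay {
			// The main loop finally processes the queued updates.
			for len(updates) > 0 {
				started = <-updates
			}
		}
		if started || onHold {
			continue // the probe is skipped on this tick
		}
		runs++          // the probe executes and (in this model) always succeeds
		updates <- true // ask the main loop to mark the container as started
		if holdOnSuccess {
			onHold = true // stop right away instead of waiting for the status update
		}
	}
	return runs
}

func main() {
	fmt.Println(countProbeRuns(false, 3)) // without the explicit stop
	fmt.Println(countProbeRuns(true, 3))  // with the explicit stop
}
```

With a delay of 3 ticks this prints 3 and then 1: without the explicit stop, the probe keeps running until the status update is processed, while with it the probe runs exactly once.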

Which issue(s) this PR fixes:

Fixes #117153

Special notes for your reviewer:

Does this PR introduce a user-facing change?

`StartupProbe` is stopped explicitly when `successThreshold` is reached.
This eliminates the problem of `StartupProbe` being executed more times than `successThreshold`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2023-10-13T02:56:03Z

Hi @mochizuki875. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mochizuki875 · 2023-10-13T02:56:21Z

/sig node

mochizuki875 · 2023-10-13T02:57:50Z

/cc @SergeyKanzhelev

HirazawaUi · 2023-10-17T14:34:17Z

/ok-to-test

bart0sh · 2023-10-23T17:29:08Z

/triage accepted
/priority important-soon

haircommander · 2023-10-26T17:21:49Z

pkg/kubelet/prober/worker_test.go

+		expectContinue(t, w, w.doProbe(ctx), msg)
+		expectResult(t, w, results.Success, msg)
+		expectResultRun(t, w, 0, msg)
+	}


Are there any tests we can add down here to prove the prober stopped and doesn't continue? From what I can see, we only test that it continues before the threshold is met, but we don't explicitly test that it has stopped.

@haircommander

Thanks!

In this test case, successThreshold is set to 1, so:
i=0: successThreshold is not met, so the probe should be executed
i=1 (=successThreshold): successThreshold is met, so the probe should not be executed

I've added a check, using w.onHold and some comments, for whether the probe will be executed in each case.
Once successThreshold is reached, w.onHold becomes true and the probe is not executed anymore.
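
A minimal sketch of the kind of check described in this reply, assuming a counting fake in place of the real prober. `fakeWorker`, `probeCalls`, and `callsPerIteration` are hypothetical names for illustration; only `onHold` comes from the discussion, and this is not the actual test in worker_test.go.

```go
package main

import "fmt"

// fakeWorker is a hypothetical stand-in for the prober worker.
type fakeWorker struct {
	successThreshold int
	resultRun        int
	onHold           bool
	probeCalls       int // how many times the probe itself was actually executed
}

// doProbe runs one always-successful startup probe unless the worker is on hold.
func (w *fakeWorker) doProbe() {
	if w.onHold {
		return // threshold already reached: the probe must not run again
	}
	w.probeCalls++
	w.resultRun++
	if w.resultRun >= w.successThreshold {
		w.onHold = true
	}
}

// callsPerIteration invokes doProbe n times on a worker with
// successThreshold=1 and records the cumulative probe count after each call.
func callsPerIteration(n int) []int {
	w := &fakeWorker{successThreshold: 1}
	counts := make([]int, 0, n)
	for i := 0; i < n; i++ {
		w.doProbe()
		counts = append(counts, w.probeCalls)
	}
	return counts
}

func main() {
	fmt.Println(callsPerIteration(2)) // prints [1 1]
}
```

The count stays at 1 after the second iteration, which is the observable evidence that the probe was not executed again once the threshold was reached.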

k8s-triage-robot · 2024-08-18T15:12:19Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

HirazawaUi · 2024-08-18T15:13:55Z

/remove-lifecycle stale

HirazawaUi · 2024-08-18T16:31:06Z

pkg/kubelet/prober/worker.go

@@ -316,11 +316,14 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {

 	w.resultsManager.Set(w.containerID, result, w.pod)

-	if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
+	if (w.probeType == liveness && result == results.Failure) || (w.probeType == startup && (result == results.Success || result == results.Failure)) {


Suggested change

-	if (w.probeType == liveness && result == results.Failure) || (w.probeType == startup && (result == results.Success || result == results.Failure)) {
+	if (w.probeType == liveness && result == results.Failure) || (w.probeType == startup && result != results.Unknown) {

or

Suggested change

-	if (w.probeType == liveness && result == results.Failure) || (w.probeType == startup && (result == results.Success || result == results.Failure)) {
+	if (w.probeType == liveness && result == results.Failure) || w.probeType == startup {

After the w.probeManager.prober.probe() is executed, will the startup probe still enter the unknown state?

As long as the startup probe exceeds its FailureThreshold or SuccessThreshold, regardless of whether it succeeds or fails, we will stop executing it. Is that correct?

@HirazawaUi
Thank you for your comment.

> After the w.probeManager.prober.probe() is executed, will the startup probe still enter the unknown state?

From my check of the codebase, w.probeManager.prober.probe() returns an err when the state is unknown.
So I think it's enough to check w.probeType == startup.

> As long as the startup probe exceeds its FailureThreshold or SuccessThreshold, regardless of whether it succeeds or fails, we will stop executing it. Is that correct?

As far as I can tell from checking the doc, that seems correct.
Do you have any thoughts on this?

It looks good to me now.
/lgtm
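
The three forms of the condition discussed in this thread can be compared with a small sketch. The `result` and `probeType` types below are hypothetical stand-ins for the kubelet ones, and the function names are made up for illustration. It assumes, as stated in the reply above, that the unknown state never reaches this condition.

```go
package main

import "fmt"

type result int

const (
	unknown result = iota
	success
	failure
)

type probeType int

const (
	liveness probeType = iota
	readiness
	startup
)

// The three candidate conditions from the review thread, as pure functions.
func explicit(p probeType, r result) bool {
	return (p == liveness && r == failure) || (p == startup && (r == success || r == failure))
}

func notUnknown(p probeType, r result) bool {
	return (p == liveness && r == failure) || (p == startup && r != unknown)
}

func typeOnly(p probeType, r result) bool {
	return (p == liveness && r == failure) || p == startup
}

// allAgree reports whether the three variants give the same answer for every
// probe type and every result in rs.
func allAgree(rs []result) bool {
	for _, p := range []probeType{liveness, readiness, startup} {
		for _, r := range rs {
			e := explicit(p, r)
			if e != notUnknown(p, r) || e != typeOnly(p, r) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(allAgree([]result{success, failure}))          // true
	fmt.Println(allAgree([]result{unknown, success, failure})) // false
}
```

In this model the variants only differ for a startup probe with an unknown result, so the shortest form is equivalent as long as that state is filtered out earlier.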

k8s-ci-robot · 2024-08-26T13:45:00Z

LGTM label has been added.

Git tree hash: 0f29de3495b1cc4331ccf56b5e4b89ea5144eb11

pacoxu · 2024-10-17T10:04:31Z

/assign @matthyx @SergeyKanzhelev

matthyx · 2024-10-17T10:08:52Z

yes this works
/lgtm

SergeyKanzhelev · 2024-10-17T22:02:49Z

pkg/kubelet/prober/worker.go

@@ -316,11 +316,14 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {

 	w.resultsManager.Set(w.containerID, result, w.pod)

-	if (w.probeType == liveness || w.probeType == startup) && result == results.Failure {
+	if (w.probeType == liveness && result == results.Failure) || w.probeType == startup {


I don't understand this. So for startup probe we will stop after the first run? This condition doesn't check the result to be a failure.

@SergeyKanzhelev

Currently, w.onHold = true is set when the liveness or startup probe result is Failure.
However, because of the asynchrony described in this PR's description, w.onHold = true should also be set when the startup probe result is Success. Otherwise, the startup probe may be executed redundantly.

A similar discussion took place in:
#121206 (comment)
#121206 (comment)

> w.onHold = true should also be set when startup probe is Success.

What I see in code is that we set it unconditionally for startup type. Not when it is a startup AND succeeded. What am I missing here?

@SergeyKanzhelev

> What I see in code is that we set it unconditionally for startup type. Not when it is a startup AND succeeded.

Yes, that's right.
That is why I said:

> should also be set when startup probe is Success.

For the Startup Probe, w.onHold = true should be set in each of these cases:

failureThreshold is reached and the last result is Failure
successThreshold (must be 1) is reached and the last result is Success

The threshold check is done just before this, as shown below, and if the threshold has not been reached, it returns true.

kubernetes/pkg/kubelet/prober/worker.go

Lines 316 to 320 in aa8f287

	if (result == results.Failure && w.resultRun < int(w.spec.FailureThreshold)) ||
		(result == results.Success && w.resultRun < int(w.spec.SuccessThreshold)) {
		// Success or failure is below threshold - leave the probe state unchanged.
		return true
	}

Therefore, reaching this point means that one of the thresholds has been reached, so I think w.onHold = true should be set for the Startup Probe regardless of the result.
(Currently it is set only for Failure; it should be set for Success as well.)
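
The argument in this reply can be sketched as a two-stage check: the threshold gate first, then the hold decision. This is a trimmed-down model under stated assumptions, not the kubelet implementation. The `worker` type, `record`, and `holdAfter` are hypothetical, and the thresholds (successThreshold=1, failureThreshold=3) are chosen only for illustration.

```go
package main

import "fmt"

type result int

const (
	success result = iota
	failure
)

type probeType int

const (
	liveness probeType = iota
	startup
)

// worker is a hypothetical model; field names loosely follow the quoted
// snippet, but this is not the kubelet type.
type worker struct {
	probeType        probeType
	successThreshold int
	failureThreshold int
	lastResult       result
	resultRun        int
	onHold           bool
}

// record folds one probe result into the worker. It returns true while the
// result is still below its threshold (the early "return true" in the quoted
// snippet) and false once a threshold has been reached.
func (w *worker) record(r result) (belowThreshold bool) {
	if r == w.lastResult {
		w.resultRun++
	} else {
		w.lastResult = r
		w.resultRun = 1
	}
	if (r == failure && w.resultRun < w.failureThreshold) ||
		(r == success && w.resultRun < w.successThreshold) {
		return true
	}
	// A threshold has been reached: apply the condition from this PR.
	if (w.probeType == liveness && r == failure) || w.probeType == startup {
		w.onHold = true
	}
	return false
}

// holdAfter feeds a sequence of results to a fresh worker and reports the
// value of onHold after each one.
func holdAfter(p probeType, rs ...result) []bool {
	w := &worker{probeType: p, successThreshold: 1, failureThreshold: 3}
	out := make([]bool, 0, len(rs))
	for _, r := range rs {
		w.record(r)
		out = append(out, w.onHold)
	}
	return out
}

func main() {
	fmt.Println(holdAfter(startup, failure, failure, success)) // [false false true]
	fmt.Println(holdAfter(startup, failure, failure, failure)) // [false false true]
	fmt.Println(holdAfter(liveness, success))                  // [false]
}
```

In this model a startup probe is not put on hold after its first run unless that run already reaches a threshold, which is the point being made about the unconditional `w.probeType == startup` check.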

/lgtm
/approve

k8s-ci-robot · 2024-11-05T23:10:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mochizuki875, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [SergeyKanzhelev]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-11-05T23:58:36Z

@mochizuki875: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-e2e-kind-ipv6	`632f162`	link	unknown	`/test pull-kubernetes-e2e-kind-ipv6`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

SergeyKanzhelev · 2024-11-06T00:02:41Z

/restart

mochizuki875 · 2024-11-06T00:34:23Z

/retest

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 13, 2023

k8s-ci-robot requested review from odinuge and Random-Liu October 13, 2023 02:56

k8s-ci-robot added the area/kubelet label Oct 13, 2023

k8s-ci-robot requested a review from SergeyKanzhelev October 13, 2023 02:57

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 17, 2023

bart0sh reviewed Oct 20, 2023

haircommander reviewed Oct 26, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2024

HirazawaUi reviewed Aug 18, 2024

mochizuki875 added 4 commits August 26, 2024 02:01

Stop StartupProbe explicity when successThrethold is reached

aa9a9e8

add status check of startupProbe

11ddb97

add threshold check using w.onHold

ead2102

fix from rv comment

a262c80

mochizuki875 force-pushed the fix_startup_probe_117153 branch from bef809f to a262c80 Compare August 26, 2024 03:09

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2024

k8s-ci-robot requested review from haircommander and mrunalp August 26, 2024 03:09

fix unit test

632f162

k8s-ci-robot assigned HirazawaUi Aug 26, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2024

k8s-ci-robot assigned matthyx and SergeyKanzhelev Oct 17, 2024

SergeyKanzhelev reviewed Oct 17, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2024

k8s-ci-robot merged commit 2f3da71 into kubernetes:master Nov 6, 2024
13 of 14 checks passed

k8s-ci-robot added this to the v1.32 milestone Nov 6, 2024

mochizuki875 deleted the fix_startup_probe_117153 branch November 6, 2024 00:38
