Improve Job Controller Performance #127228
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Great to see the performance improvement!
```
@@ -772,6 +766,8 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (rErr error) {
	// if job was finished previously, we don't want to redo the termination
	if util.IsJobFinished(&job) {
		// The job shouldn't be marked as finished until all pod finalizers are removed.
```
This is generally true and protected by the code here:
kubernetes/pkg/controller/job/job_controller.go, lines 1398 to 1403 at abc0568:
```go
if uncounted := job.Status.UncountedTerminatedPods; uncounted != nil {
	if count := len(uncounted.Succeeded) + len(uncounted.Failed); count > 0 {
		logger.V(4).Info("Delaying marking the Job as finished, because there are still uncounted pod(s)", "job", klog.KObj(job), "condition", jobCtx.finishedCondition.Type, "count", count)
		return false
	}
}
```
So, we may not need this code here, unless we motivate it as a safeguard in case of a bug, or a user setting a terminal condition on the Job via the API.
It is a safeguard
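(For readers following along, here is a minimal sketch of what "safeguard" means in this thread: even on the early-return path for an already-finished Job, make a best-effort pass to strip leftover pod finalizers, in case a bug or a terminal condition set directly via the API violated the normal ordering. isJobFinished mirrors the stock util.IsJobFinished check; syncFinishedJob and the removeFinalizers callback are illustrative names for this sketch, not the controller's actual code.)

```go
package job

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobFinished mirrors util.IsJobFinished: a Job is finished once it carries
// a Complete or Failed condition with status True.
func isJobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// syncFinishedJob sketches the safeguard: on the early-return path for an
// already-finished Job, still make a best-effort pass to remove any leftover
// pod tracking finalizers. removeFinalizers is a hypothetical callback
// standing in for the controller's real cleanup logic.
func syncFinishedJob(ctx context.Context, job *batchv1.Job, removeFinalizers func(context.Context, *batchv1.Job) error) error {
	if !isJobFinished(job) {
		return nil
	}
	// Normally every pod finalizer is removed before the Job is marked
	// finished, so this is a no-op; it only matters if a bug (or a terminal
	// condition set directly via the API) broke that invariant.
	if err := removeFinalizers(ctx, job); err != nil {
		return fmt.Errorf("removing leftover pod finalizers: %w", err)
	}
	return nil
}
```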
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
/retest
Also, I have noticed from my tests that,
@alculquicondor - FYI ^^^
And what was the number before the change?
I think that's what I was saying in my previous comment: it is 500k (from what I heard from David); from what I have seen in my tests without the changes, for the same load it is around 122k-ish. @alculquicondor Also, what this tells us is that there is still value in the delete event handler sync path as well, as sketched below. :D
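(For context, a rough sketch of the kind of inline work an unchanged delete handler still performs: listing the Job's pods and patching each one to drop its tracking finalizer, i.e. O(pods) API round-trips inside the event handler itself. The function name, helper wiring, and patch shape are all illustrative assumptions, not the Job controller's exact code.)

```go
package job

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cleanupPodFinalizersInline lists every pod of the Job and patches each one
// to remove a finalizer. Doing this inside an informer event handler blocks
// event delivery for as long as all the API calls take, which is why this
// kind of work dominates the handler's p99 at 10k+ jobs.
func cleanupPodFinalizersInline(ctx context.Context, client kubernetes.Interface, job *batchv1.Job) error {
	selector := metav1.FormatLabelSelector(job.Spec.Selector)
	pods, err := client.CoreV1().Pods(job.Namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// Remove the tracking finalizer (index 0 is assumed purely for
		// brevity; real code would locate the finalizer by name).
		patch := []byte(`[{"op":"remove","path":"/metadata/finalizers/0"}]`)
		if _, err := client.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name, types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("removing finalizer from %s: %w", pod.Name, err)
		}
	}
	return nil
}
```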
Ok, so we effectively reduced the load to half. Great!
/lgtm
LGTM label has been added. Git tree hash: ea4184a9e1871d6f2c7274cc8063f9edcf5cc5fa
Please add a release note
/remove-kind regression |
/lgtm
Thanks @hakuna-matatah
Nice work!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, hakuna-matatah, mengqiy. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@alculquicondor This PR doesn't have any user-facing change, so I guess we may not need a release note.
Performance improvements are user-facing.
The release note doesn't sound accurate and it should just focus on the user-facing behavior, as opposed to internal details (we have the PR description for that).
/release-note-edit
What type of PR is this?
/kind feature
/kind regression
What this PR does / why we need it:
Following up on this - #126510 (comment)
This PR helps improve the Job controller performance in the sync path.

With this change, I have seen the p99 of the Job controller's update event handler sync path improve by at least 128X for a cluster with 10k+ jobs: a cluster with similar load without this change has a Job update event handler p99 of 13.02 ms, versus a p99 of 0.1024 ms with this change, i.e. roughly 12600% faster.

Comparing the update sync path of the Job controller with the ttl-finished controller, which pretty much just acts on incoming job event notifications: for roughly 880K update events handled, the two controllers' p99s now look almost identical, i.e. the p99 for the job update event handler is between 0.0512 ms and 0.1024 ms, while the ttl-finished-controller update event handler is < 0.0512 ms. A sketch of the underlying handler pattern follows below.
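(For illustration, a minimal sketch of the pattern this kind of improvement relies on, using standard client-go primitives: the informer event handlers stay cheap by only deriving a key and enqueueing it, and anything expensive, like listing pods or removing finalizers, happens later in the sync loop that drains the queue. The function name and wiring are assumptions for the sketch, not the Job controller's actual code; workqueue.TypedRateLimitingInterface requires a recent client-go.)

```go
package job

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// registerJobHandlers keeps the event handlers O(1): compute a key, enqueue
// it, return. The controller's workers pick the key up from the queue and do
// the heavy lifting there instead.
func registerJobHandlers(informer cache.SharedIndexInformer, queue workqueue.TypedRateLimitingInterface[string]) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key) // cheap: no pod listing or API calls here
			}
		},
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also unwraps tombstones.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}
```

Keeping handlers this small matters because a slow handler delays delivery of every subsequent event to that handler, so inline per-pod API work shows up directly in the handler-duration p99.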
Which issue(s) this PR fixes:
Fixes #126510
Special notes for your reviewer:

I haven't made changes to the job delete event handler. See the difference in performance between the changed Job update handler and the unchanged job delete event handler in the data below.

The p99 for the job delete event handler is roughly around 26.2144 ms for processing 178K notifications. Performance has improved by approximately 255 times when comparing 26.2144 ms to 0.1024 ms (though one handler is update and the other is delete).

Thus, from the above data we can conclude that the major bottleneck in the Job controller's sync path is cleanUpPodFinalizers; taking it out of the event handler improves performance by at least double digits for 10k jobs or more and brings the handler duration close to the ttl-finished-controller's.

Does this PR introduce a user-facing change?