
Improve Job Controller Performance #127228

Merged
merged 2 commits into kubernetes:master on Sep 11, 2024

Conversation

@hakuna-matatah (Contributor) commented Sep 9, 2024

What type of PR is this?

/kind feature
/kind regression

What this PR does / why we need it:

Following up on this - #126510 (comment)

This PR improves the Job controller's performance in the sync path.

With this change, the p99 latency of the Job controller's update event handler improved by at least 128x on a cluster with 10k+ Jobs: under similar load, p99 is 13.02 ms without the change versus 0.1024 ms with it, i.e. roughly 12,600% faster.

job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.1"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.4"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.8"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1.6"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6.4"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="12.8"} 116917
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="25.6"} 291934
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="51.2"} 404434
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="102.4"} 589226
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="204.8"} 625432
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="409.6"} 633231
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="819.2"} 635917
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1638.4"} 636703
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3276.8"} 647305
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6553.6"} 820939
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="13107.2"} 948647
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="26214.4"} 955181
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="52428.8"} 955999
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="+Inf"} 956053
job_controller_event_handling_duration_microseconds_sum{resource="jobs",type="update"} 2.148933816e+09
job_controller_event_handling_duration_microseconds_count{resource="jobs",type="update"} 956053
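For reference, the p99 figures quoted in this PR can be read straight off the cumulative bucket counts above: the p99 upper bound is the smallest `le` boundary whose cumulative count covers 99% of all observations. A minimal sketch in Go (the helper name is made up for illustration; the bucket data is copied from the with-change histogram below):

```go
package main

import "fmt"

// bucket mirrors one cumulative Prometheus histogram bucket:
// cumCount observations took <= le microseconds.
type bucket struct {
	le       float64
	cumCount uint64
}

// Cumulative buckets for the Job controller's update handler *with*
// the change; total observations: 863651.
var withChange = []bucket{
	{6.4, 132254}, {12.8, 384777}, {25.6, 742505}, {51.2, 850312},
	{102.4, 861308}, {204.8, 863176}, {409.6, 863448},
}

// p99UpperBound returns the smallest bucket boundary whose cumulative
// count covers at least 99% of observations, i.e. an upper bound on
// the true p99.
func p99UpperBound(buckets []bucket, total uint64) float64 {
	target := 0.99 * float64(total)
	for _, b := range buckets {
		if float64(b.cumCount) >= target {
			return b.le
		}
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 0.99 * 863651 = 855014.49; the first bucket at or above that is
	// le=102.4, matching the quoted p99 of 0.1024 ms.
	fmt.Printf("p99 <= %.1f us\n", p99UpperBound(withChange, 863651))
}
```

Running the same calculation over the histogram above (count 956053, target ≈ 946,492) lands in the 13107.2 µs bucket, which is where the ~13 ms p99 figure for the unpatched cluster comes from.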

Comparing the update sync path of the Job controller with the ttl-after-finished controller, which essentially just reacts to incoming Job event notifications:

job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.1"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.4"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.8"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1.6"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6.4"} 132254
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="12.8"} 384777
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="25.6"} 742505
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="51.2"} 850312
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="102.4"} 861308
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="204.8"} 863176
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="409.6"} 863448
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="819.2"} 863526
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1638.4"} 863584
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3276.8"} 863615
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6553.6"} 863629
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="13107.2"} 863639
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="26214.4"} 863650
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="52428.8"} 863651
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="+Inf"} 863651

job_controller_event_handling_duration_microseconds_sum{resource="jobs",type="update"} 1.4756581e+07
job_controller_event_handling_duration_microseconds_count{resource="jobs",type="update"} 863651

ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.1"} 47818
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.2"} 47818
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.4"} 47818
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="0.8"} 47818
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1.6"} 275487
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3.2"} 642007
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6.4"} 816296
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="12.8"} 871174
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="25.6"} 879438
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="51.2"} 880307
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="102.4"} 880444
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="204.8"} 880559
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="409.6"} 880621
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="819.2"} 880648
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="1638.4"} 880668
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="3276.8"} 880679
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="6553.6"} 880682
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="13107.2"} 880683
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="26214.4"} 880683
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="52428.8"} 880684
ttl_after_finished_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="update",le="+Inf"} 880684
ttl_after_finished_controller_event_handling_duration_microseconds_sum{resource="jobs",type="update"} 2.706518e+06
ttl_after_finished_controller_event_handling_duration_microseconds_count{resource="jobs",type="update"} 880684

We can see that, after handling roughly 880K update events each, the two controllers' p99 latencies are now nearly identical:

the p99 for the Job update event handler is between 0.0512 ms and 0.1024 ms,
and the p99 for the ttl-after-finished controller's update event handler is < 0.0512 ms.

Which issue(s) this PR fixes:

Fixes #126510

Special notes for your reviewer:

I haven't changed the Job delete event handler. Note the performance difference between the updated Job update handler and the unchanged Job delete handler; see the data below for the delete event handler:

job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="0.1"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="0.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="0.4"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="0.8"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="1.6"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="3.2"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="6.4"} 0
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="12.8"} 1
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="25.6"} 114
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="51.2"} 1229
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="102.4"} 5322
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="204.8"} 12645
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="409.6"} 18345
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="819.2"} 28267
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="1638.4"} 41339
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="3276.8"} 68261
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="6553.6"} 92852
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="13107.2"} 152230
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="26214.4"} 178060
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="52428.8"} 178870
job_controller_event_handling_duration_microseconds_bucket{resource="jobs",type="delete",le="+Inf"} 178896
job_controller_event_handling_duration_microseconds_sum{resource="jobs",type="delete"} 1.236442415e+09
job_controller_event_handling_duration_microseconds_count{resource="jobs",type="delete"} 178896

The p99 for the Job delete event handler, across roughly 178K notifications, is around 26.2144 ms.

That is roughly a 256x difference (26.2144 ms vs 0.1024 ms), though one handler processes updates and the other deletes.

From the data above, we can conclude that the major bottleneck in the Job controller's sync path is cleanUpPodFinalizers; this change improves update-handler performance by at least a double-digit factor for 10k+ Jobs and brings its duration close to that of the ttl-after-finished controller's handler.
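The general pattern behind this kind of fix — this is a sketch of the idea, not the actual diff in this PR — is to keep informer event handlers constant-time: compare the old and new objects cheaply, decide whether a sync is needed, and push all expensive work (such as pod finalizer cleanup) into the rate-limited sync loop. The predicate below is hypothetical and simplifies the real checks done on *batchv1.Job:

```go
package main

import "fmt"

// jobLike captures only the fields a cheap update handler inspects;
// in the real controller these would come from *batchv1.Job.
type jobLike struct {
	resourceVersion string
	finished        bool // a terminal Complete/Failed condition is set
}

// shouldEnqueueOnUpdate is a hypothetical O(1) predicate: skip no-op
// resyncs (unchanged resourceVersion) and jobs that were already
// terminal before and after the update, so each notification costs
// constant time regardless of cluster size.
func shouldEnqueueOnUpdate(old, cur jobLike) bool {
	if old.resourceVersion == cur.resourceVersion {
		return false // periodic resync; nothing changed
	}
	if old.finished && cur.finished {
		return false // already terminal; nothing left to reconcile here
	}
	return true // enqueue the key; the sync loop does the heavy lifting
}

func main() {
	changed := shouldEnqueueOnUpdate(
		jobLike{resourceVersion: "100", finished: false},
		jobLike{resourceVersion: "101", finished: false},
	)
	fmt.Println(changed) // an active job that changed should be enqueued
}
```

The key design point is that the handler itself never iterates over pods or touches the API server; anything proportional to the number of pods belongs in the workqueue-driven sync, where it is batched and rate-limited.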

Does this PR introduce a user-facing change?

Improve performance of the job controller when handling job update events.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 9, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 9, 2024
@hakuna-matatah (Contributor, Author)

/cc @mengqiy @alculquicondor @mimowo

@hakuna-matatah hakuna-matatah changed the title [WIP] Increase Job Controller Performance [WIP] Improve Job Controller Performance Sep 9, 2024
@mengqiy (Member) commented Sep 9, 2024

Great to see this performance improvement!
IMO we should also confirm the throughput improvement, e.g. running the repro in #126510 (comment) should no longer result in a huge pendingNotification backlog with this change.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 9, 2024
@hakuna-matatah hakuna-matatah changed the title [WIP] Improve Job Controller Performance Improve Job Controller Performance Sep 9, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 9, 2024
@@ -772,6 +766,8 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (rErr error) {

// if job was finished previously, we don't want to redo the termination
if util.IsJobFinished(&job) {
// The job shouldn't be marked as finished until all pod finalizers are removed.
Contributor:
This is generally true and protected by the code here:

if uncounted := job.Status.UncountedTerminatedPods; uncounted != nil {
    if count := len(uncounted.Succeeded) + len(uncounted.Failed); count > 0 {
        logger.V(4).Info("Delaying marking the Job as finished, because there are still uncounted pod(s)", "job", klog.KObj(job), "condition", jobCtx.finishedCondition.Type, "count", count)
        return false
    }
}

So we may not need this code here, unless we motivate it as a safeguard against a bug, or against a user setting a terminal condition on the Job via the API.

Member:
It is a safeguard

Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
@hakuna-matatah (Contributor, Author)

/retest

@hakuna-matatah (Contributor, Author) commented Sep 10, 2024

Also, in my tests the pending_notifications count with this change is around 60k for 10k+ CronJobs, as opposed to the 500k+ that @mengqiy reported for 10k CronJobs.

# TYPE cache_pending_notifications gauge
cache_pending_notifications{name="job-controller",resources="job"} 56499
cache_pending_notifications{name="job-controller",resources="pod"} 0
cache_pending_notifications{name="ttl-after-finished-controller",resources="job"} 0

@alculquicondor - FYI ^^^

@alculquicondor (Member)

And what was the number before the change?

@hakuna-matatah (Contributor, Author) commented Sep 11, 2024

And what was the number before the change?

That's what I was saying in my previous comment: it is ~500k (from what I heard from David); in my own tests without the change, the same load produced around 122k. @alculquicondor

This also tells us there is still value in optimizing the delete event handler's sync path as well. :D

@alculquicondor (Member)

ok, so we effectively reduced the load to half. Great!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: ea4184a9e1871d6f2c7274cc8063f9edcf5cc5fa

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2024
@alculquicondor (Member)

Please add a release note

@alculquicondor (Member)

/remove-kind regression
/kind feature

@k8s-ci-robot k8s-ci-robot removed the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Sep 11, 2024
@mengqiy (Member) left a comment:

/lgtm
Thanks @hakuna-matatah
Nice work!

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, hakuna-matatah, mengqiy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mengqiy (Member) commented Sep 11, 2024

@alculquicondor This PR doesn't have any user facing change, so I guess we may not need a release note.
@hakuna-matatah If that's the case, we can just add an empty release note block in the PR description.

@alculquicondor (Member)

This PR doesn't have any user facing change, so I guess we may not need a release note.

Performance improvements are user facing.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. release-note-none Denotes a PR that doesn't merit a release note. labels Sep 11, 2024
@alculquicondor (Member)

The release note doesn't sound accurate and it should just focus on the user-facing behavior, as opposed to internal details (we have the PR description for that).

/release-note-edit

Improve performance of the job controller when handling job update events.

@k8s-ci-robot k8s-ci-robot merged commit 60cbbdf into kubernetes:master Sep 11, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Sep 11, 2024
Successfully merging this pull request may close these issues.

The event handlers of job controller are slow