Improve Job Controller Performance #127228
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Great to see the performance improvement!
```
@@ -772,6 +766,8 @@ func (jm *Controller) syncJob(ctx context.Context, key string) (rErr error) {
	// if job was finished previously, we don't want to redo the termination
	if util.IsJobFinished(&job) {
		// The job shouldn't be marked as finished until all pod finalizers are removed.
```
This is generally true and protected by the code here:
kubernetes/pkg/controller/job/job_controller.go, lines 1398 to 1403 at abc0568:
```go
if uncounted := job.Status.UncountedTerminatedPods; uncounted != nil {
	if count := len(uncounted.Succeeded) + len(uncounted.Failed); count > 0 {
		logger.V(4).Info("Delaying marking the Job as finished, because there are still uncounted pod(s)", "job", klog.KObj(job), "condition", jobCtx.finishedCondition.Type, "count", count)
		return false
	}
}
```
So, we may not need this code here, unless we motivate it as a safeguard in case of a bug, or a user setting a terminal condition on the Job via the API.
It is a safeguard
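(For readers following along, here is a minimal sketch of what "safeguard" means in this thread: even on the early-return path for an already-finished Job, make a best-effort pass to strip leftover pod finalizers, in case a bug or a terminal condition set directly via the API violated the normal ordering. isJobFinished mirrors the stock util.IsJobFinished check; syncFinishedJob and the removeFinalizers callback are illustrative names for this sketch, not the controller's actual code.)

```go
package job

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobFinished mirrors util.IsJobFinished: a Job is finished once it carries
// a Complete or Failed condition with status True.
func isJobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// syncFinishedJob sketches the safeguard: on the early-return path for an
// already-finished Job, still make a best-effort pass to remove any leftover
// pod tracking finalizers. removeFinalizers is a hypothetical callback
// standing in for the controller's real cleanup logic.
func syncFinishedJob(ctx context.Context, job *batchv1.Job, removeFinalizers func(context.Context, *batchv1.Job) error) error {
	if !isJobFinished(job) {
		return nil
	}
	// Normally every pod finalizer is removed before the Job is marked
	// finished, so this is a no-op; it only matters if a bug (or a terminal
	// condition set directly via the API) broke that invariant.
	if err := removeFinalizers(ctx, job); err != nil {
		return fmt.Errorf("removing leftover pod finalizers: %w", err)
	}
	return nil
}
```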
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
/retest
Also, I have noticed from my tests that,
@alculquicondor - FYI ^^^
And what was the number before the change?
I think that's what I was saying in my previous comment: it is 500k (from what I heard from David); from what I have seen in my tests without the changes, for the same load it is around 122k-ish. @alculquicondor Also, what this tells us is that there is still value in the delete event handler sync path as well, as sketched below. :D
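(For context, a rough sketch of the kind of inline work an unchanged delete handler still performs: listing the Job's pods and patching each one to drop its tracking finalizer, i.e. O(pods) API round-trips inside the event handler itself. The function name, helper wiring, and patch shape are all illustrative assumptions, not the Job controller's exact code.)

```go
package job

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cleanupPodFinalizersInline lists every pod of the Job and patches each one
// to remove a finalizer. Doing this inside an informer event handler blocks
// event delivery for as long as all the API calls take, which is why this
// kind of work dominates the handler's p99 at 10k+ jobs.
func cleanupPodFinalizersInline(ctx context.Context, client kubernetes.Interface, job *batchv1.Job) error {
	selector := metav1.FormatLabelSelector(job.Spec.Selector)
	pods, err := client.CoreV1().Pods(job.Namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// Remove the tracking finalizer (index 0 is assumed purely for
		// brevity; real code would locate the finalizer by name).
		patch := []byte(`[{"op":"remove","path":"/metadata/finalizers/0"}]`)
		if _, err := client.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name, types.JSONPatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("removing finalizer from %s: %w", pod.Name, err)
		}
	}
	return nil
}
```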
Ok, so we effectively reduced the load to half. Great!
/lgtm
LGTM label has been added. Git tree hash: ea4184a9e1871d6f2c7274cc8063f9edcf5cc5fa
Please add a release note
/remove-kind regression |
/lgtm
Thanks @hakuna-matatah
Nice work!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, hakuna-matatah, mengqiy. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@alculquicondor This PR doesn't have any user-facing change, so I guess we may not need a release note.
Performance improvements are user-facing.
The release note doesn't sound accurate and it should just focus on the user-facing behavior, as opposed to internal details (we have the PR description for that).
/release-note-edit
What type of PR is this?
/kind feature
/kind regression
What this PR does / why we need it:
Following up on this - #126510 (comment)
This PR helps improve the Job controller performance in the sync path.

With this change, I have seen the p99 of the Job controller's update event handler sync path improve by at least 128X for a cluster with 10k+ jobs: a cluster with similar load without this change has a Job update event handler p99 of 13.02 ms, versus a p99 of 0.1024 ms with this change, i.e. roughly 12600% faster.

Comparing the update sync path of the Job controller with the ttl-finished controller, which pretty much just acts on incoming job event notifications: for roughly 880K update events handled, the two controllers' p99s now look almost identical, i.e. the p99 for the job update event handler is between 0.0512 ms and 0.1024 ms, while the ttl-finished-controller update event handler is < 0.0512 ms. A sketch of the underlying handler pattern follows below.
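(For illustration, a minimal sketch of the pattern this kind of improvement relies on, using standard client-go primitives: the informer event handlers stay cheap by only deriving a key and enqueueing it, and anything expensive, like listing pods or removing finalizers, happens later in the sync loop that drains the queue. The function name and wiring are assumptions for the sketch, not the Job controller's actual code; workqueue.TypedRateLimitingInterface requires a recent client-go.)

```go
package job

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// registerJobHandlers keeps the event handlers O(1): compute a key, enqueue
// it, return. The controller's workers pick the key up from the queue and do
// the heavy lifting there instead.
func registerJobHandlers(informer cache.SharedIndexInformer, queue workqueue.TypedRateLimitingInterface[string]) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key) // cheap: no pod listing or API calls here
			}
		},
		DeleteFunc: func(obj interface{}) {
			// DeletionHandlingMetaNamespaceKeyFunc also unwraps tombstones.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}
```

Keeping handlers this small matters because a slow handler delays delivery of every subsequent event to that handler, so inline per-pod API work shows up directly in the handler-duration p99.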
Which issue(s) this PR fixes:
Fixes #126510
Special notes for your reviewer:

I haven't made changes to the job delete event handler. See the difference in performance between the changed Job update handler and the unchanged job delete event handler in the data below.

The p99 for the job delete event handler is roughly around 26.2144 ms for processing 178K notifications. Performance has improved by approximately 255 times when comparing 26.2144 ms to 0.1024 ms (though one handler is update and the other is delete).

Thus, from the above data we can conclude that the major bottleneck in the Job controller's sync path is cleanUpPodFinalizers; taking it out of the event handler improves performance by at least double digits for 10k jobs or more and brings the handler duration close to the ttl-finished-controller's.

Does this PR introduce a user-facing change?