[Controller Manager] Fix job controller hot loop #32785
Conversation
Can a kubernetes member verify that this patch is reasonable to test? If so, please reply with "ok to test" on its own line. Regular contributors should join the org to skip this step. While we transition away from the Jenkins GitHub PR Builder plugin, "ok to test" commenters will need to be on the admin list defined in this file.
@@ -167,7 +173,7 @@ func (jm *JobController) getPodJob(pod *api.Pod) *batch.Job {
		return nil
	}
	if len(jobs) > 1 {
-		glog.Errorf("user error! more than one job is selecting pods with labels: %+v", pod.Labels)
+		utilruntime.HandleError(fmt.Errorf("user error! more than one job is selecting pods with labels: %+v", pod.Labels))
@soltysh why do we have a message here that only a cluster-admin can see, but only a project-admin can fix?
This is for the manual label selector which will soon be gone when I remove extensions.Job. Backwards compatibility, mainly.
Should I keep the error message here or just remove it?
Keep it for now, extensions.Jobs are not removed, yet.
Okay
@@ -271,27 +277,35 @@ func (jm *JobController) enqueueController(obj interface{}) {
	// all controllers there will still be some replica instability. One way to handle this is
	// by querying the store for all controllers that this rc overlaps, as well as all
	// controllers that overlap this rc, and sorting them.
-	jm.queue.Add(key)
+	jm.queue.AddRateLimited(key)
This shouldn't be ratelimited. This is a primary add
Yes, you are right, will fix.
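The distinction the reviewer is drawing can be sketched with a toy queue. The names mirror the real workqueue's methods, but the implementation below is an illustrative stand-in, not the actual k8s.io/kubernetes/pkg/util/workqueue code: Add enqueues immediately and is right for the primary add from an event handler, while AddRateLimited applies per-key backoff and belongs only on failure requeues.

```go
package main

import (
	"fmt"
	"time"
)

// toyQueue mimics the shape of a rate-limited workqueue.
// failures tracks how many times each key was requeued after an error.
type toyQueue struct {
	items    []string
	failures map[string]int
}

func newToyQueue() *toyQueue {
	return &toyQueue{failures: map[string]int{}}
}

// Add enqueues immediately: correct for primary adds from event handlers.
func (q *toyQueue) Add(key string) {
	q.items = append(q.items, key)
}

// AddRateLimited delays the key with exponential backoff before enqueueing.
// Using it on the primary add path would needlessly delay fresh events.
func (q *toyQueue) AddRateLimited(key string) {
	q.failures[key]++
	delay := time.Duration(1<<uint(q.failures[key]-1)) * time.Millisecond
	time.AfterFunc(delay, func() { q.Add(key) })
}

// Forget resets the backoff state once a key syncs successfully.
func (q *toyQueue) Forget(key string) {
	delete(q.failures, key)
}

func main() {
	q := newToyQueue()
	q.Add("default/myjob") // primary add: no delay, no backoff state
	fmt.Println(len(q.items), len(q.failures)) // prints "1 0"
}
```

Rate-limiting the primary add would slow down handling of every legitimate create/update event, which is why it is reserved for the error path.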
	obj, exists, err := jm.jobStore.Store.GetByKey(key)
	if !exists {
		glog.V(4).Infof("Job has been deleted: %v", key)
		jm.expectations.DeleteExpectations(key)
		return nil
	}
	if err != nil {
-		glog.Errorf("Unable to retrieve job %v from store: %v", key, err)
-		jm.queue.Add(key)
+		utilruntime.HandleError(fmt.Errorf("Unable to retrieve job %v from store: %v", key, err))
Don't print the error twice. Just return the error from here and let the higher level logic report it.
Okay, fixed.
-		glog.Errorf("Couldn't get key for job %#v: %v", job, err)
-		return err
+		utilruntime.HandleError(fmt.Errorf("Couldn't get key for job %#v: %v", job, err))
+		// We explicitly return nil to avoid re-enqueue the bad key
Good comment
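The pattern being praised here, report a malformed key but return nil so it is never requeued, can be sketched as follows. The store, key format, and error values below are simplified stand-ins for the real controller's lister and errors, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// store is a stand-in for the job lister; a missing key means "not found".
var store = map[string]string{"default/pi": "job"}

var errLookup = errors.New("transient store error")

// syncJob distinguishes permanent problems (malformed key: return nil so
// the key is never requeued) from transient ones (lookup failure: return
// the error so the caller requeues with backoff).
func syncJob(key string, failLookup bool) error {
	if !strings.Contains(key, "/") {
		// A bad key will never become valid; requeueing it would hot-loop.
		fmt.Printf("couldn't parse key %q, dropping it\n", key)
		return nil
	}
	if failLookup {
		// Transient: surface the error and let the worker requeue.
		return fmt.Errorf("unable to retrieve job %v: %w", key, errLookup)
	}
	fmt.Println("synced", store[key])
	return nil
}

func main() {
	fmt.Println(syncJob("garbage", false) == nil)     // bad key: dropped, not requeued
	fmt.Println(syncJob("default/pi", true) != nil)   // transient: caller requeues
}
```

Returning nil for an unparseable key is what breaks the hot loop: the error is logged once and the key is forgotten instead of being re-added forever.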
	}
	jobNeedsSync := jm.expectations.SatisfiedExpectations(jobKey)
	selector, _ := unversioned.LabelSelectorAsSelector(job.Spec.Selector)
	pods, err := jm.podStore.Pods(job.Namespace).List(selector)
	if err != nil {
-		glog.Errorf("Error getting pods for job %q: %v", key, err)
-		jm.queue.Add(key)
+		utilruntime.HandleError(fmt.Errorf("Error getting pods for job %q: %v", key, err))
Don't print the error twice. Just return the error from here and let the higher level logic report it.
Okay, fixed.
@@ -419,7 +424,7 @@ func (jm *JobController) syncJob(key string) error {
	job.Status.Failed = failed

	if err := jm.updateHandler(&job); err != nil {
-		glog.Errorf("Failed to update job %v, requeuing. Error: %v", job.Name, err)
+		utilruntime.HandleError(fmt.Errorf("Failed to update job %v, requeuing. Error: %v", job.Name, err))
You should return the error here and let the requeuing be done by the error handling
key and jobKey look to be the same.
Hmmm... I've noticed we're copying this pattern in most (if not all) our controllers. Seems ok to remove.
@soltysh If you agree, I will remove these lines and simply use the key we passed to syncJob(key) in place of jobKey inside the function. Then I will simply use return err to replace these lines.
Go for it. Actually while fixing other controllers look up similar lines and remove them if possible, as well. Thanks!
Okay, will roll back :)
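The simplification agreed on above, reusing the key handed to syncJob instead of recomputing a second jobKey with the controller's KeyFunc, might look like the sketch below. The expectations map here is a bare stand-in for the controller's expectation tracker, not the real type:

```go
package main

import "fmt"

// expectations stands in for the controller's per-job expectation records,
// which are keyed by the same namespace/name string the workqueue uses.
var expectations = map[string]int{}

// syncJob receives the queue key directly; there is no need to rebuild a
// second jobKey from the object, since both come from the same KeyFunc.
func syncJob(key string) error {
	expectations[key] = 1 // record expectations under the queue's own key
	fmt.Println("syncing", key)
	return nil
}

func main() {
	syncJob("default/pi")
	fmt.Println(expectations["default/pi"]) // prints "1"
}
```

Since both the queue key and the recomputed jobKey come from the same KeyFunc, keeping only one removes a redundant call and a spurious error path.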
	}

func (jm *JobController) processNextWorkItem() bool {
	jKey, quit := jm.queue.Get()
just key. Nit on name, then lgtm. Go ahead and squash.
agree with @deads2k comments, thanks for picking this up.
@@ -40,6 +40,7 @@ import (
	"k8s.io/kubernetes/pkg/util/wait"
	"k8s.io/kubernetes/pkg/util/workqueue"
	"k8s.io/kubernetes/pkg/watch"
+	"fmt"
nit: we group imports. this should be in the lines 20-23 range. do you use goimports?
ok, yeah, just this one nit and david's comments, then this is lgtm.
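The import-grouping nit refers to the Go convention, enforced by goimports, of keeping standard-library imports in a block separate from external imports and sorting each group alphabetically; a stray "fmt" appended after the k8s.io imports belongs up top with the stdlib group. A tiny runnable illustration of the sorting goimports applies (the helper name is made up for this sketch):

```go
package main

// goimports keeps standard-library imports grouped apart from external
// imports such as the k8s.io/kubernetes packages in this file, and sorts
// each group alphabetically.
import (
	"fmt"
	"sort"
)

// sortedImports reproduces the alphabetical ordering goimports applies
// within each import group.
func sortedImports(paths []string) []string {
	sort.Strings(paths)
	return paths
}

func main() {
	fmt.Println(sortedImports([]string{"time", "fmt"})) // prints "[fmt time]"
}
```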
Force-pushed 1a50c4c to 8b18703
@deads2k @derekwaynecarr Comments addressed. Additionally, I modified a UT to pass CI and changed the copyright year to 2016. PTAL again.
Force-pushed 8b18703 to b047496
Force-pushed ce39e45 to 9b65522
@@ -1,5 +1,5 @@
/*
-Copyright 2014 The Kubernetes Authors.
+Copyright 2016 The Kubernetes Authors.
No need to change the year. That should stay as is.
Hmmm... Okay, but I wonder why?
> Hmmm... Okay, but I wonder why?
I think this file was originally authored (under a different name) in 2014.
I'd suspect to reflect the initial time this was committed. We never update those.
Force-pushed 9b65522 to 28a0acc
Change-Id: I55ce706381f1494e5cd2064177b938f56d9c356a
Force-pushed 28a0acc to 27cc90c
@soltysh comments addressed. PTAL. Thanks!
LGTM, will wait for @derekwaynecarr final signoff.
@k8s-bot ok to test
LGTM
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue |
What this PR does / why we need it:
Fix Job controller hotloop on unexpected API server rejections.
Which issue this PR fixes:
Related issue is #30629
Special notes for your reviewer:
@deads2k @derekwaynecarr PTAL.