Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Controller Manager] Fix job controller hot loop #32785

Merged

Conversation

m1093782566
Copy link
Contributor

@m1093782566 m1093782566 commented Sep 15, 2016

What this PR does / why we need it:

Fix Job controller hotloop on unexpected API server rejections.

Which issue this PR fixes

Related issue is #30629

Special notes for your reviewer:

@deads2k @derekwaynecarr PTAL.


This change is Reviewable

@k8s-ci-robot
Copy link
Contributor

Can a kubernetes member verify that this patch is reasonable to test? If so, please reply with "ok to test" on its own line.

Regular contributors should join the org to skip this step.

While we transition away from the Jenkins GitHub PR Builder plugin, "ok to test" commenters will need to be on the admin list defined in this file.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Sep 15, 2016
@deads2k deads2k self-assigned this Sep 15, 2016
@@ -167,7 +173,7 @@ func (jm *JobController) getPodJob(pod *api.Pod) *batch.Job {
return nil
}
if len(jobs) > 1 {
glog.Errorf("user error! more than one job is selecting pods with labels: %+v", pod.Labels)
utilruntime.HandleError(fmt.Errorf("user error! more than one job is selecting pods with labels: %+v", pod.Labels))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh why do we have a message here that only a cluster-admin can see, but only a project-admin can fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the manual label selector which will soon be gone when I remove extensions.Job. Backwards compatibility, mainly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I keep the error message here or just remove it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep it for now, extensions.Jobs are not removed, yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@@ -271,27 +277,35 @@ func (jm *JobController) enqueueController(obj interface{}) {
// all controllers there will still be some replica instability. One way to handle this is
// by querying the store for all controllers that this rc overlaps, as well as all
// controllers that overlap this rc, and sorting them.
jm.queue.Add(key)
jm.queue.AddRateLimited(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be ratelimited. This is a primary add

Copy link
Contributor Author

@m1093782566 m1093782566 Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, will fix.

obj, exists, err := jm.jobStore.Store.GetByKey(key)
if !exists {
glog.V(4).Infof("Job has been deleted: %v", key)
jm.expectations.DeleteExpectations(key)
return nil
}
if err != nil {
glog.Errorf("Unable to retrieve job %v from store: %v", key, err)
jm.queue.Add(key)
utilruntime.HandleError(fmt.Errorf("Unable to retrieve job %v from store: %v", key, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't print the error twice. Just return the error from here and let the higher level logic report it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fixed.

glog.Errorf("Couldn't get key for job %#v: %v", job, err)
return err
utilruntime.HandleError(fmt.Errorf("Couldn't get key for job %#v: %v", job, err))
// We explicitly return nil to avoid re-enqueue the bad key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comment

}
jobNeedsSync := jm.expectations.SatisfiedExpectations(jobKey)
selector, _ := unversioned.LabelSelectorAsSelector(job.Spec.Selector)
pods, err := jm.podStore.Pods(job.Namespace).List(selector)
if err != nil {
glog.Errorf("Error getting pods for job %q: %v", key, err)
jm.queue.Add(key)
utilruntime.HandleError(fmt.Errorf("Error getting pods for job %q: %v", key, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't print the error twice. Just return the error from here and let the higher level logic report it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fixed.

@@ -419,7 +424,7 @@ func (jm *JobController) syncJob(key string) error {
job.Status.Failed = failed

if err := jm.updateHandler(&job); err != nil {
glog.Errorf("Failed to update job %v, requeuing. Error: %v", job.Name, err)
utilruntime.HandleError(fmt.Errorf("Failed to update job %v, requeuing. Error: %v", job.Name, err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should return the error here and let the requeuing be done by the error handling

Copy link
Contributor Author

@m1093782566 m1093782566 Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I wonder if key is same with jobKey? Since jm.queue.AddRateLimited(jKey) will enqueue the former.

If the same, why we do controller.keyFunc(&job) here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key and jobKey look to be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... I've noticed we're copying this pattern in most (if not all) our controllers. Seems ok to remove.

Copy link
Contributor Author

@m1093782566 m1093782566 Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh If you agree, I will remove these lines and simply use key we passed to syncJob(key) to replace jobKey inside the function. Then, I will simply use return err to replace these lines

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go for it. Actually while fixing other controllers look up similar lines and remove them if possible, as well. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will roll back :)

}

func (jm *JobController) processNextWorkItem() bool {
jKey, quit := jm.queue.Get()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just key

@deads2k
Copy link
Contributor

deads2k commented Sep 19, 2016

nit on name, then lgtm. Go ahead and squash.

@derekwaynecarr
Copy link
Member

agree with @deads2k comments, thanks for picking this up.

@derekwaynecarr derekwaynecarr added this to the v1.5 milestone Sep 19, 2016
@derekwaynecarr derekwaynecarr added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Sep 19, 2016
@@ -40,6 +40,7 @@ import (
"k8s.io/kubernetes/pkg/util/wait"
"k8s.io/kubernetes/pkg/util/workqueue"
"k8s.io/kubernetes/pkg/watch"
"fmt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we group imports. this should be in lines 20-23 range. do you use goimports?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, yeah, just this one nit and david's comments, then this is lgtm.

@m1093782566 m1093782566 force-pushed the m109-job-controller-hot-loop branch 3 times, most recently from 1a50c4c to 8b18703 Compare September 20, 2016 03:24
@m1093782566
Copy link
Contributor Author

m1093782566 commented Sep 20, 2016

@deads2k @derekwaynecarr Comments addressed.

Additional, I modify a UT to pass CI and modify copyright year to 2016. PTAL again.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2016
@m1093782566 m1093782566 force-pushed the m109-job-controller-hot-loop branch from 8b18703 to b047496 Compare September 20, 2016 03:31
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 20, 2016
@m1093782566 m1093782566 force-pushed the m109-job-controller-hot-loop branch 2 times, most recently from ce39e45 to 9b65522 Compare September 20, 2016 03:58
@@ -1,5 +1,5 @@
/*
Copyright 2014 The Kubernetes Authors.
Copyright 2016 The Kubernetes Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to change the year. That should stay as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... Okay, but I wonder why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... Okay, but I wonder why?

I think this file was originally authored (under a different name) in 2014.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suspect to reflect the initial time this was committed. We never update those.

@m1093782566 m1093782566 force-pushed the m109-job-controller-hot-loop branch from 9b65522 to 28a0acc Compare September 20, 2016 14:23
Change-Id: I55ce706381f1494e5cd2064177b938f56d9c356a
@m1093782566 m1093782566 force-pushed the m109-job-controller-hot-loop branch from 28a0acc to 27cc90c Compare September 20, 2016 14:25
@m1093782566
Copy link
Contributor Author

@soltysh comments addressed. PTAL.

Thanks!

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, will wait for @derekwaynecarr final signoff.

@soltysh
Copy link
Contributor

soltysh commented Sep 20, 2016

@k8s-bot ok to test

@derekwaynecarr derekwaynecarr added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2016
@derekwaynecarr
Copy link
Member

LGTM

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit ad7ba62 into kubernetes:master Sep 20, 2016
@m1093782566 m1093782566 deleted the m109-job-controller-hot-loop branch September 21, 2016 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants