Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controllers hotloop on unexpected API server rejections #30629

Closed
8 tasks done
deads2k opened this issue Aug 15, 2016 · 20 comments
Closed
8 tasks done

Controllers hotloop on unexpected API server rejections #30629

deads2k opened this issue Aug 15, 2016 · 20 comments
Assignees
Labels
area/controller-manager sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone

Comments

@deads2k
Copy link
Contributor

deads2k commented Aug 15, 2016

@deads2k
Copy link
Contributor Author

deads2k commented Aug 15, 2016

@kubernetes/rh-cluster-infra @ncdc probably need to do a survey to find other offenders.

I'm still trying to work out what happens to controllers which are denies on pod changes and don't retry. They don't hotloop, but are they stuck until the resync comes? There was a pull to remove resycning and people were declaring their controllers safe, but if I'm reading the code correctly, those people are mistaken.

@derekwaynecarr derekwaynecarr added this to the v1.4 milestone Aug 15, 2016
@bgrant0607 bgrant0607 added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed team/control-plane labels Aug 15, 2016
@bgrant0607
Copy link
Member

Related: #21912

cc @erictune @pwittrock

@derekwaynecarr
Copy link
Member

I agree we need a roll-up mechanism, but we definitely need to move all pertinent controllers to a rate limited queue in interim.

@adohe-zz
Copy link

/cc

@k8s-github-robot k8s-github-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Aug 25, 2016
@erictune
Copy link
Member

@foxish did you say you looked at a rate limiting issue recently? Does it relate to this issue?

@foxish
Copy link
Contributor

foxish commented Aug 29, 2016

@erictune I did fix for petsets, and a bug which affects the rate-limiting work-queue implementation. They're all related in that I think the effort now is just switching the other controllers to use the same work-queue implementation with back-off.

@derekwaynecarr
Copy link
Member

derekwaynecarr commented Sep 12, 2016

I would like this to be fixed in Kubernetes 1.5, and will look to get people from our team at Red Hat to fix it. Community help is welcome in the interim.

@derekwaynecarr derekwaynecarr self-assigned this Sep 12, 2016
k8s-github-robot pushed a commit that referenced this issue Sep 13, 2016
Automatic merge from submit-queue

update error handling for daemoncontroller

Updates the DaemonSet controller to cleanly requeue with ratelimiting on errors, make use of the `utilruntime.HandleError` consistently, and wait for preconditions before doing work.

@ncdc @liggitt @sttts My plan is to use this one as an example of how to handle requeuing, preconditions, and processing error handling.
@foxish fyi

related to #30629
@m1093782566
Copy link
Contributor

m1093782566 commented Sep 13, 2016

@deads2k @derekwaynecarr

I am interested in this job and I find there are some controllers which are still using workqueue.Type instead of workqueue.RateLimitingInterface. Here comes the list,

If there is no volunteer, I will be happy to fix them.

Of course, before doing it, I would like to receive comments from you :)

@deads2k
Copy link
Contributor Author

deads2k commented Sep 13, 2016

Of course, before doing it, I would like to receive comments from you :)

Help would be very welcome. I'd recommend trying to tackle a single controller first. You can take a look at the daemonset controller to see some of the tweaks

  1. Waiting for caches to be ready: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemoncontroller.go#L245
  2. Processing and error handling methods: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemoncontroller.go#L261-L284

@m1093782566
Copy link
Contributor

@deads2k very appreciated!

@derekwaynecarr
Copy link
Member

@m1093782566 -- please cc me and @deads2k on any PRs you open so we can track the work. A PR per controller you tackle is ideal. Thanks!

@m1093782566
Copy link
Contributor

sure

@m1093782566
Copy link
Contributor

m1093782566 commented Sep 14, 2016

@deads2k @derekwaynecarr

The first PR for fixing certificate controller hot loop is #32664, PTAL.

I will open other PRs if you approve that.

@timothysc
Copy link
Member

/cc @rrati

@m1093782566
Copy link
Contributor

m1093782566 commented Sep 15, 2016

@deads2k @derekwaynecarr

The 2nd PR for fixing endpoint controller hot loop is #32776, PTAL.

@m1093782566
Copy link
Contributor

m1093782566 commented Sep 15, 2016

@deads2k @derekwaynecarr

The 3rd PR for fixing job controller hot loop is #32785, PTAL.

@m1093782566
Copy link
Contributor

m1093782566 commented Sep 16, 2016

The 4th PR for fixing disruption controller hot loop is #32850

@m1093782566
Copy link
Contributor

@deads2k

replica set controller: (this one doesn't seem to retry on failed pod creates. I wonder why that is)

When I look at the code of replication controller and job controller(as they are familiar with replica set controller), I find they both retry on failed pod creates.

  1. replication controller, see https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replication/replication_controller.go#L601
  2. job controller, see https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/jobcontroller.go#L422

So, I assume replica set controller doesn't retry is a wrong behavior?

@deads2k
Copy link
Contributor Author

deads2k commented Sep 16, 2016

So, I assume replica set controller doesn't retry is a wrong behavior?

Yeah, it's probably wrong or unintentional.

k8s-github-robot pushed a commit that referenced this issue Sep 18, 2016
…er-hotloop

Automatic merge from submit-queue

[Controller Manager] Fix endpoint controller hot loop and use utilruntime.HandleError to replace glog.Errorf

<!--  Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md
2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md
3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes
-->

**Why**:

Fix endpoint controller hot loop and use `utilruntime.HandleError` to replace `glog.Errorf`

**What**

1. Fix endpoint controller hot loop in `pkg/controller/endpoint`

2.  Fix endpoint controller hot loop in `contrib/mesos/pkg/service`

3. Sweep cases of `glog.Errorf` and use `utilruntime.HandleError` instead.

**Which issue this PR fixes**

Fixes #32843
Related issue is #30629 

**Special notes for your reviewer**:

@deads2k @derekwaynecarr 

The changes on `pkg/controller/endpoints_controller.go` and `contrib/mesos/pkg/service/endpoints_controller.go` are almost the same except `contrib/mesos/pkg/service/endpoints_controller.go` does not pass `podInformer` as the parameter of `NewEndpointController()`. 

So, I didn't wait `podStoreSynced` before `syncService()`(Just leave it as it was). Will it lead to a problem?
k8s-github-robot pushed a commit that referenced this issue Sep 18, 2016
Automatic merge from submit-queue

[Controller Manager] Fix certificates controller hotloop and use utilruntime.HandleError to replace glog.Errorf

<!--  Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md
2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md
3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes
-->

**What this PR does / why we need it**:

Fix certificates controller hotloop on unexpected API server rejections.

**Which issue this PR fixes** 

Related issue is #30629

**Special notes for your reviewer**:

@deads2k @derekwaynecarr PTAL.

I find there is no unit test for certificates controller, and I will implement unit tests for it later.
k8s-github-robot pushed a commit that referenced this issue Sep 20, 2016
Automatic merge from submit-queue

[Controller Manager] Fix job controller hot loop

<!--  Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md
2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md
3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes
-->

**What this PR does / why we need it:**

Fix Job controller hotloop on unexpected API server rejections.

**Which issue this PR fixes**

Related issue is #30629

**Special notes for your reviewer:**

@deads2k @derekwaynecarr PTAL.
k8s-github-robot pushed a commit that referenced this issue Sep 29, 2016
Automatic merge from submit-queue

fix disruption controller hotloop

<!--  Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md
2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md
3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes
-->


Fix disruption controller hotloop on unexpected API server rejections.

**Which issue this PR fixes** 

Related issue is #30629

**Special notes for your reviewer**:

@deads2k @derekwaynecarr PTAL.
k8s-github-robot pushed a commit that referenced this issue Oct 8, 2016
Automatic merge from submit-queue

fix replica set hot loop

<!--  Thanks for sending a pull request!  Here are some tips for you:
1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md
2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md
3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes
-->

**What this PR does / why we need it**:

Fix replicas set hot loop.

Related issue: #30629
@ncdc
Copy link
Member

ncdc commented Nov 10, 2016

Looks like everything is done.

@ncdc ncdc closed this as completed Nov 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller-manager sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Projects
None yet
Development

No branches or pull requests