Controllers hotloop on unexpected API server rejections #30629
Comments
@kubernetes/rh-cluster-infra @ncdc we probably need to do a survey to find other offenders. I'm still trying to work out what happens to controllers which are denied on pod changes and don't retry. They don't hotloop, but are they stuck until the resync comes? There was a pull to remove resyncing and people were declaring their controllers safe, but if I'm reading the code correctly, those people are mistaken. |
Related: #21912 |
I agree we need a roll-up mechanism, but we definitely need to move all pertinent controllers to a rate-limited queue in the interim. |
/cc |
@foxish did you say you looked at a rate limiting issue recently? Does it relate to this issue? |
@erictune I did a fix for petsets, and for a bug which affects the rate-limiting work-queue implementation. They're all related in that I think the effort now is just switching the other controllers to use the same work-queue implementation with back-off. |
I would like this to be fixed in Kubernetes 1.5, and will look to get people from our team at Red Hat to fix it. Community help is welcome in the interim. |
Automatic merge from submit-queue update error handling for daemoncontroller Updates the DaemonSet controller to cleanly requeue with rate limiting on errors, make consistent use of `utilruntime.HandleError`, and wait for preconditions before doing work. @ncdc @liggitt @sttts My plan is to use this one as an example of how to handle requeuing, preconditions, and processing error handling. @foxish fyi related to #30629
I am interested in this job and I find there are some controllers which are still using
If there is no volunteer, I will be happy to fix them. Of course, before doing it, I would like to receive comments from you :) |
Help would be very welcome. I'd recommend trying to tackle a single controller first. You can take a look at the daemonset controller to see some of the tweaks
|
@deads2k much appreciated! |
@m1093782566 -- please cc me and @deads2k on any PRs you open so we can track the work. A PR per controller you tackle is ideal. Thanks! |
sure |
The first PR for fixing certificate controller hot loop is #32664, PTAL. I will open other PRs if you approve that. |
/cc @rrati |
The 2nd PR for fixing endpoint controller hot loop is #32776, PTAL. |
The 3rd PR for fixing job controller hot loop is #32785, PTAL. |
The 4th PR for fixing disruption controller hot loop is #32850 |
When I look at the code of the replication controller and job controller (as they are similar to the replica set controller), I find they both retry on failed pod creates.
So, I assume it is wrong behavior that the replica set controller doesn't retry? |
Yeah, it's probably wrong or unintentional. |
…er-hotloop Automatic merge from submit-queue [Controller Manager] Fix endpoint controller hot loop and use utilruntime.HandleError to replace glog.Errorf
**Why**: Fix endpoint controller hot loop and use `utilruntime.HandleError` to replace `glog.Errorf`
**What**
1. Fix endpoint controller hot loop in `pkg/controller/endpoint`
2. Fix endpoint controller hot loop in `contrib/mesos/pkg/service`
3. Sweep cases of `glog.Errorf` and use `utilruntime.HandleError` instead.
**Which issue this PR fixes** Fixes #32843 Related issue is #30629
**Special notes for your reviewer**: @deads2k @derekwaynecarr The changes on `pkg/controller/endpoints_controller.go` and `contrib/mesos/pkg/service/endpoints_controller.go` are almost the same, except that `contrib/mesos/pkg/service/endpoints_controller.go` does not pass `podInformer` as a parameter of `NewEndpointController()`. So I didn't wait for `podStoreSynced` before `syncService()` (I just left it as it was). Will that lead to a problem?
Automatic merge from submit-queue [Controller Manager] Fix certificates controller hotloop and use utilruntime.HandleError to replace glog.Errorf
**What this PR does / why we need it**: Fix certificates controller hotloop on unexpected API server rejections.
**Which issue this PR fixes** Related issue is #30629
**Special notes for your reviewer**: @deads2k @derekwaynecarr PTAL. I find there is no unit test for certificates controller, and I will implement unit tests for it later.
Automatic merge from submit-queue [Controller Manager] Fix job controller hot loop
**What this PR does / why we need it:** Fix Job controller hotloop on unexpected API server rejections.
**Which issue this PR fixes** Related issue is #30629
**Special notes for your reviewer:** @deads2k @derekwaynecarr PTAL.
Automatic merge from submit-queue fix disruption controller hotloop
Fix disruption controller hotloop on unexpected API server rejections.
**Which issue this PR fixes** Related issue is #30629
**Special notes for your reviewer**: @deads2k @derekwaynecarr PTAL.
Automatic merge from submit-queue fix replica set hot loop
**What this PR does / why we need it**: Fix replica set hot loop. Related issue: #30629
Looks like everything is done. |
Many controllers handle retries in an ad hoc manner. We now have a `workqueue.RateLimitingInterface` for this purpose.