Maximum number of failures or failure backoff policy for Jobs #30243
Comments
@thockin, @maximz re your Twitter convo: the docs (http://kubernetes.io/docs/user-guide/jobs/) imply that restartPolicy = Never is supported for Jobs, which I found misleading.
If restartPolicy = Never shouldn't be allowed, and the goal here is for jobs to run until completion, then the docs need to change along with the feature request for max failures / retries.
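For reference, this is the kind of spec the confusion is about; a minimal sketch with placeholder names and image, not taken from the docs or this thread. The API accepts `restartPolicy: Never` in a Job's pod template, but the Job controller still creates replacement pods when one fails:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # placeholder name
spec:
  template:
    spec:
      containers:
      - name: worker           # placeholder container
        image: busybox
        command: ["sh", "-c", "exit 1"]   # always fails, to show the retry behaviour
      restartPolicy: Never     # valid here, but the Job controller still spawns new pods on failure
```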
@erictune and I authored Jobs. I'll be out for the next two weeks, but here are my quick thoughts. Currently …
This is a much-needed feature for us, as we have several things we would like to run via the Jobs structure, but we cannot, since they would get restarted on failure, which is not desirable in a few situations for us.
I support adding a backoff policy to Jobs. I think exponential backoff with a max of about 5 minutes would be fine. I don't think it would be a breaking change to introduce a backoff policy by default. I suspect some users might want fast retry, but we can wait on adding a field until they do.

I also support having a "max failed pods to keep around", which would cause the job controller to garbage collect some failed pods before creating new ones. At a minimum, keeping the first and the last failed pod would be useful for debugging, but keeping something like 1000 failed pods is usually not useful, especially if parallelism is 1. I'm not sure if we can change this to be a default, but we can definitely make it a knob.

I'd want to discuss it a bit more, but I'd also be open to a field whose meaning is "the job can move to a failure state after this many pod failures". We already have a way to fail after a certain amount of time. I am not very enthusiastic about a "max pod failures before job fails" feature; in particular, I don't know that we can easily guarantee that a job only ever tries one time.
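For reference, the existing "fail after a certain amount of time" mechanism mentioned above is `activeDeadlineSeconds` on the Job spec; a minimal sketch with placeholder names and numbers:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deadline-example       # placeholder name
spec:
  activeDeadlineSeconds: 600   # the Job is marked failed once it has been active for 10 minutes
  template:
    spec:
      containers:
      - name: worker           # placeholder container
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
      restartPolicy: Never
```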
In this comment, #24533 (comment), Brian Grant says there should be backoff with no knob.
A default pod object quota per namespace would also help protect the apiserver from ending up with too many jobs. Users could always raise their quota if needed.
As I understand it, the pod object quota only applies to non-terminal pods:
http://kubernetes.io/docs/admin/resourcequota/ And currently there's no way to limit the total number of pods, but I guess that would be a separate issue.
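A sketch of the per-namespace quota idea using the standard `ResourceQuota` object (the name, namespace, and number here are only illustrative); as noted above, the `pods` count applies to non-terminal pods only:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-quota              # placeholder name
  namespace: batch-jobs        # placeholder namespace
spec:
  hard:
    pods: "100"                # caps non-terminal pods in the namespace; terminated pods are not counted
```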
The completed pods should be cleared by the GC at some point.
What is the current state of this issue? We are really interested in having such behaviour, because we are going to use Kubernetes to execute the steps of a workflow, and if a certain step fails, what we would expect is that the job stops and an option to get a log specifying the failure reason is available. Also, addressed to @erictune regarding his comment about jobs and the apiserver: what would be considered too many jobs for the apiserver? We are basing the execution of the steps of several workflows on Kubernetes.
@diegodelemos you're welcome to submit a patch with an appropriate fix at any time :)
Hi @diegodelemos @soltysh, what is the status of this issue? I also have a strong need for this.
@Yancey1989 you're welcome to submit a patch as well 😃
@bgrant0607 I wish I could, but I'm out of time :( Just recently somebody opened #41451, but that is completely not what we want. I asked the original submitter to discuss the proposed solution here, but haven't heard from him since.
Hi @soltysh @nickschuch, a bunch of people on our side (CERN) are also interested in this (@diegodelemos @rochaporto), and we'd be interested to help.
No worries, I've blocked out a couple of hours for tomorrow; I'll have a first crack at the proposal and go from there. @lukasheinrich that would be awesome!
@lukasheinrich awesome, the more people give feedback now, while we're shaping the design, the better!
I've created this proposal to address the issue: kubernetes/community#583
@soltysh any chance of getting this implemented? This has started to bite us too many times :)
I'll do my best, but I can't promise anything, because I'm traveling this week and next.
We are still waiting for the proposal to be agreed upon, right? I'm happy to write the code (I've written some already); I'm just not very familiar with the procedures, sorry.
@sdminonne it looks like we're not gonna make it :( sorry
What is the status of this issue? I really want to use this feature. If I run a job, I just want to know whether it succeeded or failed, with no more retries; if it keeps retrying, it will just keep failing.
Automatic merge from submit-queue. Backoff policy and failed pod limit. This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964. @erictune I've mentioned this to you during the last sig-apps, ptal. @kubernetes/sig-apps-feature-requests ptal. @lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi, since you were all interested in this.
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920). [API] Feature/job failure policy

**What this PR does / why we need it**: Implements the backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: fixes #27997, fixes #30243

**Special notes for your reviewer**: This is a WIP PR. I updated the api batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```
Sometimes, as happened to me, a Pod's `restartPolicy` is mistakenly taken as the corresponding Job's restart policy. That was concluded before, here: https://github.com/kubernetes/community/pull/583/files. The confusion happened here: kubernetes/kubernetes#30243, kubernetes/kubernetes#43964, and here: jaegertracing/jaeger-kubernetes#32. This commit tries to clarify that there is no `restartPolicy` for the Job itself, and that using either of `backoffLimit` and `activeDeadlineSeconds` may result in permanent failure.
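Once the change above landed, the two knobs look roughly like this; a sketch with placeholder names and numbers, not a spec taken from the PR itself:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bounded-retries        # placeholder name
spec:
  backoffLimit: 4              # mark the Job failed after 4 retries (the default is 6)
  activeDeadlineSeconds: 3600  # independently, fail the Job once it has been active for an hour
  template:
    spec:
      containers:
      - name: worker           # placeholder container
        image: busybox
        command: ["sh", "-c", "exit 1"]
      restartPolicy: Never     # applies to the pod, not the Job; the Job gives up via backoffLimit
```

Setting `backoffLimit: 0` should give the "just tell me success or failure, no retries" behaviour asked for earlier in the thread.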
A max number of failures or failure backoff policy for Jobs would be useful.
Imagine you have an ETL job in production that's failing due to some pathological input.
By the time you spot it, it's already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever -- and crashed the Kubernetes dashboard in the process.
Today, "restartPolicy" is not respected by Jobs because the goal is to achieve successful completion. (Strangely, "restartPolicy: Never" is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they've been scheduled on. Deletes are rate limited, and in v1.3+, the verbose "you're being throttled" messages are hidden from you when you run the
kubectl
command to delete a job. So it just looks like it's taking forever! This is not a pleasant UX if you have a runaway job in prod or if you're testing out a new job and it takes minutes to clean up after a broken test.What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?
(cc @thockin, who suggested I file this feature request)
Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which keeps restarting the container in the same pod rather than scheduling new pods -- i.e. it prevents the explosion in the number of pods. And deadlines can help weed out failures after a certain amount of time.
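A sketch of that workaround with placeholder names; `restartPolicy: OnFailure` makes the kubelet restart the container inside the same pod, so failures don't pile up as new pod objects:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: onfailure-workaround   # placeholder name
spec:
  template:
    spec:
      containers:
      - name: worker           # placeholder container
        image: busybox
        command: ["sh", "-c", "exit 1"]
      restartPolicy: OnFailure # retries stay inside one pod instead of spawning new ones
```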
However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I'd rather specify a maximum number of retries than a high deadline.