
Maximum number of failures or failure backoff policy for Jobs #30243

Closed
maximz opened this issue Aug 8, 2016 · 31 comments · Fixed by #48075
Labels
area/batch sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@maximz commented Aug 8, 2016

A max number of failures or failure backoff policy for Jobs would be useful.

Imagine you have an ETL job in production that's failing due to some pathological input.
By the time you spot it, it's already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever -- and crashed the Kubernetes dashboard in the process.

Today, "restartPolicy" is not respected by Jobs because the goal is to achieve successful completion. (Strangely, "restartPolicy: Never" is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they've been scheduled on. Deletes are rate limited, and in v1.3+, the verbose "you're being throttled" messages are hidden from you when you run the kubectl command to delete a job. So it just looks like it's taking forever! This is not a pleasant UX if you have a runaway job in prod or if you're testing out a new job and it takes minutes to clean up after a broken test.

What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?

(cc @thockin , who suggested I file this feature request)


Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which restarts the containers in the same pods rather than creating new ones -- i.e. it prevents the explosion in the number of pods. And deadlines can help weed out failures after a certain amount of time.

However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I'd rather specify a maximum number of retries than a high deadline.
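
To make that workaround concrete, here is a minimal sketch of a Job that relies on OnFailure restarts plus a deadline instead of a retry cap (the job name, image, and timeout value below are placeholders, not from a real manifest):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # placeholder name
spec:
  activeDeadlineSeconds: 4200     # give the ~1h job some headroom, then mark it failed
  template:
    spec:
      restartPolicy: OnFailure    # restart the container in the same pod instead of creating new pods
      containers:
      - name: etl
        image: example.com/etl:latest   # placeholder image
```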

@cioc commented Aug 9, 2016

@thockin, @maximz re your Twitter convo: the docs (http://kubernetes.io/docs/user-guide/jobs/) imply that restartPolicy = Never is supported for jobs, which I found misleading.

  1. The first example shows that restartPolicy = Never
  2. The Pod Template section reads, "Only a RestartPolicy equal to Never or OnFailure are allowed."
  3. The Handling Pod and Container failures section reads, "Therefore, your program needs to handle the case when it is restarted locally, or else specify .spec.template.containers[].restartPolicy = "Never""

If restartPolicy = Never shouldn't be allowed and the goal here is that jobs run until completion, then the docs need to change along with the feature request of max failures / retries.
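
For reference, the first example in those docs shows a Job roughly like the following (reproduced approximately from memory, not verbatim), with `restartPolicy: Never` in the pod template:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never     # the restart policy the docs show for Jobs
```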

@soltysh soltysh self-assigned this Aug 12, 2016
@soltysh (Contributor) commented Aug 12, 2016

@erictune and I authored Jobs. I'll be out for the next two weeks, but here are my quick thoughts. Currently restartPolicy hardly applies to Jobs at all; it's just passed along to the kubelet when creating pods. To address your concern we'd have to come up with a policy on the Job itself that lets you express how the job should react to failures. Overall, I'd like to see such a feature. I'm not sure how Eric sees it, though.

@sstarcher commented:

This is a much-needed feature for us. We have several workloads we would like to run as Jobs, but we can't, because they would get restarted on failure, which is undesirable in a few of our situations.

@erictune (Member) commented:

I support adding a backoff policy to Jobs. I think exponential backoff with a max of like 5min would be fine. I don't think it would be a breaking change to introduce a backoff policy by default. I suspect some users might want fast retry, but we can wait on adding a field until they do.

I also support having a "max failed pods to keep around", which would cause the job controller to garbage collect some failed pods before creating new ones. At a minimum, keeping the first and the last failed pod would be useful for debugging. But keeping like 1000 failed pods is usually not useful. Especially if parallelism is 1. I'm not sure if we can change this to be a default, but we can definitely make it a knob.

I'd want to discuss it a bit more, but I'd also be open to a field whose meaning is "the job can move to a failure state after this many pod failures". We already have a way to fail after a certain amount of time.

I am not very enthusiastic about a "max pod failures before job fails" feature. In particular, I don't know that we can easily guarantee that a job only ever tries one time.

@erictune (Member) commented:

In this comment: #24533 (comment) Brian Grant says there should be backoff with no knob.

@erictune (Member) commented Sep 1, 2016

A default pods object quota per namespace would also help protect the apiserver from ending up with too many jobs. Users could always raise their quota if needed.

@hanikesn commented Oct 14, 2016

As I understand it, the pod object quota only applies to non-terminal pods:

> The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if status.phase in (Failed, Succeeded) is true.

http://kubernetes.io/docs/admin/resourcequota/

And currently there's no way to limit the total number of pods. But I guess this would be another issue.
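
For illustration, the kind of per-namespace quota @erictune suggested would be a ResourceQuota roughly like this sketch (namespace and limit are placeholder values), keeping in mind it only counts pods in a non-terminal state:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-quota          # placeholder name
  namespace: workflows     # placeholder namespace
spec:
  hard:
    pods: "100"            # caps non-terminal pods; Failed/Succeeded pods are not counted
```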

@soltysh (Contributor) commented Oct 15, 2016

> And currently there's no way to limit the total number of pods. But I guess this would be another issue.

The completed pods should be cleared by the gc at some point in time.

@diegodelemos commented:

What is the current state of this issue? We are really interested in having this behaviour: we are going to use Kubernetes to execute the steps of a workflow, and if a certain step fails, we would expect the job to stop and an option to be available to retrieve a log explaining the failure.

Also, a question for @erictune regarding his comment about jobs and the apiserver: what would be considered too many jobs for the apiserver? We are basing the execution of the steps of several workflows on Kubernetes.

@soltysh (Contributor) commented Oct 20, 2016

@diegodelemos you're welcome to submit a patch with an appropriate fix at any time :)

@Yancey1989 (Contributor) commented:

Hi @diegodelemos @soltysh, what's the status of this issue? We also have a strong need for it.

@soltysh (Contributor) commented Jan 4, 2017

@Yancey1989 you're welcome to submit a patch as well 😃

@bgrant0607 (Member) commented:

@soltysh Are you working on this?

cc @janetkuo

@soltysh (Contributor) commented Mar 1, 2017

@bgrant0607 I wish, but I'm out of time :( Somebody recently opened #41451, but that is not at all what we want. I asked the original submitter to discuss the proposed solution here, but I haven't heard from him since.

@lukasheinrich commented:

Hi @soltysh @nickschuch, a number of people on our side (CERN) are also interested in this (@diegodelemos @rochaporto), and we'd be interested in helping.

@nickschuch commented:

No worries, I've blocked out a couple of hours for tomorrow; I'll take a first crack at the proposal and go from there.

@lukasheinrich that would be awesome!

@soltysh (Contributor) commented Apr 6, 2017

@lukasheinrich awesome, the more people give feedback now, while we're shaping the design, the better!

@soltysh (Contributor) commented Apr 26, 2017

I've created this proposal to address the issue: kubernetes/community#583

@sdminonne (Contributor) commented:

@soltysh any chance of getting this implemented? It's starting to bite us a bit too often :)
so... +1

@soltysh (Contributor) commented May 18, 2017

I'll do my best, but I can't promise anything, because I'm traveling this week and next.

@nickschuch commented:

We are still waiting for this proposal to be agreed upon, right? I'm happy to write the code (I've written some already), but I'm not very familiar with the procedures, sorry.

@soltysh (Contributor) commented May 19, 2017

@sdminonne it looks like we're not going to make it :( sorry

@PeiKevin commented:

What's the status of this issue? I really want to use this feature. When I run a job, I just want to know whether it succeeded or failed, with no further retries; if it keeps retrying, it will just keep failing.

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 28, 2017
Automatic merge from submit-queue

Backoff policy and failed pod limit

This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964.

@erictune I've mentioned this to you during last sig-apps, ptal
@kubernetes/sig-apps-feature-requests ptal
@lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this
@soltysh (Contributor) commented Aug 30, 2017

API PR: #48075
Controller PR: #51153

k8s-github-robot pushed a commit that referenced this issue Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

[API] Feature/job failure policy

**What this PR does / why we need it**: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: 
fixes #27997, fixes #30243

**Special notes for your reviewer**:
This is a WIP PR, I updated the api batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```
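
As a rough sketch of how the new knob is meant to be used (job name and image below are placeholders), a Job that gives up after a few pod failures would look something like:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # placeholder name
spec:
  backoffLimit: 3                 # number of retries before the Job is marked as failed
  activeDeadlineSeconds: 3600     # optional: also cap the Job's total runtime
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: etl
        image: example.com/etl:latest   # placeholder image
```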
justaugustus pushed a commit to justaugustus/enhancements that referenced this issue Sep 3, 2018
jgehrcke added a commit to jgehrcke/website that referenced this issue Jan 10, 2020
Sometimes, as it happened to me, a Pod's `restartPolicy` 
is mistakenly taken as the corresponding Job's restart policy.

That was concluded before, here:
https://github.com/kubernetes/community/pull/583/files

The confusion happened here:
kubernetes/kubernetes#30243
kubernetes/kubernetes#43964

And here:
jaegertracing/jaeger-kubernetes#32

This commit tries to clarify that there is no `restartPolicy` for
the job itself, and that using either of `backoffLimit` and
`activeDeadlineSeconds` may result in permanent failure.
k8s-ci-robot pushed a commit to kubernetes/website that referenced this issue Jan 15, 2020
wawa0210 pushed a commit to wawa0210/website that referenced this issue Mar 2, 2020
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Nov 30, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021