
Maximum number of failures or failure backoff policy for Jobs #30243

Closed
maximz opened this issue Aug 8, 2016 · 31 comments · Fixed by #48075
Labels
area/batch sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@maximz commented Aug 8, 2016

A max number of failures or failure backoff policy for Jobs would be useful.

Imagine you have an ETL job in production that's failing due to some pathological input.
By the time you spot it, it's already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever -- and crashed the Kubernetes dashboard in the process.

Today, "restartPolicy" is not respected by Jobs because the goal is to achieve successful completion. (Strangely, "restartPolicy: Never" is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they've been scheduled on. Deletes are rate limited, and in v1.3+, the verbose "you're being throttled" messages are hidden from you when you run the kubectl command to delete a job. So it just looks like it's taking forever! This is not a pleasant UX if you have a runaway job in prod or if you're testing out a new job and it takes minutes to clean up after a broken test.

What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?

(cc @thockin , who suggested I file this feature request)


Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which restarts the containers in the same pods rather than creating new ones -- i.e. it prevents the explosion in the number of pods. And deadlines can help weed out failures after a certain amount of time.

However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I'd rather specify a maximum number of retries than a high deadline.
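
To make that workaround concrete, here is a minimal sketch of a Job that relies on OnFailure restarts plus a deadline instead of a retry cap (the job name, image, and timeout value below are placeholders, not from a real manifest):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # placeholder name
spec:
  activeDeadlineSeconds: 4200     # give the ~1h job some headroom, then mark it failed
  template:
    spec:
      restartPolicy: OnFailure    # restart the container in the same pod instead of creating new pods
      containers:
      - name: etl
        image: example.com/etl:latest   # placeholder image
```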

@cioc commented Aug 9, 2016

@thockin, @maximz re your Twitter convo: the docs (http://kubernetes.io/docs/user-guide/jobs/) imply that restartPolicy = Never is supported for jobs, which I found misleading.

  1. The first example shows that restartPolicy = Never
  2. The Pod Template section reads, "Only a RestartPolicy equal to Never or OnFailure are allowed."
  3. The Handling Pod and Container failures section reads, "Therefore, your program needs to handle the case when it is restarted locally, or else specify .spec.template.containers[].restartPolicy = "Never""

If restartPolicy = Never shouldn't be allowed and the goal here is that jobs run until completion, then the docs need to change along with the feature request of max failures / retries.
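
For reference, the first example in those docs shows a Job roughly like the following (reproduced approximately from memory, not verbatim), with `restartPolicy: Never` in the pod template:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never     # the restart policy the docs show for Jobs
```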

@soltysh soltysh self-assigned this Aug 12, 2016
@soltysh (Contributor) commented Aug 12, 2016

@erictune and I authored Jobs. I'll be out for the next two weeks, but here are my quick thoughts. Currently restartPolicy hardly applies to Jobs at all; it's just passed along to the kubelet when creating pods. To address your concern we'd have to come up with a policy on the Job itself that lets you express how the job should react to failures. Overall, I'd like to see such a feature. I'm not sure how Eric sees it, though.

@sstarcher commented:

This is a much-needed feature for us. We have several workloads we would like to run as Jobs, but we can't, because they would get restarted on failure, which is undesirable in a few of our situations.

@erictune (Member) commented:

I support adding a backoff policy to Jobs. I think exponential backoff with a max of like 5min would be fine. I don't think it would be a breaking change to introduce a backoff policy by default. I suspect some users might want fast retry, but we can wait on adding a field until they do.

I also support having a "max failed pods to keep around", which would cause the job controller to garbage collect some failed pods before creating new ones. At a minimum, keeping the first and the last failed pod would be useful for debugging. But keeping like 1000 failed pods is usually not useful. Especially if parallelism is 1. I'm not sure if we can change this to be a default, but we can definitely make it a knob.

I'd want to discuss it a bit more, but I'd also be open to a field whose meaning is "the job can move to a failure state after this many pod failures". We already have a way to fail after a certain amount of time.

I am not very enthusiastic about a "max pod failures before job fails" feature. In particular, I don't know that we can easily guarantee that a job only ever tries one time.

@erictune (Member) commented:

In this comment: #24533 (comment) Brian Grant says there should be backoff with no knob.

@erictune (Member) commented Sep 1, 2016

A default pods object quota per namespace would also help protect the apiserver from ending up with too many jobs. Users could always raise their quota if needed.

@hanikesn commented Oct 14, 2016

As I understand it, the pod object quota only applies to non-terminal pods:

> The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if status.phase in (Failed, Succeeded) is true.

http://kubernetes.io/docs/admin/resourcequota/

And currently there's no way to limit the total number of pods. But I guess this would be another issue.
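
For illustration, the kind of per-namespace quota @erictune suggested would be a ResourceQuota roughly like this sketch (namespace and limit are placeholder values), keeping in mind it only counts pods in a non-terminal state:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-quota          # placeholder name
  namespace: workflows     # placeholder namespace
spec:
  hard:
    pods: "100"            # caps non-terminal pods; Failed/Succeeded pods are not counted
```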

@soltysh (Contributor) commented Oct 15, 2016

> And currently there's no way to limit the total number of pods. But I guess this would be another issue.

The completed pods should be cleared by the gc at some point in time.

@diegodelemos commented:

What is the current state of this issue? We are really interested in having this behaviour: we are going to use Kubernetes to execute the steps of a workflow, and if a certain step fails, we would expect the job to stop and an option to be available to retrieve a log explaining the failure.

Also, a question for @erictune regarding his comment about jobs and the apiserver: what would be considered too many jobs for the apiserver? We are basing the execution of the steps of several workflows on Kubernetes.

@soltysh (Contributor) commented Oct 20, 2016

@diegodelemos you're welcome to submit a patch with an appropriate fix at any time :)

@Yancey1989 (Contributor) commented:

Hi @diegodelemos @soltysh, what's the status of this issue? We also have a strong need for it.

@soltysh (Contributor) commented Jan 4, 2017

@Yancey1989 you're welcome to submit a patch as well 😃

@bgrant0607 (Member) commented:

@soltysh Are you working on this?

cc @janetkuo

@soltysh (Contributor) commented Mar 1, 2017

@bgrant0607 I wish, but I'm out of time :( Somebody recently opened #41451, but that is not at all what we want. I asked the original submitter to discuss the proposed solution here, but I haven't heard from him since.

@lukasheinrich commented:

Hi @soltysh @nickschuch, a number of people on our side (CERN) are also interested in this (@diegodelemos @rochaporto), and we'd be interested in helping.

@nickschuch commented:

No worries, I've blocked out a couple of hours for tomorrow; I'll take a first crack at the proposal and go from there.

@lukasheinrich that would be awesome!

@soltysh (Contributor) commented Apr 6, 2017

@lukasheinrich awesome, the more people give feedback now, while we're shaping the design, the better!

@soltysh (Contributor) commented Apr 26, 2017

I've created this proposal to address the issue: kubernetes/community#583

@sdminonne (Contributor) commented:

@soltysh any chance of getting this implemented? It's starting to bite us a bit too often :)
so... +1

@soltysh (Contributor) commented May 18, 2017

I'll do my best, but I can't promise anything, because I'm traveling this week and next.

@nickschuch commented:

We are still waiting for this proposal to be agreed upon, right? I'm happy to write the code (I've written some already), but I'm not very familiar with the procedures, sorry.

@soltysh (Contributor) commented May 19, 2017

@sdminonne it looks like we're not going to make it :( sorry

@PeiKevin commented:

What's the status of this issue? I really want to use this feature. When I run a job, I just want to know whether it succeeded or failed, with no further retries; if it keeps retrying, it will just keep failing.

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 28, 2017
Automatic merge from submit-queue

Backoff policy and failed pod limit

This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964.

@erictune I've mentioned this to you during last sig-apps, ptal
@kubernetes/sig-apps-feature-requests ptal
@lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this
@soltysh (Contributor) commented Aug 30, 2017

API PR: #48075
Controller PR: #51153

k8s-github-robot pushed a commit that referenced this issue Sep 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

[API] Feature/job failure policy

**What this PR does / why we need it**: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583

**Which issue this PR fixes**: 
fixes #27997, fixes #30243

**Special notes for your reviewer**:
This is a WIP PR, I updated the api batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController.

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```
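
As a rough sketch of how the new knob is meant to be used (job name and image below are placeholders), a Job that gives up after a few pod failures would look something like:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                   # placeholder name
spec:
  backoffLimit: 3                 # number of retries before the Job is marked as failed
  activeDeadlineSeconds: 3600     # optional: also cap the Job's total runtime
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: etl
        image: example.com/etl:latest   # placeholder image
```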
justaugustus pushed a commit to justaugustus/enhancements that referenced this issue Sep 3, 2018
jgehrcke added a commit to jgehrcke/website that referenced this issue Jan 10, 2020
Sometimes, as it happened to me, a Pod's `restartPolicy` 
is mistakenly taken as the corresponding Job's restart policy.

That was concluded before, here:
https://github.com/kubernetes/community/pull/583/files

The confusion happened here:
kubernetes/kubernetes#30243
kubernetes/kubernetes#43964

And here:
jaegertracing/jaeger-kubernetes#32

This commit tries to clarify that there is no `restartPolicy` for
the job itself, and that using either of `backoffLimit` and
`activeDeadlineSeconds` may result in permanent failure.
k8s-ci-robot pushed a commit to kubernetes/website that referenced this issue Jan 15, 2020
wawa0210 pushed a commit to wawa0210/website that referenced this issue Mar 2, 2020
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Nov 30, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021