Job: Failure Threshold #41451
Conversation
Hi @nickschuch. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.
[APPROVALNOTIFIER] This PR is NOT APPROVED
The following people have approved this PR: nickschuch
Needs approval from an approver in each of these OWNERS files:
We suggest the following people:
@nickschuch PR needs rebase
@kubernetes/sig-apps-feature-requests @kubernetes/sig-apps-pr-reviews
@erictune in case there was an issue that discussed this previously (I couldn't find one from our original job threads, but my memory isn't what it used to be)
@smarterclayton you're asking about this #30243
Thank you. I agree with Eric's concern that "failures=1" is deceptive and we do not guarantee that. That's my primary concern with this: it needs to be obvious that these are best effort.
I haven't looked into this PR yet. I'm reserving tomorrow morning for a thorough review.
// to finish once the threshold is met.
// More info: http://kubernetes.io/docs/user-guide/jobs
// +optional
FailureThreshold *int32 `json:"failureThreshold,omitempty" protobuf:"varint,2,opt,name=failureThreshold"`
Per this and this I'm imagining this allowing you to set different policies, to start with:
- count failures from the beginning, in other words I don't care when the failures started, but as soon as they hit X we can fail the job
- count failures from a certain point (probably configurable, e.g. after X successes), and Y failures after that fail the job
- be able to specify whether to clear the counter if a successful run happens in the meantime (see the sketch below)
@erictune we've never actually specified how this could be implemented, do you have something in mind for this, or would what I've mentioned be a good place to start?
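To make the options above concrete, here is a minimal sketch of what such a policy could look like on a Job spec. The failurePolicy block and its sub-fields (threshold, countAfterSuccesses, resetOnSuccess) are hypothetical names used for illustration only; nothing here is an agreed API.
apiVersion: batch/v1
kind: Job
metadata:
  name: policy-sketch
spec:
  # Hypothetical policy block illustrating the options above; field names are not agreed on.
  failurePolicy:
    threshold: 5            # fail the Job once 5 pod failures have been observed
    countAfterSuccesses: 3  # only start counting failures after 3 pods have succeeded
    resetOnSuccess: true    # clear the failure counter whenever a pod succeeds
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "exit 1"]  # always fails, to exercise the policy
Whether this lives as a nested failurePolicy object or as flat fields on JobSpec is exactly the kind of decision that should come out of the discussion in #30243.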
I need a bit more detail around this. I'm happy to discuss this in any venue; I thought we were waiting on some input from @erictune.
Let's have a discussion about possible solutions in the original issue. This is an API change and it requires a bit more thought before we proceed.
@nickschuch let's discuss possible approaches to this in #30243, where others have already spoken up. Once we have general agreement we can proceed with the actual implementation. The current solution with just a threshold is definitely not sufficient.
The proper solution is being discussed in kubernetes/community#583, so I'm closing this one for now. Please open a new PR with whatever we agree on in the proposal.
What this PR does / why we need it:
The Job controller's current implementation keeps recreating a pod as it fails, to ensure that the job runs to completion.
This isn't always ideal; sometimes we want our build to fail and be recreated at a much later date (or not at all) by a higher-level API, e.g. CronJobs.
My solution: implement a "Failure Threshold".
This means that a Job will be allowed X number of failures before it is stopped entirely. This becomes really handy when using the CronJob API, where you know that if the job fails you will get a new one at a later date.
Which issue this PR fixes:
kubectl run --restart=Never restarts (creates a Job) #24533
Special notes for your reviewer:
A Job with a Failure Threshold can be created with the following:
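A minimal sketch of such a manifest, assuming the failureThreshold field proposed in this PR (the name, image, and threshold value are illustrative):
apiVersion: batch/v1
kind: Job
metadata:
  name: flaky-task
spec:
  # failureThreshold is the field added by this PR: allow up to 3 pod failures
  # before the Job is stopped entirely.
  failureThreshold: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "exit 1"]  # always fails, to exercise the threshold
Paired with a CronJob, the Job above would give up after three failed pods and simply wait for the next scheduled run.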
Release note: