Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job: Failure Threshold #41451

Closed
wants to merge 2 commits into from
Closed

Job: Failure Threshold #41451

wants to merge 2 commits into from

Conversation

nickschuch
Copy link

What this PR does / why we need it:

Jobs current implementation will keep recreating a pod as it fails to ensure that the job runs to completion.

This isn't always the ideal, sometimes we want our build to fail and be recreated at a much later date (or not at all) by a higher level API eg. CronJobs.

My solution, implement a "Failure Threshold".

This means that jobs will be allowed to have X number of failures before the Job is stopped entirely. This becomes really handy when using the CronJobs API and you know if the job fails you will get a new job at a later date.

Which issue this PR fixes:

Special notes for your reviewer:

A Job with a Failure Threshold can be created with the following:

apiVersion: batch/v1
kind: Job
metadata:
  name: threshold-test
spec:
  failureThreshold: 2
  template:
    metadata:
      name: threshold-test
    spec:
      containers:
      - name: failure
        image: alpine
        command: ["exit",  "1"]

Release note:

Adds failureThreshold field to Jobs. Stops Job from running after X failed executions.

@k8s-ci-robot
Copy link
Contributor

Hi @nickschuch. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 15, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: nickschuch

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @smarterclayton
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

@nickschuch PR needs rebase

@k8s-reviewable
Copy link

This change is Reviewable

@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Feb 15, 2017
@smarterclayton smarterclayton assigned soltysh and unassigned eparis Feb 15, 2017
@smarterclayton
Copy link
Contributor

@kubernetes/sig-apps-feature-requests @kubernetes/sig-apps-pr-reviews

@smarterclayton
Copy link
Contributor

@erictune in case there was an issue that discussed this previously (I couldn't find one from our original job threads, but my memory isn't what it used to be)

@soltysh
Copy link
Contributor

soltysh commented Feb 15, 2017

@smarterclayton you're asking about this #30243

@smarterclayton
Copy link
Contributor

smarterclayton commented Feb 15, 2017 via email

@soltysh
Copy link
Contributor

soltysh commented Feb 15, 2017

I haven't looked into this PR, yet. I'm reserving tomorrow's morning for a thorough review.

// to finish once the threshold is met.
// More info: http://kubernetes.io/docs/user-guide/jobs
// +optional
FailureThreshold *int32 `json:"failureThreshold,omitempty" protobuf:"varint,2,opt,name=failureThreshold"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per this and this I'm imaging this allowing you set different policies, to start with:

  1. count failures from beginning, iow. I don't care when the failures started, but as soon as they hit X we can fail a job
  2. count failures from certain point (probably configurable, eg. X successful), and Y failures after that fails the job
  3. be able to specify whether to clear the counter if successful run happens in the mean time

@erictune we've never actually specified how this could be implemented, do you have something on my your mind for this or what I've mentioned could be a good place to start with?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need a bit more detail around this, Im happy to discuss this in any avenue, I thought we were waiting on some input from @erictune .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a discussion about possible solutions in the original issue. This is API change and it requires a bit more thought before we do so.

@soltysh
Copy link
Contributor

soltysh commented Feb 16, 2017

@nickschuch let's discuss possible approaches to this in #30243, where others already spoke up. Once we have a general agreement we can proceed with actual implementation. The current solution with just a threshold is definitely not sufficient.

@soltysh
Copy link
Contributor

soltysh commented Apr 26, 2017

Discussing the proper solution in kubernetes/community#583, I'm closing this one for now. Please open a new one with what we agree on in the proposal.

@soltysh soltysh closed this Apr 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants