[API] Feature/job failure policy #48075

Merged

Conversation

clamoriniere1A
Contributor

@clamoriniere1A clamoriniere1A commented Jun 26, 2017

What this PR does / why we need it: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583

Which issue this PR fixes:
fixes #27997, fixes #30243

Special notes for your reviewer:
This is a WIP PR. I updated the batchv1.JobSpec API to prepare for the backoff policy implementation in the JobController.

Release note:

Add backoff policy and failed pod limit for a job

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 26, 2017
@k8s-ci-robot
Contributor

Hi @clamoriniere1A. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 26, 2017
@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note-label-needed labels Jun 26, 2017
@0xmichalis
Contributor

@kubernetes/sig-apps-api-reviews

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jun 26, 2017
@clamoriniere1A
Contributor Author

CLA signed

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2017
@spiffxp
Member

spiffxp commented Jun 30, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 30, 2017
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy branch from cc2c9bd to 8960d0c Compare July 4, 2017 14:37
@resouer resouer requested review from sttts and caesarxuchao July 17, 2017 13:25
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2017
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy branch from 8960d0c to 9c9ef60 Compare July 20, 2017 12:37
@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 20, 2017
@clamoriniere1A
Contributor Author

clamoriniere1A commented Jul 20, 2017

Hi @soltysh

This PR is linked to the proposal “Backoff policy and failed pod limit”: kubernetes/community#583

I started updating the jobcontroller.syncJob() logic, and I have 2 questions regarding the intended behaviour:
1) Do we need to delete all the active running Pods once the BackoffLimit has been reached, as is done when the Job exceeds the “Job Deadline”?
(code link: https://github.com/clamoriniere1A/kubernetes/blob/feature/job_failure_policy/pkg/controller/job/jobcontroller.go#L521 )

2) To properly handle the deletion of failed Pods when Job.Spec.FailedPodsLimit is reached, I needed to add a new counter, FailedAndDeleted, to the Job.Status struct ( https://github.com/clamoriniere1A/kubernetes/blob/feature/job_failure_policy/pkg/apis/batch/types.go#L179 ). The new counter lets me calculate the total number of failed Pods: “Failed and Active” + “Failed and deleted”.
Do you think this is a good approach?
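
The counter arithmetic raised in question 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `jobStatusSketch`, `FailedPresent`, `totalFailed`, and `recordDeletion` are hypothetical names, and only `FailedAndDeleted` comes from the comment above. The point it shows is that moving Pods from "still present" to "deleted" leaves the total failure count unchanged.

```go
package main

import "fmt"

// jobStatusSketch is a hypothetical stand-in for the counters discussed
// above; it is NOT the real batch JobStatus type.
type jobStatusSketch struct {
	FailedPresent    int32 // assumed: failed Pods that still exist in the cluster
	FailedAndDeleted int32 // failed Pods the controller has already deleted
}

// totalFailed returns every failure seen so far, including deleted Pods.
func totalFailed(s jobStatusSketch) int32 {
	return s.FailedPresent + s.FailedAndDeleted
}

// recordDeletion moves n failed Pods from "present" to "deleted".
// n is clamped to [0, FailedPresent] so neither counter goes negative.
func recordDeletion(s jobStatusSketch, n int32) jobStatusSketch {
	if n < 0 {
		n = 0
	}
	if n > s.FailedPresent {
		n = s.FailedPresent
	}
	s.FailedPresent -= n
	s.FailedAndDeleted += n
	return s
}

func main() {
	before := jobStatusSketch{FailedPresent: 5}
	after := recordDeletion(before, 3)
	// The total is the same before and after the deletion.
	fmt.Println(totalFailed(before), totalFailed(after))
	fmt.Println(after.FailedPresent, after.FailedAndDeleted)
}
```

Without the second counter, deleting failed Pods would make the observable failure count shrink, which is the problem the comment describes.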

@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy branch 2 times, most recently from af2642f to ee9db60 Compare July 20, 2017 15:48
@smarterclayton
Contributor

Based on reading of the proposal and conventions this is API approved. The PR that enables failedpodslimit will still need API review, although I confirmed it matches the proposal.

/approve

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2017
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@soltysh soltysh added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Aug 31, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 1, 2017
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy branch from 37bf009 to 044edac Compare September 1, 2017 18:56
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 1, 2017
@soltysh soltysh removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Sep 1, 2017
Add new fields in api v1.JobSpec object for backoff policy
- BackoffLimit
- FailedPodsLimit

fixes: kubernetes/community#583

This commit contains the new version of the generated API files linked to the v1.JobSpec modifications in the previous commit, after "make update".
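
As a rough illustration of how the two fields named in the first commit might be consumed, here is a minimal sketch. It assumes both fields are optional `*int32` pointers and that a limit counts as reached when the observed count is greater than or equal to it; neither assumption is confirmed by this thread. `jobSpecSketch` and `limitReached` are hypothetical names, not part of the Kubernetes API.

```go
package main

import "fmt"

// jobSpecSketch is a hypothetical, trimmed-down stand-in for v1.JobSpec.
// Only the two field names come from the commit message; the pointer
// types are an assumption.
type jobSpecSketch struct {
	BackoffLimit    *int32
	FailedPodsLimit *int32
}

// limitReached reports whether count has reached an optional limit.
// A nil limit means "not set", so it is never considered reached.
func limitReached(limit *int32, count int32) bool {
	return limit != nil && count >= *limit
}

func main() {
	backoff := int32(3)
	spec := jobSpecSketch{BackoffLimit: &backoff}

	fmt.Println(limitReached(spec.BackoffLimit, 2))      // below the limit
	fmt.Println(limitReached(spec.BackoffLimit, 3))      // limit reached
	fmt.Println(limitReached(spec.FailedPodsLimit, 100)) // unset limit
}
```

Using pointers lets a caller distinguish "limit not set" from "limit set to zero", which a plain integer field cannot express.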
@clamoriniere1A clamoriniere1A force-pushed the feature/job_failure_policy branch from 044edac to 2286936 Compare September 1, 2017 19:01
@soltysh
Contributor

soltysh commented Sep 1, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 1, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere1A, smarterclayton, soltysh

Associated issue: 27997

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@soltysh
Contributor

soltysh commented Sep 1, 2017

/test pull-kubernetes-kubemark-e2e-gce

@soltysh
Contributor

soltysh commented Sep 2, 2017

/retest

@CaoShuFeng
Contributor

/test pull-kubernetes-verify

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Contributor

@clamoriniere1A: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kops-aws | 2286936 | link | `/test pull-kubernetes-e2e-kops-aws` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920)

@k8s-github-robot k8s-github-robot merged commit 73ed961 into kubernetes:master Sep 3, 2017
@0xmichalis
Contributor

Is there a PR with the controller changes?

@clamoriniere
Contributor

@Kargakis, yes #51153 contains controller changes

@0xmichalis
Contributor

merci

k8s-github-robot pushed a commit that referenced this pull request Sep 3, 2017
…icy_controller

Automatic merge from submit-queue

Job failure policy controller support

**What this PR does / why we need it**:
Start implementing the support of the "Backoff policy and failed pod limit" in the ```JobController```  defined in kubernetes/community#583.
This PR depends on a previous PR #48075  that updates the K8s API types.

TODO: 
* [X] Implement ```JobSpec.BackoffLimit``` support
* [x] Rebase when #48075 has been merged.
* [X] Implement end2end tests



implements kubernetes/community#583

**Special notes for your reviewer**:

**Release note**:
```release-note
Add backoff policy and failed pod limit for a job
```
Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/api-change (Categorizes issue or PR as related to adding, removing, or otherwise changing an API)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

Maximum number of failures or failure backoff policy for Jobs
Case report: Job somehow made 84k pods