Retriable and non-retriable Pod failures for Jobs #3329
Closed
Description
Enhancement Description
- One-line enhancement description (can be used as a release note): An API to influence retries based on exit codes and/or pod deletion reasons.
- Kubernetes Enhancement Proposal: https://git.k8s.io/enhancements/keps/sig-apps/3329-retriable-and-non-retriable-failures
- Discussion Link: RFE: ability to define special exit code to terminate existing job kubernetes#17244
- Primary contact (assignee): @alculquicondor
- Responsible SIGs: apps, api-machinery, scheduling
- Enhancement target (which target equals to which milestone):
  - Alpha release target (x.y): 1.25
  - Beta release target (x.y): 1.26
  - Stable release target (x.y): 1.31
- Alpha
  - KEP (k/enhancements) update PR(s):
    - KEP-3329 Add KEP for Retriable and non-retriable Pod failures for Jobs #3374
    - Update KEP-3329 "Retriable and non-retriable Pod failures for Jobs" #3438
    - Additional update for KEP-3329 "Retriable and non-retriable Pod failures for Jobs" #3447
    - Updates to KEP-3329 "Retriable and non-retriable Pod failures for Jobs" #3452
  - Code (k/k) update PR(s):
    - Refactor gc_controller to do not use the deletePod stub kubernetes#111070
    - Refactor taint_manager to do not use getPod and getNode stubs kubernetes#111084
    - Add integration test for podgc kubernetes#111091
    - Append new pod conditions when deleting pods to indicate the reason for pod deletion kubernetes#110959
    - Support handling of pod failures with respect to the configured rules kubernetes#111113
    - Add worker to clean up stale DisruptionTarget condition kubernetes#111475
  - Docs (k/website) update PR(s):
    - Add docs for KEP-3329 Retriable and non-retriable Pod failures for Jobs website#35219
- Beta
  - KEP (k/enhancements) update PR(s):
    - Update KEP-3329 "Retriable and non-retriable Pod failures for Jobs" for Beta #3463
    - Update for "Retriable and non-retriable Pod failures for Jobs" #3646
    - Testgrid links to e2e tests for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3769
    - Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757
    - v1.28
    - v1.30
  - Code (k/k) update PR(s):
    - Add pod disruption conditions for kubelet-initiated failures kubernetes#112360
    - Extend metrics with the new labels kubernetes#113324
    - Use SSA to add pod failure conditions kubernetes#113304
    - Enable the "Retriable and non-retriable pod failures for jobs" feature into beta kubernetes#113360
    - Add e2e test for job pod failure policy used to match pod disruption kubernetes#113812
    - Fix disruption controller permissions to allow patching pod's status kubernetes#113580
    - Fix match onExitCodes when Pod is not terminated kubernetes#113856
    - Wait for Pods to finish before considering Failed in Job kubernetes#113860
    - Add e2e test to ignore failures with 137 exit code kubernetes#113927
    - Fix clearing of rate-limiter for the queue of checks for cleaning stale pod disruption conditions kubernetes#114770
    - Adjust DisruptionTarget condition message to do not include preemptor pod metadata kubernetes#114914
    - PodGC should not add DisruptionTarget condition for pods which are in terminal phase kubernetes#115056
    - Give terminal phase correctly to all pods that will not be restarted kubernetes#115331
    - API-initiated eviction: handle deleteOptions correctly kubernetes#116554
    - Add DisruptionTarget condition when preempting for critical pod kubernetes#117586
    - Job: create replacement pods only after terminated kubernetes#117015
    - Use Patch instead of SSA for Pod Disruption condition kubernetes#121103
  - Docs (k/website) update(s):
    - Promote "Retriable and non-retriable pod failures for Jobs" to Beta website#37242
    - Document for "Wait for Pods to finish before considering Failed in Job" website#38040
    - Extend documentation on PodGC focusing on PodDisruptionConditions enabled website#38042
    - Update docs for KEP3329: "Retriable and non-retriable Pod failures for jobs website#39809
    - Add information about PodReplacementPolicy in Job API website#41745
- Stable
  - KEP (k/enhancements) update PR(s):
    - Graduate Job Pod Failure Policy to stable #4661
  - Code (k/k) update PR(s):
    - scheduler: Test that the DisruptionTarget condition is added at preemption time kubernetes#125533
    - Graduate JobPodFailurePolicy to stable kubernetes#125442
    - Graduate PodDisruptionConditions to stable kubernetes#125461
    - Promote JobPodFailurePolicy and PodDisruptionConditions e2e tests to Conformance kubernetes#125482
    - Use omitempty for optional fields in Job Pod Failure Policy kubernetes#126046
    - Fix a scheduler preemption issue where the victim isn't properly patched, leading to preemption not functioning as expected kubernetes#126644
    - clean up codes after PodDisruptionConditions was promoted to GA kubernetes#125994
    - cleanup after JobPodFailurePolicy is promoted to GA kubernetes#126102
  - Docs (k/website) update(s):
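For readers new to the feature, the one-line description (retries influenced by exit codes and/or pod deletion reasons) can be illustrated with a small model. This is a minimal sketch, not the Job controller's code: `Rule` and `evaluate` are hypothetical names, the fields only loosely mirror `onExitCodes` and `onPodConditions` from the Job `podFailurePolicy` API, and the sketch assumes a single container, a pod condition status of "True", and that the first matching rule wins, with unmatched failures counted toward the backoff limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical, simplified stand-in for one pod failure policy rule."""
    action: str                          # "FailJob", "Ignore", or "Count"
    exit_codes: Optional[set] = None     # loosely models onExitCodes.values
    operator: str = "In"                 # loosely models onExitCodes.operator ("In" / "NotIn")
    pod_condition: Optional[str] = None  # loosely models onPodConditions[].type


def evaluate(rules, exit_code=None, conditions=()):
    """Return the action of the first matching rule, or "Count" if none match."""
    for rule in rules:
        # Exit-code rules only consider a terminated container with a non-zero code.
        if rule.exit_codes is not None and exit_code not in (None, 0):
            in_set = exit_code in rule.exit_codes
            if in_set == (rule.operator == "In"):
                return rule.action
        # Condition rules match when the pod carries the named condition.
        if rule.pod_condition is not None and rule.pod_condition in conditions:
            return rule.action
    # Assumed default: the failure counts toward the Job's backoff limit.
    return "Count"


rules = [
    Rule(action="FailJob", exit_codes={42}),                  # non-retriable software bug
    Rule(action="Ignore", pod_condition="DisruptionTarget"),  # retriable disruption
]

print(evaluate(rules, exit_code=42))
print(evaluate(rules, exit_code=137, conditions=("DisruptionTarget",)))
print(evaluate(rules, exit_code=1))
```

In this model, exit code 42 fails the whole Job immediately, a pod carrying the `DisruptionTarget` condition is retried without consuming the backoff limit, and any other failure is counted as usual.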
Metadata
Assignees
Labels
- Denotes that an issue has been opted in to a release
- Categorizes an issue or PR as relevant to SIG API Machinery.
- Categorizes an issue or PR as relevant to SIG Apps.
- Categorizes an issue or PR as relevant to SIG Node.
- Categorizes an issue or PR as relevant to SIG Scheduling.
- Denotes an issue tracking an enhancement targeted for Stable/GA status
- Denotes an enhancement issue is actively being tracked by the Release Team
- Categorizes an issue or PR as relevant to WG Batch.