-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart job on failure for Always,OnFailure Policy #1572
Conversation
Hi @georgkaleido. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
Pull Request Test Coverage Report for Build 2467493465
💛 - Coveralls |
LGTM |
I think so, it seems that the patch should be applied to all the operators. /cc @kubeflow/wg-training-leads |
+1 |
@georgkaleido do you have an image with this fix applied? I'm facing a similar issue and would love to give it a try. |
Can you do a rebase? |
I've been testing this PR on preemptible GKE nodes and it does bring back failed pods from preempt events (simulated using |
@georgkaleido Can you please rebase the PR ? We would like this get merged for next release |
@johnugeorge do you know when is the next release scheduled for? Just asking as I need to decide whether to wait for the next release or run this fix on my own until it's made available. |
June 15th is the feature freeze and we can cut RC release within that week if things go smoothly. If the PR author doesn't respond in this week, can you create a PR ? |
👍
Yup, I can do that. |
@johnugeorge if you don't mind I can take over the work done here and send a new PR? Is that ok? |
@johnugeorge see #1605 |
Hi sry, will rebase ASAP |
Fixes kubeflow#1570 Together with kubeflow/common#189 There can be pod level failures caused by the system, which would perviously caused the entire job to fail on all policies except ExitCode.
@johnugeorge Done |
sry for the late reponse: there is an image to test at kaleidoai/kubeflow-training-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: georgkaleido, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
There can be pod level failures caused by the system, which would previously caused the entire job to fail on all policies except ExitCode. See also #1570
Works together with kubeflow/common#189
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #1570
Checklist: