Fixes kubeflow#1570
Together with kubeflow/common#189
There can be pod-level failures caused by the system, which would previously cause the entire job to fail under all restart policies except ExitCode.
When a PyTorchJob with `restartPolicy=OnFailure` is created, it does not handle all errors correctly. If a worker pod ends up in `Status=Failed` due to a graceful node shutdown, the pod is not recreated; instead, the PyTorchJob fails in this block: training-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go, lines 431 to 443 in 8c43231.
I think a fix would be to recreate the pod if the policy is OnFailure here:
https://github.com/kubeflow/common/blob/2b40c8f8991e302920ee5536c0ad49dec6724c66/pkg/controller.v1/common/pod.go#L350-L360
I would go ahead and create a PR for this unless there is another way to go about this.
Sample job:
This can be tested with preemptible nodes on GKE.