
PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570

Closed
georgkaleido opened this issue Apr 6, 2022 · 0 comments · Fixed by #1572

@georgkaleido (Contributor) commented:

When a PyTorchJob is created with restartPolicy=OnFailure, not all pod failures are handled correctly.

If a worker pod ends up in Status=Failed due to a graceful node shutdown, it does not get recreated; instead the PyTorchJob fails in this block:

} else {
    msg := fmt.Sprintf("PyTorchJob %s is failed because %d %s replica(s) failed.", pytorchjob.Name, failed, rtype)
    r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobFailedReason, msg)
    if pytorchjob.Status.CompletionTime == nil {
        now := metav1.Now()
        pytorchjob.Status.CompletionTime = &now
    }
    err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)
    if err != nil {
        commonutil.LoggerForJob(pytorchjob).Infof("Append job condition error: %v", err)
        return err
    }
    trainingoperatorcommon.FailedJobsCounterInc(pytorchjob.Namespace, pytorchv1.FrameworkName)

I think a fix would be to recreate the pod if policy is OnFailure here:
https://github.com/kubeflow/common/blob/2b40c8f8991e302920ee5536c0ad49dec6724c66/pkg/controller.v1/common/pod.go#L350-L360
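
Roughly what I have in mind (just a sketch, not a final patch — identifiers such as spec, pod, jc, logger, and runtimeObject are taken from the surrounding ReconcilePods code and may differ in the actual PR):

if pod.Status.Phase == corev1.PodFailed &&
    (spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
        spec.RestartPolicy == commonv1.RestartPolicyAlways) {
    // System-level failures (e.g. a graceful node shutdown) leave the pod in
    // phase Failed; deleting it here lets the controller recreate it instead
    // of counting it towards a failed job.
    logger.Infof("Pod %s.%s failed, deleting it so it gets recreated", pod.Namespace, pod.Name)
    if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, runtimeObject); err != nil {
        return err
    }
}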

I would go ahead and create a PR for this unless there is a better way to go about it.

Sample job:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: restart-failure
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: python
              tty: true
              stdin: true

This can be tested with preemptible nodes on GKE.

georgkaleido added a commit to georgkaleido/training-operator that referenced this issue Apr 8, 2022
georgkaleido added a commit to georgkaleido/training-operator that referenced this issue Jun 9, 2022
google-oss-prow bot pushed a commit that referenced this issue Jun 9, 2022

Commit message (identical for all three):

Fixes #1570
Together with kubeflow/common#189

There can be pod-level failures caused by the system, which previously caused the entire job to fail under all policies except ExitCode.