
PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570

Closed
georgkaleido opened this issue Apr 6, 2022 · 0 comments · Fixed by #1572

@georgkaleido (Contributor) commented:

When a PyTorchJob is created with restartPolicy=OnFailure, not all pod failures are handled correctly.

If a worker pod ends up in Status=Failed due to a graceful node shutdown, it does not get recreated; instead the PyTorchJob fails in this block:

} else {
    msg := fmt.Sprintf("PyTorchJob %s is failed because %d %s replica(s) failed.", pytorchjob.Name, failed, rtype)
    r.Recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobFailedReason, msg)
    if pytorchjob.Status.CompletionTime == nil {
        now := metav1.Now()
        pytorchjob.Status.CompletionTime = &now
    }
    err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobFailed, commonutil.JobFailedReason, msg)
    if err != nil {
        commonutil.LoggerForJob(pytorchjob).Infof("Append job condition error: %v", err)
        return err
    }
    trainingoperatorcommon.FailedJobsCounterInc(pytorchjob.Namespace, pytorchv1.FrameworkName)

I think a fix would be to recreate the pod if policy is OnFailure here:
https://github.com/kubeflow/common/blob/2b40c8f8991e302920ee5536c0ad49dec6724c66/pkg/controller.v1/common/pod.go#L350-L360
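
Roughly what I have in mind (just a sketch, not a final patch — identifiers such as spec, pod, jc, logger, and runtimeObject are taken from the surrounding ReconcilePods code and may differ in the actual PR):

if pod.Status.Phase == corev1.PodFailed &&
    (spec.RestartPolicy == commonv1.RestartPolicyOnFailure ||
        spec.RestartPolicy == commonv1.RestartPolicyAlways) {
    // System-level failures (e.g. a graceful node shutdown) leave the pod in
    // phase Failed; deleting it here lets the controller recreate it instead
    // of counting it towards a failed job.
    logger.Infof("Pod %s.%s failed, deleting it so it gets recreated", pod.Namespace, pod.Name)
    if err := jc.PodControl.DeletePod(pod.Namespace, pod.Name, runtimeObject); err != nil {
        return err
    }
}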

I would go ahead and create a PR for this unless there is a better way to go about it.

Sample job:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: restart-failure
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: python
              tty: true
              stdin: true

This can be tested with preemptible nodes on GKE.

georgkaleido added a commit to georgkaleido/training-operator that referenced this issue Apr 8, 2022
georgkaleido added a commit to georgkaleido/training-operator that referenced this issue Jun 9, 2022
google-oss-prow bot pushed a commit that referenced this issue Jun 9, 2022

Commit message (identical for all three):

Fixes #1570
Together with kubeflow/common#189

There can be pod-level failures caused by the system, which previously caused the entire job to fail under all policies except ExitCode.