On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.
I have set restartPolicy: OnFailure for the workers, evaluator and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
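For reference, the relevant part of the TFJob spec looks roughly like the sketch below (I'm assuming the kubeflow.org/v1 API here; the job name, image, and replica counts are placeholders):

```yaml
apiVersion: kubeflow.org/v1            # assumed API version
kind: TFJob
metadata:
  name: myjob                          # placeholder name
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
    Worker:
      replicas: 7                      # illustrative count
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
    # The Evaluator is configured the same way with restartPolicy: OnFailure;
    # the PS replicas are omitted from this sketch.
```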
It looks like some of the pods were restarted around the time of the preemption, but the job was ultimately marked failed with the following status:
```
Message:  TFJob myjob has failed because 1 Worker replica(s) failed.
Reason:   TFJobFailed
Status:   True
Type:     Failed
Replica Statuses:
  Chief:
  Evaluator:
    Active:  1
  PS:
    Active:  4
  Worker:
    Active:  6
    Failed:  1
```
Is there something that needs to be done to make TFJobs handle preempted nodes?