On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.
I have set restartPolicy: OnFailure for the workers, evaluator and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
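For reference, the relevant part of the TFJob spec looks roughly like the sketch below (I'm assuming the kubeflow.org/v1 API here; the job name, image, and replica counts are placeholders):

```yaml
apiVersion: kubeflow.org/v1            # assumed API version
kind: TFJob
metadata:
  name: myjob                          # placeholder name
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
    Worker:
      replicas: 7                      # illustrative count
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
    # The Evaluator is configured the same way with restartPolicy: OnFailure;
    # the PS replicas are omitted from this sketch.
```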
It looks like some of the pods were restarted around the time of the preemption, but the job was ultimately marked failed with the following status:
```
Message:  TFJob myjob has failed because 1 Worker replica(s) failed.
Reason:   TFJobFailed
Status:   True
Type:     Failed
Replica Statuses:
  Chief:
  Evaluator:
    Active:  1
  PS:
    Active:  4
  Worker:
    Active:  6
    Failed:  1
```
Is there something that needs to be done to make TFJobs handle preempted nodes?