
Jobs failing when a node is preempted #999

Closed
@matthen

Description

On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.

I have set restartPolicy: OnFailure for the workers, evaluator, and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
It looks like some of the pods were restarted around the time of the preemption, but eventually the job was stopped with the following status:

  Message:               TFJob myjob has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Chief:
    Evaluator:
      Active:  1
    PS:
      Active:  4
    Worker:
      Active:  6
      Failed:  1

Is there something that needs to be done to make TFJobs handle preempted nodes?
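
For reference, this is roughly the shape of the TFJob spec in question. It is a minimal sketch: the apiVersion, image name, and replica counts are placeholders, and everything except the restartPolicy settings has been trimmed. As described above, only the workers, evaluator, and chief have restartPolicy: OnFailure set.

  apiVersion: kubeflow.org/v1beta1   # illustrative; depends on the installed operator version
  kind: TFJob
  metadata:
    name: myjob
  spec:
    tfReplicaSpecs:
      Chief:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/my-project/trainer:latest   # placeholder image
      Worker:
        replicas: 7                  # placeholder count
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/my-project/trainer:latest
      Evaluator:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/my-project/trainer:latest
      PS:
        replicas: 4                  # no restartPolicy set for PS replicas
        template:
          spec:
            containers:
              - name: tensorflow
                image: gcr.io/my-project/trainer:latest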
