Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail. #105

rmfan · 2021-10-29T17:23:08Z

Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM sent by Kubernetes when the AdaptDL scheduler decides to terminate the worker pods. However, if the training process is not running as process 1 (PID 1) in the container, it may not receive the SIGTERM, and checkpointing will not occur.

This means that the AdaptDL training process must be the main command run in the container (i.e., not wrapped in a shell command).

Won't work: /bin/sh -c "python3 adaptdl_training_code.py"

Will work: python3 adaptdl_training_code.py

See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
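One way to catch this misconfiguration early is to check the PID at startup. This is only a sketch: the function name `warn_if_not_pid1` is hypothetical, and it assumes the container does not share a process namespace with other containers (if it does, the training process will not be PID 1 even when launched directly).

```python
import os
import warnings


def warn_if_not_pid1():
    """Return True if this process is PID 1; otherwise warn and return False."""
    pid = os.getpid()
    if pid == 1:
        return True
    warnings.warn(
        f"Training process is running as PID {pid}, not PID 1; "
        "it may not receive SIGTERM from Kubernetes, so checkpointing "
        "on rescale may be skipped."
    )
    return False
```

Calling `warn_if_not_pid1()` at the top of the training script surfaces the problem in the pod logs instead of leaving it as a silently missing checkpoint.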
