Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM sent by Kubernetes when the AdaptDL scheduler decides to terminate the worker pods. However, if the training process is not running as PID 1 in the container, it may not receive the SIGTERM, and checkpointing will not occur.
This means that the AdaptDL training must be the main command run in the container (i.e., not wrapped in a shell command).
Won't work:
/bin/sh -c "python3 adaptdl_training_code.py"
Will work:
python3 adaptdl_training_code.py
See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
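For illustration, the checkpoint-on-SIGTERM pattern described above can be sketched roughly as follows. This is a minimal sketch and not AdaptDL's actual implementation: `install_handlers`, `train`, and the `save_checkpoint` callback are hypothetical names made up for this example. The handler only fires if the signal actually reaches the Python process, which is why the process needs to be the container's main command.

```python
import os
import signal
import threading

# Hypothetical flag: set by the signal handler, polled by the training loop.
_exit_requested = threading.Event()


def _request_exit(signum, frame):
    """Record that a termination signal arrived; avoid heavy work in the handler."""
    _exit_requested.set()


def install_handlers():
    """Register the handler for SIGTERM (pod termination) and SIGINT (Ctrl-C)."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _request_exit)


def train(num_steps, save_checkpoint):
    """Run up to num_steps steps; checkpoint and stop early once a signal was seen.

    save_checkpoint is a hypothetical callback that receives the current step.
    Returns the step at which training stopped.
    """
    for step in range(num_steps):
        if _exit_requested.is_set():
            save_checkpoint(step)
            return step
        # ... one training step would go here ...
    return num_steps
```

If this script is started through `/bin/sh -c "..."`, the shell may be PID 1 instead of Python, in which case `_request_exit` may never run and no checkpoint is saved before the pod is killed.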