Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail. #105

Open
rmfan opened this issue Oct 29, 2021 · 0 comments

rmfan (Collaborator) commented Oct 29, 2021

Right now we checkpoint for rescaling by installing a SIGINT/SIGTERM handler, which catches the SIGTERM that Kubernetes sends when the AdaptDL scheduler decides to terminate the worker pods. However, if the training process is not running as PID 1 in the container, it may not receive the SIGTERM, and checkpointing will not occur.
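For illustration, here is a minimal sketch of the handler-plus-checkpoint pattern described above. It is not AdaptDL's actual implementation: `install_handlers`, `running_as_pid_1`, `train`, and the `save_checkpoint` callback are hypothetical names made up for this example.

```python
import os
import signal

# Hypothetical flag for this sketch; AdaptDL keeps its own internal state.
_exit_requested = False


def _request_exit(signum, frame):
    """Signal handler: only record that a shutdown was requested."""
    global _exit_requested
    _exit_requested = True


def install_handlers():
    """Register the handler for SIGINT and SIGTERM (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _request_exit)


def running_as_pid_1():
    """True when this process is PID 1, i.e. the container's main command."""
    return os.getpid() == 1


def train(num_steps, save_checkpoint):
    """Run up to num_steps steps; checkpoint and stop early on a signal.

    Returns the step at which a checkpoint was written, or None if the
    loop finished without a shutdown request.
    """
    for step in range(num_steps):
        # ... one training step would go here ...
        if _exit_requested:
            save_checkpoint(step)
            return step
    return None
```

If the signal never reaches this process, `_exit_requested` stays `False` and no checkpoint is written, which is the failure described here. A check like `running_as_pid_1()` could be used to log a warning at startup.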

Consequently, the AdaptDL training process must be the main command run in the container (i.e., not wrapped in a shell command).

Won't work: /bin/sh -c "python3 adaptdl_training_code.py"

Will work: python3 adaptdl_training_code.py

See https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
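If some setup has to run before training, one possible workaround (an assumption on my part, not something AdaptDL provides) is to do the setup in a small Python launcher and then replace the launcher with the training process via `os.execvp`, so the training code keeps the launcher's PID. The names `build_training_argv` and `launch` below are hypothetical.

```python
import os
import sys


def build_training_argv(script, extra_args=()):
    """Build the argv for running the training script with this interpreter."""
    return [sys.executable, script, *extra_args]


def launch(script, extra_args=()):
    """Do any setup, then replace this process with the training process.

    os.execvp does not return on success: the current process image is
    replaced, so the training script runs under the same PID (PID 1 if
    this launcher was the container's main command).
    """
    # ... hypothetical setup steps (env vars, directories) would go here ...
    argv = build_training_argv(script, extra_args)
    os.execvp(argv[0], argv)


# Example use as the container command (not executed here):
#   launch("adaptdl_training_code.py", sys.argv[1:])
```

Because the launcher is replaced rather than kept as a parent, there is no intermediate process that could swallow the SIGTERM.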
