launcher/multinode_runner.py: mapping env variables #3372
Merged
launcher/multinode_runner.py: map env variables in the run command for the mpich runner.
Previously, launching deepspeed with mpich did not set the env variables that deepspeed relies on, such as "RANK", "LOCAL_RANK", "WORLD_SIZE" and "LOCAL_SIZE". Under mpich they are only available under different names, like "PMI_RANK".
This change therefore sets them through the -genv / -env arguments of mpirun. "-genv" sets env variables shared by all ranks, like "WORLD_SIZE", while "-env" sets rank-specific env variables like "RANK" and "LOCAL_RANK".
To demonstrate the change, below is an example of the resulting run command, using only 2 ranks:
[INFO] [runner.py:540:main] cmd = mpirun -genv PYTHONSTARTUP=/.../pythonstart -genv PYTHONPATH=/../ -genv MASTER_ADDR xxx -genv MASTER_PORT xxx -genv WORLD_SIZE 2 -genv LOCAL_SIZE 2 -n 1 -host xxx -env RANK 0 -env LOCAL_RANK 0 /../bin/python -u pretrain_gpt.py ... : -n 1 -host xx -env RANK 1 -env LOCAL_RANK 1 /../bin/python -u pretrain_gpt.py ...
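The mapping described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `build_mpich_cmd` and all of its parameters are hypothetical names, it assumes every node runs the same number of processes, and it passes each variable as a separate name and value after `-genv` / `-env`.

```python
import sys


def build_mpich_cmd(hosts, procs_per_node, master_addr, master_port,
                    script, script_args=(), exports=None,
                    python_exec=sys.executable):
    """Build an mpirun argument list (hypothetical helper, not the PR's code).

    Assumes each host in `hosts` runs exactly `procs_per_node` processes.
    """
    world_size = len(hosts) * procs_per_node
    cmd = ["mpirun"]

    # User-exported variables are shared by every rank, so they go in -genv.
    for name, value in (exports or {}).items():
        cmd += ["-genv", name, str(value)]

    # Variables with the same value on every rank also go in -genv.
    shared_env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
        "WORLD_SIZE": world_size,
        "LOCAL_SIZE": procs_per_node,
    }
    for name, value in shared_env.items():
        cmd += ["-genv", name, str(value)]

    # One colon-separated segment per rank, each with its own -env values.
    rank = 0
    for host in hosts:
        for local_rank in range(procs_per_node):
            if rank > 0:
                cmd.append(":")
            cmd += [
                "-n", "1",
                "-host", host,
                "-env", "RANK", str(rank),
                "-env", "LOCAL_RANK", str(local_rank),
                python_exec, "-u", script, *script_args,
            ]
            rank += 1
    return cmd


if __name__ == "__main__":
    # Placeholder host names and address, for illustration only.
    print(" ".join(build_mpich_cmd(["hostA", "hostB"], 1, "10.0.0.1", 29500,
                                   "pretrain_gpt.py")))
```

Emitting one `-n 1` segment per rank is what allows a different "RANK" and "LOCAL_RANK" to be attached to each process, while the values shared by all ranks are stated once up front.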