Add smddp dist backend option #579
Not sure how to handle the flake8 AttributeError.
#576 is merged; please rebase your code onto the latest main branch to avoid the lint issue.
    if backend == 'smddp':
        try:
            import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
        except ModuleNotFoundError as e:
            raise ModuleNotFoundError(
                'Please use an Amazon SageMaker DLC to access smdistributed: '
                'https://github.com/aws/deep-learning-containers/blob/master'
                '/available_images.md#sagemaker-framework-containers'
                '-sm-support-only') from e
Hi! I have no experience with Amazon SageMaker, but is smddp only compatible with the `mpi` launcher? If not, I think it's better to do the same thing in the other launchers (e.g. `_init_dist_pytorch`), or maybe we can simply place this import statement in `init_dist`?
Hi, from my understanding, currently when smdistributed is enabled it is always automatically launched with mpirun.
Sounds reasonable
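For reference, the alternative raised in this thread (doing the import once, outside any single launcher) could look roughly like the sketch below. The helper name `require_backend_module` and its lookup table are hypothetical and not part of this PR; only the `smddp` module path is taken from the diff above.

```python
import importlib


def require_backend_module(backend: str) -> None:
    """Import the plugin module a backend needs, or fail with a clear hint.

    Hypothetical helper: a shared entry point such as ``init_dist`` could
    call it before dispatching to a launcher-specific function.
    """
    # Hypothetical lookup table; only the smddp entry comes from the PR.
    required = {
        'smddp': 'smdistributed.dataparallel.torch.torch_smddp',
    }
    module_name = required.get(backend)
    if module_name is None:
        return  # no extra import registered for this backend
    try:
        # Importing the module is what makes the backend available.
        importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            f'Backend {backend!r} requires {module_name!r}. '
            'Please use an Amazon SageMaker DLC to access smdistributed.'
        ) from e
```

Since smdistributed jobs are currently always launched with mpirun, keeping the check inside `_init_dist_mpi` as the PR does is equivalent in practice; a shared helper would only matter if another launcher gained smddp support.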
* Add smddp dist backend option
* [Dev]: Upgrade pre commit hooks (open-mmlab#576)
* Upgrade the versions of pre-commit-hooks
* update zh-cn.yaml
* [Docs] Fix the docstring of model sub-package (open-mmlab#573)
* [Doc]: Update config.md (open-mmlab#562)
* Update config.md
* Update config.md
* [Doc] delete the error comment in docs (open-mmlab#514)

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: Zhengfei-0311 <78833899+Zhengfei-0311@users.noreply.github.com>
Co-authored-by: vansin <msnode@163.com>
Motivation
This adds smddp backend support for training on Amazon SageMaker, as discussed in #567.
Modification
SMDDP launches with MPI, so this updates `get_comm_device` and `_init_dist_mpi` to support this backend.
BC-breaking (Optional)
I don't think there are any breaking changes.
Use cases (Optional)
This is for use with the SageMaker PyTorch Estimator. For example:
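A minimal sketch of such a launch, assuming the SageMaker Python SDK's `PyTorch` estimator and its `distribution` argument. The helper names, instance type, instance count, and framework/Python versions are placeholders chosen for illustration, not values from this PR.

```python
def smddp_estimator_kwargs(entry_point: str, role: str) -> dict:
    """Build keyword arguments for sagemaker.pytorch.PyTorch with SMDDP on.

    Hypothetical helper; every concrete value below is a placeholder.
    """
    return {
        'entry_point': entry_point,          # training script to run
        'role': role,                        # IAM role ARN for the job
        'instance_count': 2,                 # placeholder
        'instance_type': 'ml.p4d.24xlarge',  # placeholder
        'framework_version': '1.12',         # placeholder
        'py_version': 'py38',                # placeholder
        # Enables the SageMaker distributed data parallel library.
        'distribution': {
            'smdistributed': {'dataparallel': {'enabled': True}},
        },
    }


def launch(entry_point: str, role: str) -> None:
    """Start the training job (requires the sagemaker package and AWS access)."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    estimator = PyTorch(**smddp_estimator_kwargs(entry_point, role))
    estimator.fit()
```

Inside the training script, the launcher would then be `mpi` and the distributed backend `smddp`; how those two values are passed depends on the training configuration in use.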