Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add smddp dist backend option #579

Merged
merged 6 commits into from
Oct 8, 2022
Merged

Add smddp dist backend option #579

merged 6 commits into from
Oct 8, 2022

Conversation

austinmw
Copy link
Contributor

@austinmw austinmw commented Oct 7, 2022

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

This adds smddp backend support for training on Amazon SageMaker as discussed in #567

Modification

SMDDP launches with MPI, so this updates get_comm_device and _init_dist_mpi to support this backend.

BC-breaking (Optional)

I don't think there's any breaking changes.

Use cases (Optional)

This is for use with the SageMaker PyTorch Estimator. For example:

from sagemaker.pytorch import PyTorch

est = PyTorch(
    image_uri=image_uri,
    source_dir='code',
    entry_point='sagemaker_mmdetection_train.py',
    instance_type='ml.p4d.24xlarge',
    instance_count=2,
    hyperparameters={
        'config': '/opt/ml/code/configs/yolox/yolox_s_8xb8-300e_coco.py',
        'work-dir': '/opt/ml/checkpoints',
        'launcher': 'mpi',
        'cfg-options': "env_cfg.dist_cfg.backend='smddp'",
    },
    distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}
)

est.fit({'coco' : s3_data_path})

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@austinmw
Copy link
Contributor Author

austinmw commented Oct 7, 2022

Not sure how to handle the flake8 AttributeError

@zhouzaida
Copy link
Collaborator

Hi @austinmw , thanks for your contribution. The flake8 error will be fixed by #576.

@zhouzaida zhouzaida requested a review from C1rN09 October 8, 2022 02:20
@ZwwWayne
Copy link
Collaborator

ZwwWayne commented Oct 8, 2022

#576 is merged, please rebase your code to the most recent main branch to avoid lint issue.

@ZwwWayne ZwwWayne added this to the 0.2.0 milestone Oct 8, 2022
Comment on lines +97 to +105
if backend == 'smddp':
try:
import smdistributed.dataparallel.torch.torch_smddp # noqa: F401
except ModuleNotFoundError as e:
raise ModuleNotFoundError(
'Please use an Amazon SageMaker DLC to access smdistributed: '
'https://github.com/aws/deep-learning-containers/blob/master'
'/available_images.md#sagemaker-framework-containers'
'-sm-support-only') from e
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I have no experience with Amazon SageMaker, but is smddp only compatible with mpi launcher? If not, I think it's better to do the same thing in other launcher (e.g. _init_dist_pytorch), or maybe we can simply place this import statement into init_dist?

Copy link
Contributor Author

@austinmw austinmw Oct 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, from my understanding, currently when smdistributed is enabled it is always automatically launched with mpirun.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable

@ZwwWayne ZwwWayne merged commit 8864bd8 into open-mmlab:main Oct 8, 2022
C1rN09 pushed a commit to C1rN09/mmengine that referenced this pull request Nov 1, 2022
* Add smddp dist backend option

* [Dev]: Upgrade pre commit hooks (open-mmlab#576)

* Upgrade the versions of pre-commit-hooks

* update zh-cn.yaml

* [Docs] Fix the docstring of model sub-package (open-mmlab#573)

* [Doc]: Update config.md (open-mmlab#562)

* Update config.md

* Update config.md

* [Doc] delete the error comment  in docs (open-mmlab#514)

Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: Zhengfei-0311 <78833899+Zhengfei-0311@users.noreply.github.com>
Co-authored-by: vansin <msnode@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants