Add ddp_cpu backend for testing ddp without GPUs #1158
Conversation
Hello @neggert! Thanks for updating this PR.
Comment last updated at 2020-04-11 22:45:26 UTC
Codecov Report
@@ Coverage Diff @@
## master #1158 +/- ##
======================================
Coverage 91% 91%
======================================
Files 67 67
Lines 3742 3760 +18
======================================
+ Hits 3400 3418 +18
Misses 342 342
Force-pushed from cac041d to 3f1f0e1
Have you considered just making ddp work with cpu, when GPUS=0 or GPUS=None? I think if I was a user that would be the intuitive thing for ddp to do without GPUs.
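(For illustration, a minimal sketch of the fallback being suggested here, consistent with the commit log further down — "Make a few more cases fall back to ddp_cpu", "Tweak warning". The function name resolve_distributed_backend is hypothetical, not the PR's actual code.)

```python
import warnings

def resolve_distributed_backend(requested_backend: str, num_gpus: int) -> str:
    # Hypothetical sketch: if the user asked for GPU-based ddp but no GPUs
    # are available, fall back to running DDP on CPU instead of failing.
    if requested_backend == 'ddp' and num_gpus == 0:
        warnings.warn("You requested 'ddp' with 0 GPUs; falling back to 'ddp_cpu'.")
        return 'ddp_cpu'
    return requested_backend
```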
@@ -855,11 +855,15 @@ def init_ddp_connection(self):
      try:
          root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
      except Exception:
-         root_node = '127.0.0.2'
+         root_node = '127.0.0.1'
why the IP change?
I wish I knew. 127.0.0.1 works, 127.0.0.2 doesn't.
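(Context: 127.0.0.1 is the one loopback address bound by default on every platform — macOS, unlike Linux, does not alias the rest of 127.0.0.0/8 — which is likely why 127.0.0.2 fails. A sketch of the resolution logic in the diff above, not the exact PR code:)

```python
import os

def resolve_root_node() -> str:
    # Inside a SLURM job, the first host in SLURM_NODELIST serves as the
    # rendezvous address; outside SLURM, fall back to localhost.
    try:
        return os.environ['SLURM_NODELIST'].split(' ')[0]
    except KeyError:
        return '127.0.0.1'

os.environ['MASTER_ADDR'] = resolve_root_node()
```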
pytorch_lightning/core/lightning.py (Outdated)
      root_node = self.trainer.resolve_root_node_address(root_node)
      os.environ['MASTER_ADDR'] = root_node
-     torch_distrib.init_process_group('nccl', rank=proc_rank, world_size=world_size)
+     if self.trainer.on_gpu:
why the different backends?
NCCL only works for GPU. The PyTorch docs recommend using GLOO for CPU training: https://pytorch.org/docs/stable/distributed.html#backends
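(A minimal sketch of the backend selection this hunk implements, assuming MASTER_ADDR and MASTER_PORT are already set for the default env:// rendezvous:)

```python
import torch.distributed as torch_distrib

def init_connection(on_gpu: bool, proc_rank: int, world_size: int) -> None:
    # NCCL only supports CUDA tensors; GLOO is the backend PyTorch
    # recommends for CPU training.
    backend = 'nccl' if on_gpu else 'gloo'
    torch_distrib.init_process_group(backend, rank=proc_rank, world_size=world_size)
```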
-     output = self.module(*inputs, **kwargs)
+     # output = self.module(*inputs, **kwargs)
+     # lightning (ddp_cpu)
+     if self.module.training:
why this change?
This is copying the changes you made above for the case where self.device_ids is None. I think previously, we would never hit this code.
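(Roughly, the branch being discussed has this shape — a sketch, not the exact Lightning wrapper code:)

```python
def forward(self, *inputs, **kwargs):
    if self.device_ids is None:
        # ddp_cpu: nothing to scatter across devices, so dispatch directly
        # to the wrapped LightningModule's step method for the current mode.
        if self.module.training:
            return self.module.training_step(*inputs, **kwargs)
        return self.module.validation_step(*inputs, **kwargs)
    # GPU path: scatter inputs across self.device_ids and dispatch per
    # device, as the wrapper code above this hunk already does.
    ...
```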
@tullie As discussed in Slack, I'll change things so that ddp falls back to ddp_cpu when no GPUs are available.
Force-pushed from e610a09 to 58f687a
This pull request is now in conflict... :(
@neggert how is it going here? need any help? 🐰
@neggert can we get this merged so it can go into 0.7.2?
+ if self.num_gpus == 0:
+     pass
why these "if ...: pass" branches?
Just for clarity. I wanted to make clear to the future reader that we're considering this case and intentionally doing nothing.
I would skip it lol
Maybe a comment could achieve the same
Replaced it with a comment
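(For illustration, the before/after of that change might look like this; setup_devices and its body are hypothetical stand-ins, not the PR's code:)

```python
def setup_devices(num_gpus: int) -> None:
    # Before this change, an explicit no-op branch spelled out the CPU case:
    #     if num_gpus == 0:
    #         pass
    # After it, a comment records the intent instead: num_gpus == 0 (ddp_cpu)
    # needs no device setup, so we only act when GPUs are present.
    if num_gpus > 0:
        print(f"configuring {num_gpus} GPU(s)")  # stand-in for real GPU setup
```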
Is this ready for 0.7.2 or push to 0.7.3?
I would leave it for next 0.7.3 @neggert ?
Yeah, let's not rush it into a release. 0.7.3 is fine.
Support for distributed in MacOS was added in Torch 1.3.0
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
* Add tests for distributed backend config
* Refactor set_distributed_mode
* Use gloo backend on cpu
* Use 127.0.0.1 instead of 127.0.0.2. Not totally clear on why this is necessary, but it seems to work
* Update LightningDDP so that it works with CPU
* Add ddp_cpu backend and num_processes Trainer arg
* PEP8
* Fix test skipping. Inequalities are hard :/
* Skip ddp_cpu test on Windows
* Make a few more cases fall back to ddp_cpu
* New function name
* Flake8
* Don't test distributed on MacOS with torch < 1.3. Support for distributed in MacOS was added in Torch 1.3.0
* Add ddp_cpu and num_processes to docs
* Parametrize trainer config tests
* Tweak warning

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Remove redundant test
* Replace pass branches with comments
* Add missing warnings import
* save_path -> root_dir
* Use new rank_zero_warn
* Whitespace
* Apply suggestions from code review
* formatting

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
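(The test-skipping items in that log — "Skip ddp_cpu test on Windows", "Don't test distributed on MacOS with torch < 1.3" — could be expressed as pytest marks roughly like this; the test name is hypothetical and packaging is an assumed dependency:)

```python
import sys

import pytest
import torch
from packaging.version import Version

@pytest.mark.skipif(sys.platform == "win32",
                    reason="ddp_cpu is not tested on Windows")
@pytest.mark.skipif(sys.platform == "darwin" and Version(torch.__version__) < Version("1.3.0"),
                    reason="distributed support on macOS was added in torch 1.3.0")
def test_ddp_cpu_smoke():
    ...  # would build a Trainer with the ddp_cpu backend and run fit
```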
New feature: DDP on CPU. This can be used for distributed training without GPUs, or to test/debug DDP on a single machine without GPUs. Since PyTorch already makes good use of multiple cores under the hood, this will not provide any speedup over normal CPU training if you're only using a single node.
API changes:
* New distributed_backend="ddp_cpu" option.
* New Trainer argument: num_processes. Controls the number of processes per node. It is currently only used by ddp_cpu.

Still need to update the documentation, but I wanted to get some eyes on this before I got too far. Right now everything is functional as far as I can tell, and I've added a new test that covers this feature.
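(For reference, using the new options described above looks like this; MyLightningModule stands in for any user-defined LightningModule:)

```python
from pytorch_lightning import Trainer

# Run DDP across 2 CPU processes on this machine; no GPUs required.
model = MyLightningModule()  # placeholder for your own LightningModule
trainer = Trainer(distributed_backend="ddp_cpu", num_processes=2)
trainer.fit(model)
```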