
add support for sync_bn #2801

Merged
merged 22 commits into from Aug 5, 2020
Conversation

ananyahjha93
Contributor

What does this PR do?

Adds support for global (synchronized) batch norm via sync_bn and allows customizing it by overriding the configure_sync_bn() function in LightningModule.
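
A minimal sketch of the default behaviour, for orientation only: the hook name and signature follow this PR's description and may not match the final released API, while `torch.nn.SyncBatchNorm.convert_sync_batchnorm` is the stock PyTorch converter it wraps.

```python
from torch import nn


def configure_sync_bn(model: nn.Module) -> nn.Module:
    """Default behaviour sketched here: swap every BatchNorm*D layer for
    torch.nn.SyncBatchNorm so statistics are computed over the global batch
    across all DDP processes."""
    return nn.SyncBatchNorm.convert_sync_batchnorm(model)


# Quick local check: the BatchNorm1d layer is replaced by SyncBatchNorm.
net = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4))
net = configure_sync_bn(net)
assert isinstance(net[1], nn.SyncBatchNorm)
```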

Fixes #2589, #2509

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typo and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team August 2, 2020 20:29
@mergify mergify bot requested a review from a team August 3, 2020 07:45
@pep8speaks

pep8speaks commented Aug 3, 2020

Hello @ananyahjha93! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-05 16:14:37 UTC

@mergify
Contributor

mergify bot commented Aug 3, 2020

This pull request is now in conflict... :(

@Borda Borda added the feature Is an improvement or enhancement label Aug 4, 2020
pl_examples/basic_examples/sync_bn.py (review thread, outdated, resolved)
pl_examples/basic_examples/sync_bn.py (review thread, resolved)
pytorch_lightning/core/lightning.py (review thread, outdated, resolved)
@mergify mergify bot requested a review from a team August 4, 2020 09:40
@ananyahjha93
Contributor Author

ananyahjha93 commented Aug 5, 2020

@Borda @williamFalcon removed the apex option as a backend. The tests for apex were initially passing because its sync_bn was falling back on PyTorch's default version. When I reinstalled apex and got it to call its own sync_bn, there were quite a few issues with tensors syncing between GPUs.

Basic idea: the configure_sync_bn function provides a default torch implementation and can be overridden if specific versions are required (see the sketch below).

@justusschock ^^^
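
To make the override path concrete, here is a hedged sketch of a LightningModule that plugs in Apex's converter when it is installed. The hook name and signature are taken from this PR's description and are assumptions about the exact API; the Apex call is shown purely as an illustration of using a specific sync BN version.

```python
import pytorch_lightning as pl
from torch import nn


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

    def forward(self, x):
        return self.net(x)

    def configure_sync_bn(self, model: nn.Module) -> nn.Module:
        try:
            # Use Apex's sync BN conversion when the package is available ...
            from apex.parallel import convert_syncbn_model
            return convert_syncbn_model(model)
        except ImportError:
            # ... otherwise fall back to PyTorch's built-in converter.
            return nn.SyncBatchNorm.convert_sync_batchnorm(model)
```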

@ananyahjha93
Contributor Author

The DDP script tests are not working as of now, so I have added an example in pl_examples/basic_examples which verifies that sync batch-norm works correctly.
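
For readers curious what such a verification can look like, here is a hedged, self-contained sketch of the idea (not the actual pl_examples/basic_examples/sync_bn.py script): two NCCL processes each feed a different batch through one SyncBatchNorm layer, then assert that its running mean reflects the global batch rather than the per-process one. It assumes at least two CUDA devices.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn


def run(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"

    # Each process draws a different local batch of shape (N, C).
    torch.manual_seed(rank)
    local_batch = torch.randn(16, 4, device=device)

    bn = nn.SyncBatchNorm(4).to(device)  # default momentum = 0.1
    bn.train()
    bn(local_batch)

    # Recompute the global mean from every process's batch.
    gathered = [torch.zeros_like(local_batch) for _ in range(world_size)]
    dist.all_gather(gathered, local_batch)
    global_mean = torch.cat(gathered).mean(dim=0)

    # After one step, running_mean = (1 - momentum) * 0 + momentum * global_mean.
    expected = 0.1 * global_mean
    assert torch.allclose(bn.running_mean, expected, atol=1e-5), "sync_bn ignored global stats"
    dist.destroy_process_group()


if __name__ == "__main__":
    if torch.cuda.device_count() < 2:
        print("Needs at least 2 GPUs to demonstrate cross-process syncing.")
    else:
        mp.spawn(run, args=(2,), nprocs=2)
```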

@ananyahjha93 ananyahjha93 changed the title [wip] add support for sync_bn add support for sync_bn Aug 5, 2020
@ananyahjha93 ananyahjha93 requested a review from awaelchli August 5, 2020 10:22
@williamFalcon
Contributor

sorry, why not put this as a test using ddp spawn?

@codecov

codecov bot commented Aug 5, 2020

Codecov Report

Merging #2801 into master will decrease coverage by 31%.
The diff coverage is 20%.

@@           Coverage Diff            @@
##           master   #2801     +/-   ##
========================================
- Coverage      89%     59%    -31%     
========================================
  Files          78      78             
  Lines        7109    6925    -184     
========================================
- Hits         6349    4069   -2280     
- Misses        760    2856   +2096     

@ananyahjha93
Contributor Author

@nateraw test_full_loop_ddp_spawn (in tests/core/test_datamodules.py) is failing in Drone because the test accuracy is not greater than 0.8.

@williamFalcon williamFalcon merged commit e31c520 into master Aug 5, 2020
@ananyahjha93 ananyahjha93 deleted the sync_bn branch August 5, 2020 18:31
Labels
feature Is an improvement or enhancement
Development

Successfully merging this pull request may close these issues.

Add synced batchnorm support
5 participants