Distrib #635

Merged
merged 29 commits into from
Oct 24, 2019


vfdev-5
Collaborator

@vfdev-5 vfdev-5 commented Sep 25, 2019

Fixes #568

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

vfdev-5 added 25 commits August 1, 2019 02:18
* [WIP] Added cifar10 distributed example

* [WIP] Metric with all reduce decorator and tests

* [WIP] Added tests for accumulation metric

* [WIP] Updated with reinit_is_reduced

* [WIP] Distrib adaptation for other metrics

* [WIP] Warnings for EpochMetric and Precision/Recall when distrib

* Updated metrics and tests to run on distributed configuration
- Tested on 2 GPUs, single node
- Added a command in .travis.yml showing how to run the tests locally
- Updated travis to run tests in 4 processes
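
The commits above mention an all-reduce decorator and a `reinit_is_reduced` helper for metrics. The idea can be sketched as follows. This is a minimal pure-Python simulation, not the code from this PR: `all_reduce_sum`, `reinit_is_reduced_sketch`, `MeanMetric` and `sync_metrics` are hypothetical names, and the "workers" are plain objects in one process rather than real distributed processes.

```python
from functools import wraps


def all_reduce_sum(per_worker_values):
    """Simulated collective: every worker ends up with the global sum."""
    total = sum(per_worker_values)
    return [total] * len(per_worker_values)


def reinit_is_reduced_sketch(method):
    """Hypothetical decorator: any state change marks the metric as not yet reduced."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        self.is_reduced = False
        return result
    return wrapper


class MeanMetric:
    """Toy accumulation metric; one instance lives on each simulated worker."""

    def __init__(self):
        self.reset()

    @reinit_is_reduced_sketch
    def reset(self):
        self.total = 0.0
        self.count = 0

    @reinit_is_reduced_sketch
    def update(self, values):
        self.total += sum(values)
        self.count += len(values)

    def compute(self):
        return self.total / self.count


def sync_metrics(metrics, attrs=("total", "count")):
    """Reduce the accumulators across workers once; skip if already reduced."""
    if all(m.is_reduced for m in metrics):
        return
    for attr in attrs:
        reduced = all_reduce_sum([getattr(m, attr) for m in metrics])
        for metric, value in zip(metrics, reduced):
            setattr(metric, attr, value)
    for metric in metrics:
        metric.is_reduced = True


# Two simulated workers see different shards of the data.
workers = [MeanMetric(), MeanMetric()]
workers[0].update([1.0, 2.0, 3.0])
workers[1].update([5.0])
sync_metrics(workers)
results = [m.compute() for m in workers]  # each worker reports the global mean
```

The flag matters because reducing twice would add the already-summed totals together again. Resetting it on every `update` or `reset` lets the next `compute` know a fresh reduction is needed.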

* Minor fixes and cosmetics

* Fixed bugs and improved contrib/cifar10 example

* Updated docs

* Fixes issue #543 (#572)

* Fixes issue #543

The previous confusion matrix (CM) implementation had a problem when the target contained non-contiguous indices.
The new implementation is adapted almost directly from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117

This commit also removes support for targets of shape (batchsize, num_categories, ...), where num_categories excludes the background class.
Computing the confusion matrix for such targets is almost the same as for (batchsize, ...) targets, but when a target is all zeros (0, ..., 0), i.e. no classes (background class),
the confusion matrix does not count any true/false predictions.
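
To illustrate the counting approach, here is a minimal pure-Python sketch. It assumes the linked torchvision helper works roughly this way: each (target, prediction) pair is encoded as the single index `n * target + prediction` and the indices are counted. The function name `confusion_matrix`, the sample labels and the use of 255 as an ignored label are illustrative assumptions, not taken from this PR.

```python
def confusion_matrix(targets, preds, num_classes):
    """Rows index the true class, columns the predicted class."""
    n = num_classes
    counts = [0] * (n * n)
    for t, p in zip(targets, preds):
        # Skip labels outside [0, n), e.g. a hypothetical 255 "ignore" label.
        if 0 <= t < n:
            counts[n * t + p] += 1
    # Reshape the flat count vector into an n x n matrix.
    return [counts[i * n:(i + 1) * n] for i in range(n)]


# Class 1 never appears in the target, so the target indices are non-contiguous.
targets = [0, 2, 2, 255, 0]
preds = [0, 2, 1, 1, 2]
cm = confusion_matrix(targets, preds, 3)
```

Because the matrix is always sized by `num_classes` rather than by the labels present in a batch, a class missing from the target just leaves its row at zero.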

* Update confusion_matrix.py

* Update metrics.rst

* Updated docs and set device as "cuda" in distributed instead of raising error

* [WIP] Fix missing _is_reduced in precision/recall with tests

* Updated other tests

* Added mlflow logger (#558)

* Added mlflow logger without tests

* Added mlflow tests, updated mlflow logger code and other tests

* Updated docs and added mlflow in travis

* Added tests for mlflow OptimizerParamsHandler
- additionally added OptimizerParamsHandler for plx with tests

* Update to PyTorch v1.2.0 (#580)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fix SSL problem of failing travis (#581)

* Update .travis.yml

* Update .travis.yml

* Fixed tests and improved travis

* Fixes SSL problem to download model weights

* Fixed travis for deploy and nightly

* Fixes #583 (#584)

* Fixes docs build warnings (#585)

* Return removable handle from Engine.add_event_handler(). (#588)

* Add tests for event removable handle.

Add feature tests for engine.add_event_handler returning removable event
handles.

* Return RemovableEventHandle from Engine.add_event_handler.

* Fixup removable event handle test in python 2.7.

Explicitly trigger gc, allowing cycle detection between engine and
state, in removable handle weakref test. Python 2.7 cycle detection
appears to be less aggressive than python 3+.

* Add removable event handler docs.

Add autodoc configuration for RemovableEventHandler, expand "concepts"
documentation with event remove example following event add example.
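
The removable-handle pattern described in these commits can be sketched in plain Python. This is a simplified stand-in, not the library's implementation: `TinyEngine`, `RemovableHandleSketch` and the string event name are hypothetical. The handle is assumed to hold only a weak reference to the engine, so that keeping the handle alive does not keep the engine alive.

```python
import weakref


class RemovableHandleSketch:
    """Hypothetical handle that can detach one handler from one event."""

    def __init__(self, engine, event, handler):
        self._engine_ref = weakref.ref(engine)  # weak: does not pin the engine
        self._event = event
        self._handler = handler

    def remove(self):
        engine = self._engine_ref()
        if engine is None:  # engine already garbage-collected
            return
        handlers = engine.handlers.get(self._event, [])
        if self._handler in handlers:
            handlers.remove(self._handler)


class TinyEngine:
    """Toy engine that stores handlers per event name."""

    def __init__(self):
        self.handlers = {}

    def add_event_handler(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)
        return RemovableHandleSketch(self, event, handler)

    def fire(self, event):
        # Iterate over a copy so handlers may be removed while firing.
        for handler in list(self.handlers.get(event, [])):
            handler(self)


calls = []
engine = TinyEngine()
handle = engine.add_event_handler("completed", lambda e: calls.append("done"))
engine.fire("completed")  # handler runs
handle.remove()
engine.fire("completed")  # handler no longer registered
```

Calling `remove()` a second time, or after the engine has been collected, is a no-op in this sketch.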

* Update concepts.rst

* Updated travis and renamed tbptt test gpu -> cuda
@vfdev-5 vfdev-5 marked this pull request as ready for review October 17, 2019 16:22
* Added tests with gloo, minor updates and fixes

* Added single/multi node tests with gloo and [WIP] with nccl

* Added tests for multi-node nccl, improved examples/contrib/cifar10 example

* Experiments: 1n1gpu, 1n2gpus, 2n2gpus

* Fix flake8

* Fixes #645 (#646)

- fix CI and improve create_lr_scheduler_with_warmup

* Fix tests for python 2.7
* Added gcp tb logger image and updated README

* Added gcp ai platform scripts to run trainings

* Improved docs and readmes
@vfdev-5 vfdev-5 merged commit 53190db into master Oct 24, 2019
@vfdev-5 vfdev-5 deleted the distrib branch October 24, 2019 20:02

Successfully merging this pull request may close these issues.

Adapt metrics to be used with distributed