
Distrib #635

Merged (29 commits) on Oct 24, 2019
Changes from 1 commit
Commits (29)
2afc205  [WIP] Added cifar10 distributed example (vfdev-5, Aug 1, 2019)
7b8eac9  [WIP] Metric with all reduce decorator and tests (vfdev-5, Aug 1, 2019)
c7d2337  [WIP] Added tests for accumulation metric (vfdev-5, Aug 1, 2019)
69ced1e  [WIP] Updated with reinit_is_reduced (vfdev-5, Aug 1, 2019)
f2f923b  [WIP] Distrib adaptation for other metrics (vfdev-5, Aug 2, 2019)
d13b985  [WIP] Warnings for EpochMetric and Precision/Recall when distrib (vfdev-5, Aug 2, 2019)
e7d12d0  Updated metrics and tests to run on distributed configuration (vfdev-5, Aug 3, 2019)
0a5f582  Minor fixes and cosmetics (vfdev-5, Aug 3, 2019)
954269c  Merge branch 'master' into distrib (vfdev-5, Aug 3, 2019)
206f2e1  Fixed bugs and improved contrib/cifar10 example (vfdev-5, Aug 3, 2019)
99a6b4a  Updated docs (vfdev-5, Aug 3, 2019)
3eff370  Update metrics.rst (vfdev-5, Aug 6, 2019)
ad8375c  Updated docs and set device as "cuda" in distributed instead of raising error (vfdev-5, Aug 6, 2019)
0bcc287  [WIP] Fix missing _is_reduced in precision/recall with tests (vfdev-5, Aug 7, 2019)
1bda698  Merge remote-tracking branch 'origin' into distrib (vfdev-5, Aug 7, 2019)
7dd6937  Updated other tests (vfdev-5, Aug 7, 2019)
27324dc  Merge branch 'master' into distrib (vfdev-5, Aug 29, 2019)
f4a3d4b  Updated travis and renamed tbptt test gpu -> cuda (vfdev-5, Aug 29, 2019)
2036075  Distrib (#573) (vfdev-5, Aug 30, 2019)
69502fc  Merge branch 'distrib' of https://github.com/pytorch/ignite into distrib (vfdev-5, Sep 9, 2019)
d52c36d  Merge branch 'master' into distrib (vfdev-5, Sep 9, 2019)
ecb00a5  Merge branch 'master' into distrib (vfdev-5, Sep 13, 2019)
71836aa  Merge branch 'master' into distrib (vfdev-5, Sep 25, 2019)
46cdd86  Compute IoU, Precision, Recall based on CM on CPU (vfdev-5, Sep 26, 2019)
fd14d4d  Fixes incomplete merge with 1856c8e0f1be102d4530592bcb7caac690f198c4 (vfdev-5, Sep 26, 2019)
59b894c  Merge branch 'master' into distrib (vfdev-5, Oct 17, 2019)
80ad40a  Update distrib branch and CIFAR10 example (#647) (vfdev-5, Oct 22, 2019)
8288831  Finalized Cifar10 example (#649) (vfdev-5, Oct 24, 2019)
25db95b  Merge branch 'master' into distrib (vfdev-5, Oct 24, 2019)
Updated docs and set device as "cuda" in distributed instead of raising error
vfdev-5 committed Aug 6, 2019
commit ad8375c27644acac46b8a6ff9de1ea2bc022762c
4 changes: 2 additions & 2 deletions examples/contrib/cifar10/main.py
@@ -37,8 +37,8 @@ def run(output_path, config):
 
     distributed = backend is not None
     if distributed:
-        torch.cuda.device(config['local_rank'])
-        device = "cuda:{}".format(config['local_rank'])
+        torch.cuda.set_device(config['local_rank'])
+        device = "cuda"
 
     train_labelled_loader, test_loader = \
         get_train_test_loaders(path=config['data_path'],
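
For context, a minimal sketch of the launch pattern this change relies on. This is an assumption-laden illustration, not taken from the example's config: it presumes one process per GPU started with something like `python -m torch.distributed.launch --nproc_per_node=N main.py` and an NCCL backend.

    # Sketch only: bind each launched process to its own GPU so that a plain
    # "cuda" device string afterwards resolves to the right card.
    import argparse

    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
    args = parser.parse_args()

    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)  # counterpart of the set_device call added above
    device = "cuda"                         # now implicitly means "cuda:<local_rank>"
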
6 changes: 4 additions & 2 deletions ignite/metrics/accumulation.py
@@ -29,8 +29,10 @@ class VariableAccumulation(Metric):
             :class:`~ignite.engine.Engine`'s `process_function`'s output into the
             form expected by the metric. This can be useful if, for example, you have a multi-output model and
             you want to compute the metric with respect to one of the outputs.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
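
A hedged usage sketch of the two documented forms of the `device` argument; `local_rank` is a placeholder here and would normally come from the distributed launcher.

    import torch
    from ignite.metrics import VariableAccumulation

    local_rank = 0  # placeholder; normally provided by the launcher

    # Explicit per-process device:
    acc = VariableAccumulation(lambda a, x: a + x, device="cuda:{}".format(local_rank))

    # Equivalent shorthand once the current device has been set for this process:
    torch.cuda.set_device(local_rank)
    acc_short = VariableAccumulation(lambda a, x: a + x, device="cuda")
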
6 changes: 4 additions & 2 deletions ignite/metrics/accuracy.py
@@ -109,8 +109,10 @@ def thresholded_output_transform(output):
             form expected by the metric. This can be useful if, for example, you have a multi-output model and
             you want to compute the metric with respect to one of the outputs.
         is_multilabel (bool, optional): flag to use in multilabel case. By default, False.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
6 changes: 4 additions & 2 deletions ignite/metrics/confusion_matrix.py
@@ -28,8 +28,10 @@ class ConfusionMatrix(Metric):
             :class:`~ignite.engine.Engine`'s `process_function`'s output into the
             form expected by the metric. This can be useful if, for example, you have a multi-output model and
             you want to compute the metric with respect to one of the outputs.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
6 changes: 4 additions & 2 deletions ignite/metrics/loss.py
@@ -23,8 +23,10 @@ class Loss(Metric):
             keywords arguments.
         batch_size (callable): a callable taking a target tensor that returns the
             first dimension size (usually the batch size).
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
9 changes: 5 additions & 4 deletions ignite/metrics/metric.py
@@ -22,8 +22,10 @@ class Metric(with_metaclass(ABCMeta, object)):
             :class:`~ignite.engine.Engine`'s `process_function`'s output into the
             form expected by the metric. This can be useful if, for example, you have a multi-output model and
             you want to compute the metric with respect to one of the outputs.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
 
@@ -33,8 +35,7 @@ def __init__(self, output_transform=lambda x: x, device=None):
         # Check device if distributed is initialized:
         if torch.distributed.is_available() and torch.distributed.is_initialized():
             if device is None:
-                raise ValueError("Please provide the device for distributed computation. "
-                                 "In most of the cases, it should defined as 'cuda:local_rank'.")
+                device = "cuda"
             device = torch.device(device)
         self._device = device
         self._is_reduced = False
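
A short sketch of what the `__init__` change means in practice, assuming a distributed process group is already initialized; `Accuracy` is used here only as a convenient concrete `Metric` subclass.

    import torch.distributed as dist
    from ignite.metrics import Accuracy

    assert dist.is_available() and dist.is_initialized()

    # Before this commit: ValueError asking for an explicit device.
    # After this commit: the metric falls back to torch.device("cuda"), i.e. the
    # device previously selected via torch.cuda.set_device(local_rank).
    acc = Accuracy()

    # Passing an explicit per-process device still works as before.
    acc_explicit = Accuracy(device="cuda:0")
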
6 changes: 4 additions & 2 deletions ignite/metrics/precision.py
@@ -101,8 +101,10 @@ def thresholded_output_transform(output):
             in multiclass case), otherwise, returns a tensor with the precision (for each class in multiclass case).
         is_multilabel (bool, optional) flag to use in multilabel case. By default, value is False. If True, average
             parameter should be True and the average is computed across samples, instead of classes.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
6 changes: 4 additions & 2 deletions ignite/metrics/recall.py
@@ -58,8 +58,10 @@ def thresholded_output_transform(output):
             in multiclass case), otherwise, returns a tensor with the precision (for each class in multiclass case).
         is_multilabel (bool, optional) flag to use in multilabel case. By default, value is False. If True, average
             parameter should be True and the average is computed across samples, instead of classes.
-        device (str of torch.device): device specification in case of distributed computation usage.
-            In most of the cases, it should defined as "cuda:local_rank".
+        device (str of torch.device, optional): device specification in case of distributed computation usage.
+            In most of the cases, it can be defined as "cuda:local_rank" or "cuda"
+            if already set `torch.cuda.set_device(local_rank)`. By default, if a distributed process group is
+            initialized and available, device is set to `cuda`.
 
     """
17 changes: 0 additions & 17 deletions tests/ignite/metrics/test_metric.py
@@ -469,27 +469,10 @@ def test__sync_all_reduce():
 @pytest.mark.skipif(torch.cuda.device_count() < 1, reason="Skip if no GPU")
 def test_distrib(local_rank, distributed_context_single_node):
 
-    def test_distrib_no_device_metric():
-        import torch.distributed as dist
-        assert dist.is_available() and dist.is_initialized()
-
-        with pytest.raises(ValueError, match=r"Please provide the device for distributed computation."):
-            DummyMetric()
-
-    test_distrib_no_device_metric()
-
     def test_distrib__sync_all_reduce():
         import torch.distributed as dist
         assert dist.is_available() and dist.is_initialized()
 
-        # # This test should be the first in the list, otherwise stucked
-        # # The following test aimed to check the transfer from another cuda device to the default one
-        # # However, this test sometimes gets stucked
-        # m = DummyMetric(device="cuda:{}".format(local_rank))
-        # t = torch.tensor(10, device="cuda:1")
-        # res = m._sync_all_reduce(t)
-        # assert res.item() == 10 * dist.get_world_size()
-
         device = "cuda:{}".format(local_rank)
 
         m = DummyMetric(device=device)
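
For reference, a sketch of the all-reduce semantics the surviving test exercises, reconstructed from the removed commented-out block. `_Dummy` is a hypothetical stand-in for the test's `DummyMetric`, and the assumption is that any concrete `Metric` subclass exposes the internal `_sync_all_reduce` helper introduced by this PR.

    import torch
    import torch.distributed as dist
    from ignite.metrics import Metric

    # Hypothetical minimal Metric subclass (DummyMetric lives in the test module).
    class _Dummy(Metric):
        def reset(self): pass
        def update(self, output): pass
        def compute(self): pass

    local_rank = dist.get_rank()  # assumption: single node, one process per GPU
    device = "cuda:{}".format(local_rank)

    m = _Dummy(device=device)
    t = torch.tensor(10, device=device)

    # _sync_all_reduce sums the tensor across all participating processes,
    # so a scalar of 10 becomes 10 * world_size on every rank.
    res = m._sync_all_reduce(t)
    assert res.item() == 10 * dist.get_world_size()
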