Allow user to select individual TPU core to train on #1729
Conversation
added tpu_id to mixins
@lezwon there shouldn't be a new flag, it should behave like gpus... And I guess we need to remove the num_tpu_cores arg and replace it with tpu_cores?

Trainer(tpu_cores=1)
Trainer(tpu_cores=8)
Trainer(tpu_cores=[2])
Trainer(tpu_cores=[6])

@PyTorchLightning/core-contributors
Do we really need a new flag? Can it be solved by generalizing gpus? Rename gpus to something meaningful for both GPU / TPU and eventually CPU, as we also have a distributed CPU backend... What about just ...
@williamFalcon I suppose something like ...
I think we need to be explicit about GPUs and TPUs... let's keep:

Trainer(gpus=...)
Trainer(tpu_cores=...)

The reason is that a TPU has many cores whereas a GPU is a single unit... TPU cores can also be 128, 1024, etc. if run on a pod...
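For illustration, a sketch of the proposed interface as discussed above (the exact argument handling is an assumption, not the merged implementation):

from pytorch_lightning import Trainer

# GPUs are single accelerator units, selected by count or by index list
trainer = Trainer(gpus=2)         # any two available GPUs
trainer = Trainer(gpus=[0, 3])    # specific GPU indices

# a TPU device exposes many cores, so the argument counts cores
trainer = Trainer(tpu_cores=8)    # all eight cores of one TPU device
trainer = Trainer(tpu_cores=[5])  # a single, specific TPU core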
pytorch_lightning/trainer/trainer.py (outdated)

@@ -90,6 +90,7 @@ def __init__(
     gpus: Optional[Union[List[int], str, int]] = None,
     auto_select_gpus: bool = False,
     num_tpu_cores: Optional[int] = None,
+    tpu_id: Optional[int] = None,
so I can use only one TPU? not several with indexes like GPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as far as I know. xla_distributed only supports 1 or 8 cores; we can't selectively choose the cores.
Ref: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
tpu_id is not needed...
@williamFalcon I have replaced it with tpu_cores as you suggested. Valid values are 1, 8, or [<1-(max_cores)>].
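A rough sketch of how those accepted forms could be validated; the helper name _parse_tpu_cores is hypothetical and only illustrates the 1 / 8 / [index] rule, it is not code from this PR:

from typing import List, Optional, Union

def _parse_tpu_cores(tpu_cores: Union[int, List[int], None]) -> Optional[Union[int, List[int]]]:
    # None -> TPU training disabled
    if tpu_cores is None:
        return None
    # an int must be 1 (single core) or 8 (all cores of one device)
    if isinstance(tpu_cores, int):
        if tpu_cores not in (1, 8):
            raise ValueError("tpu_cores as an int can only be 1 or 8")
        return tpu_cores
    # a one-element list picks a specific core index, e.g. [5]
    if isinstance(tpu_cores, list) and len(tpu_cores) == 1:
        return tpu_cores
    raise ValueError("tpu_cores must be 1, 8, or a single-element list like [5]")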
743bc4a to c0a4f9d
This pull request is now in conflict... :(
Codecov Report

@@           Coverage Diff            @@
##           master   #1729   +/-   ##
=======================================
  Coverage        ?     88%
=======================================
  Files           ?      69
  Lines           ?    4163
  Branches        ?       0
=======================================
  Hits            ?    3674
  Misses          ?     489
  Partials        ?       0
This pull request is now in conflict... :(
removed self.tpu_id for ParallelLoader
@williamFalcon @Borda I need a review on this PR.
@@ -498,7 +499,7 @@ def single_gpu_train(self, model):

     def tpu_train(self, tpu_core_idx, model):
         # put model on tpu
-        model.to(xm.xla_device())
+        model.to(xm.xla_device(self.tpu_id))
I think this now makes it ONLY possible to train on 1 core, no? Not multiple cores.
I think so... @lezwon ^^
I have noticed that if self.tpu_id is None and I use xmp.spawn, the model trains at the same speed as when all cores are being used, so I assumed that all cores are being used. I could add some logging to confirm, or maybe just add a conditional for xm.xla_device()?
ONLY when the user requests a specific TPU index should we use model.to(xm.xla_device(self.tpu_id)); otherwise, leave it as it was (see the sketch below).

@Borda we need TPU tests to make sure this PR doesn't break functionality.
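A minimal sketch of that behaviour, reusing the self.tpu_id attribute from the diff above (illustrative only, not the merged code):

import torch_xla.core.xla_model as xm

def tpu_train(self, tpu_core_idx, model):
    # Pass an explicit ordinal only when the user asked for a specific core;
    # otherwise let XLA pick the device for the current spawned process.
    if self.tpu_id is not None:
        device = xm.xla_device(self.tpu_id)
    else:
        device = xm.xla_device()
    model.to(device)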
This pull request is now in conflict... :(

@Borda I have made the requested changes. Need your review on it :]

@williamFalcon sure.. Let's do it 👍

give me a sec to check it...
@@ -189,7 +189,10 @@ def __init__(
             GPUs are configured to be in "exclusive mode", such
             that only one process at a time can access them.

-            num_tpu_cores: How many TPU cores to train on (1 or 8).
+            tpu_cores: How many TPU cores to train on (1 or 8) / Single TPU to train on [1]
just 1 or 8, nothing in between?
Yes. The nprocs argument for xmp.spawn supports either 1 or the maximum number of devices.
Source: https://pytorch.org/xla/release/1.5/index.html#torch_xla.distributed.xla_multiprocessing.spawn
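For context, a sketch of that call-site constraint (the training function body here is a placeholder, not code from this PR):

import torch_xla.distributed.xla_multiprocessing as xmp

def _train_fn(index, flags):
    # each spawned process receives its own ordinal index
    pass

# nprocs must be 1 (a single process) or the maximum number of devices, e.g. 8;
# values in between are rejected by xmp.spawn
xmp.spawn(_train_fn, args=({},), nprocs=8, start_method='fork')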
        if tpu_cores is None:
            tpu_cores = num_tpu_cores

        self.on_tpu = tpu_cores is not None
This is not directly related to this PR, but it is not clear what will happen if a user sets gpus and tpu_cores at the same time.
I will check this out and provide an update :]
@Borda I ran this on Kaggle:

trainer = pl.Trainer(tpu_cores=[1], gpus=[2], precision=16, max_epochs=20)

It threw an exception:
MisconfigurationException:
You requested GPUs: [2]
But your machine only has: []
I know, it was not related to this PR comment, but more of a conceptual question for us: how should we handle these configurations?
What would be your expected behaviour as a user if you set both gpus and tpu_cores?
- take whichever of them is available
- which one has higher priority (to be used) if both are available
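Purely as an illustration, a third option would be to refuse the ambiguous configuration outright. A hypothetical sketch of such a guard; the helper name is invented, the import path may differ between Lightning versions, and this is not Lightning's actual behaviour:

# Hypothetical sketch only -- not the implemented behaviour in this PR.
from pytorch_lightning.utilities.exceptions import MisconfigurationException

def check_accelerator_flags(gpus, tpu_cores):
    # Refuse ambiguous configurations instead of silently preferring one
    # accelerator over the other.
    if gpus and tpu_cores:
        raise MisconfigurationException(
            "Both gpus and tpu_cores were set; please request only one accelerator."
        )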
* fix chlog
* test for #1729
* hist
* update
* Document use case of passing test dataloaders to Trainer.test() (#1992)
* Issue 1990 Doc patch.
* Codeblock directive.
* Update to reflect current state of pytorch-lightning
* Final grammar cleaning. I hope these commits are squashed.
* Apply suggestions from code review
* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: authman <uapatira@gmail.com>
Before submitting
What does this PR do?
Fixes #1539
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃