-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug/4319 ddp checkpoint #4323
Bug/4319 ddp checkpoint #4323
Conversation
@@ -188,6 +189,13 @@ def get_device_ids(self): | |||
device_ids = [self.trainer.root_gpu] | |||
return device_ids | |||
|
|||
def sync_processes_best_model_path(self): | |||
best_model_path = self.broadcast(self.trainer.checkpoint_callback.best_model_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm curious why the broadcast is needed. the issue in #4323 seems like metrics aren't synced globally, leading to inconsistent paths across ranks.
if we want to cover this case, do we also need to broadcast other properties like best_model_score
from rank 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think #4323 is really covered well in this PR, and I was questioning whether whether to include the broadcast or not... the primary thing we have to do is await all processes completion of saving for #4319. Forcing the same best model path on all ranks allows them to correctly pick the best model, as rank 0 is the only rank that has control over the saving of checkpoints.
If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the metrics API which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html
I'm still curious if theres a better way to handle the natively the different val scores, but unsure if it's required for this PR. Maybe omitting the broadcast and just doing a barrier would suffice, thoughts @ananthsub?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the metrics API which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html
I totally agree
If we really need the broadcasts, can they move to the model checkpoint callback? the callback could implement the hook for on_fit_end
and do the broadcasts inside so we can keep the properties in sync in a single location
other notes:
- do we need these on other accelerators? it looks like the same bug could affect the DDP torchelastic accelerator
- we can skip this barrier if the checkpoint callback doesn't exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah good shout, will do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ananthsub let me know if you agree with the changes I just made :)
EDIT: bit early, need to fix the integration still it seems...
EDIT2: works fine, my test deletes the temp dir a bit too early, because it happens across procs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I'm debugging what I think is the same race condition in DDP torchelastic. I think we need the same barrier across the DDP accelerators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want me to add them across the accelerators in the same place? In particular the ddp_torchelastic accelerator, unsure what else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williamFalcon is this needed across all DDP accelerators? ddp/ddp_cpu/ddp2 + slurm and torchelastic variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. i think any dist type of call needs to be implemented in each accelerator with however they do it and then call it from the accelerator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie: change
if tpu:
x()
if gpu:
y()
...
TO
TPU:
def shared_fx():
x()
GPU
def shared_fx():
y()
then you use
accelerator.shared_fx()
Windows tests are failing due to lack of torch dist being available here: any ideas to ensure this check @ananthsub? This is starting to bleed into what may be the accelerators role... I could modify the check for 'ddp' but that seems a bit hacky |
@@ -145,6 +145,7 @@ def train(self): | |||
results = self.ddp_train(process_idx=self.task_idx, model=model) | |||
if 'WORLD_SIZE' in os.environ: | |||
del os.environ['WORLD_SIZE'] | |||
self.barrier('ddp_end_train') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should land the barriers across DDP accelerators first as that'll solve the primary race condition. We're now getting some crashes from this bug so I want to mitigate as quickly as possible. cc @williamFalcon
We can address the possible rank inconsistency separately and whether/how to broadcast checkpoint state across ranks. This scenario seems inevitable right now:
- the checkpoint callback running on rank 0 thinks the best model path is A due to the metrics it sees
- the checkpoint callback running on rank 1 thinks the best model path is B
- the callback is configured with
save_top_k > 1
so not all checkpoints are saved - The top-K checkpoints fall out of sync across ranks due to differences in local metrics
- We delete checkpoints outside the top-K from rank 0
- Later, rank 1 tries to load the checkpoint (based on its local state), but it turns out it's been deleted
I'm conflicted: broadcasting from rank-0 would make it deterministic, but it might hide more issues in the user's code. I think we'd need to clearly document it and have a giant warning that metrics need to be synced across ranks in distributed training when saving checkpoints based on that monitored value. the same could apply for early stopping too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the barrier be done per accelerator? or should it be added to the trainer itself, right after this spot? https://github.com/PyTorchLightning/pytorch-lightning/blob/f6efb712edccfa83d630db3e97493cd36ab71497/pytorch_lightning/trainer/trainer.py#L439-L440
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree currently the logic is severely broken because of no barrier. From your report it seems like this issue is larger than just DDP (which was my worry), so I think enforcing a barrier in the trainer is important.
EDIT:
The alternative I see is by incorporating it into the teardown of each accelerator (that overrides the teardown), but that makes it less obvious of the importance. However the test made me release the logic within the TPU accelerator doesn't parallelise teardown/any further communication. I'll make changes to the branch and let me know what you think @ananthsub
7f1a01a
to
473a3c5
Compare
…for main process to save
…to ensure all processes complete
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
Co-authored-by: ananthsub <ananth.subramaniam@gmail.com>
… syncronize and wait for all processes to finish before completing training
… from base accelerator
02c4616
to
043e06c
Compare
Codecov Report
@@ Coverage Diff @@
## master #4323 +/- ##
======================================
- Coverage 93% 93% -0%
======================================
Files 111 111
Lines 8024 8021 -3
======================================
- Hits 7462 7459 -3
Misses 562 562 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much simpler!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
I can confirm for dpp the issue on master and this PR fixes it.
wow, yes, very elegant fix. I missed the question about whether we need to do a barrier on each accelerator. The answer to that is yes. Anytime we need a barrier we should call accelerator.barrier(). In accelerators without a barrier (like CPU), that's a no-op anyhow. |
Thanks for the quick fix @SeanNaren and the great review @ananthsub ! |
this was a fun one :) It's pretty amazing how much restructuring there's been as a result of #3545 , especially for DDP. awesome fix @SeanNaren ! |
What does this PR do?
Fixes #4319 and related to #4223. We now synchronise the rank 0 checkpoint, as well as wait for saving to finish before ending ddp train.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃