Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/4319 ddp checkpoint #4323

Merged
merged 10 commits into from
Oct 24, 2020
Merged

Bug/4319 ddp checkpoint #4323

merged 10 commits into from
Oct 24, 2020

Conversation

SeanNaren
Copy link
Contributor

@SeanNaren SeanNaren commented Oct 23, 2020

What does this PR do?

Fixes #4319 and related to #4223. We now synchronise the rank 0 checkpoint, as well as wait for saving to finish before ending ddp train.

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team October 23, 2020 15:47
@SeanNaren SeanNaren added the bug Something isn't working label Oct 23, 2020
@@ -188,6 +189,13 @@ def get_device_ids(self):
device_ids = [self.trainer.root_gpu]
return device_ids

def sync_processes_best_model_path(self):
best_model_path = self.broadcast(self.trainer.checkpoint_callback.best_model_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm curious why the broadcast is needed. the issue in #4323 seems like metrics aren't synced globally, leading to inconsistent paths across ranks.

if we want to cover this case, do we also need to broadcast other properties like best_model_score from rank 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think #4323 is really covered well in this PR, and I was questioning whether whether to include the broadcast or not... the primary thing we have to do is await all processes completion of saving for #4319. Forcing the same best model path on all ranks allows them to correctly pick the best model, as rank 0 is the only rank that has control over the saving of checkpoints.

If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the metrics API which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html

I'm still curious if theres a better way to handle the natively the different val scores, but unsure if it's required for this PR. Maybe omitting the broadcast and just doing a barrier would suffice, thoughts @ananthsub?

Copy link
Contributor

@ananthsub ananthsub Oct 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're getting different scores across processes, this is a deeper issue of how to consolidate them. I think moving forward we should push the metrics API which handles syncing of metrics across ranks: https://pytorch-lightning.readthedocs.io/en/latest/metrics.html

I totally agree

If we really need the broadcasts, can they move to the model checkpoint callback? the callback could implement the hook for on_fit_end and do the broadcasts inside so we can keep the properties in sync in a single location

other notes:

  • do we need these on other accelerators? it looks like the same bug could affect the DDP torchelastic accelerator
  • we can skip this barrier if the checkpoint callback doesn't exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good shout, will do!

Copy link
Contributor Author

@SeanNaren SeanNaren Oct 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ananthsub let me know if you agree with the changes I just made :)

EDIT: bit early, need to fix the integration still it seems...

EDIT2: works fine, my test deletes the temp dir a bit too early, because it happens across procs :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'm debugging what I think is the same race condition in DDP torchelastic. I think we need the same barrier across the DDP accelerators

Copy link
Contributor Author

@SeanNaren SeanNaren Oct 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want me to add them across the accelerators in the same place? In particular the ddp_torchelastic accelerator, unsure what else

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williamFalcon is this needed across all DDP accelerators? ddp/ddp_cpu/ddp2 + slurm and torchelastic variants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. i think any dist type of call needs to be implemented in each accelerator with however they do it and then call it from the accelerator.

Copy link
Contributor

@williamFalcon williamFalcon Oct 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie: change

if tpu:
  x()

if gpu:
  y()

...

TO

TPU:

def shared_fx():
   x()

GPU

def shared_fx():
   y()

then you use
accelerator.shared_fx()

@SeanNaren
Copy link
Contributor Author

SeanNaren commented Oct 23, 2020

Windows tests are failing due to lack of torch dist being available here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/bug/4319-ddp-checkpoint/pytorch_lightning/callbacks/model_checkpoint.py#L242

any ideas to ensure this check @ananthsub? This is starting to bleed into what may be the accelerators role... I could modify the check for 'ddp' but that seems a bit hacky

@@ -145,6 +145,7 @@ def train(self):
results = self.ddp_train(process_idx=self.task_idx, model=model)
if 'WORLD_SIZE' in os.environ:
del os.environ['WORLD_SIZE']
self.barrier('ddp_end_train')
Copy link
Contributor

@ananthsub ananthsub Oct 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should land the barriers across DDP accelerators first as that'll solve the primary race condition. We're now getting some crashes from this bug so I want to mitigate as quickly as possible. cc @williamFalcon

We can address the possible rank inconsistency separately and whether/how to broadcast checkpoint state across ranks. This scenario seems inevitable right now:

  • the checkpoint callback running on rank 0 thinks the best model path is A due to the metrics it sees
  • the checkpoint callback running on rank 1 thinks the best model path is B
  • the callback is configured with save_top_k > 1 so not all checkpoints are saved
  • The top-K checkpoints fall out of sync across ranks due to differences in local metrics
  • We delete checkpoints outside the top-K from rank 0
  • Later, rank 1 tries to load the checkpoint (based on its local state), but it turns out it's been deleted

I'm conflicted: broadcasting from rank-0 would make it deterministic, but it might hide more issues in the user's code. I think we'd need to clearly document it and have a giant warning that metrics need to be synced across ranks in distributed training when saving checkpoints based on that monitored value. the same could apply for early stopping too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the barrier be done per accelerator? or should it be added to the trainer itself, right after this spot? https://github.com/PyTorchLightning/pytorch-lightning/blob/f6efb712edccfa83d630db3e97493cd36ab71497/pytorch_lightning/trainer/trainer.py#L439-L440

Copy link
Contributor Author

@SeanNaren SeanNaren Oct 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree currently the logic is severely broken because of no barrier. From your report it seems like this issue is larger than just DDP (which was my worry), so I think enforcing a barrier in the trainer is important.

EDIT:

The alternative I see is by incorporating it into the teardown of each accelerator (that overrides the teardown), but that makes it less obvious of the importance. However the test made me release the logic within the TPU accelerator doesn't parallelise teardown/any further communication. I'll make changes to the branch and let me know what you think @ananthsub

@SeanNaren SeanNaren force-pushed the bug/4319-ddp-checkpoint branch 2 times, most recently from 7f1a01a to 473a3c5 Compare October 24, 2020 10:20
@Lightning-AI Lightning-AI deleted a comment from pep8speaks Oct 24, 2020
@SeanNaren SeanNaren force-pushed the bug/4319-ddp-checkpoint branch from 02c4616 to 043e06c Compare October 24, 2020 14:06
@codecov
Copy link

codecov bot commented Oct 24, 2020

Codecov Report

Merging #4323 into master will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4323   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         111     111           
  Lines        8024    8021    -3     
======================================
- Hits         7462    7459    -3     
  Misses        562     562           

Copy link
Contributor

@ananthsub ananthsub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much simpler!

@awaelchli awaelchli added checkpointing Related to checkpointing distributed Generic distributed-related topic labels Oct 24, 2020
Copy link
Contributor

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice
I can confirm for dpp the issue on master and this PR fixes it.

@mergify mergify bot requested a review from a team October 24, 2020 17:03
@SeanNaren SeanNaren added this to the 1.0.x milestone Oct 24, 2020
@williamFalcon
Copy link
Contributor

wow, yes, very elegant fix.

I missed the question about whether we need to do a barrier on each accelerator. The answer to that is yes. Anytime we need a barrier we should call accelerator.barrier(). In accelerators without a barrier (like CPU), that's a no-op anyhow.

@williamFalcon williamFalcon merged commit 5641b26 into master Oct 24, 2020
@williamFalcon
Copy link
Contributor

Thanks for the quick fix @SeanNaren and the great review @ananthsub !

@SeanNaren SeanNaren deleted the bug/4319-ddp-checkpoint branch October 24, 2020 21:02
@edenlightning edenlightning modified the milestones: 1.0.x, 1.1 Oct 24, 2020
@ananthsub
Copy link
Contributor

this was a fun one :) It's pretty amazing how much restructuring there's been as a result of #3545 , especially for DDP. awesome fix @SeanNaren !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working checkpointing Related to checkpointing distributed Generic distributed-related topic
Projects
None yet
Development

Successfully merging this pull request may close these issues.

After DDP train processes have different best val paths
6 participants