-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix disabling progress bar on non-zero ranks using Horovod backend #1709
Fix disabling progress bar on non-zero ranks using Horovod backend #1709
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1709 +/- ##
======================================
- Coverage 88% 88% -0%
======================================
Files 69 69
Lines 4135 4151 +16
======================================
+ Hits 3659 3670 +11
- Misses 476 481 +5 |
I have seen this error also in other PRs
|
This is the same caching error we've seen before. Basically, something is causing this pattern:
If |
Thx, but how it is possible that the failing is kind of random... |
My guess is that the presence or absence of elements in the cache can vary based on a number of different factors (like which host the test lands on, or what sequence different tests were run), so there will be times when we see a cache hit (which will cause the problem) and a other times a cache miss (which will avoid the problem). |
@tgaddair just rerun the tests and pass without any change so really curious what is happening with the cache... |
f372275
to
482006f
Compare
Hey @Borda, I think that PR will fix it most of the time, but there are still a couple cases where it could break:
In either of those cases, the cache will miss on PyTorch but hit on Horovod, which will cause the issue. I put together a quick change that should allow Horovod to use the cache optimistically, and only reinstall if it has to. |
This pull request is now in conflict... :( |
Hey @Borda @williamFalcon, flakiness in the GitHub CI tests should be reduced with this change. Ready to land whenever you're ready. |
do we need a better general solution? i thought we already made it so that prog bar only showed on rank 0? |
Currently it seems that every framework will independently call |
@awaelchli @hadim any idea? ^^ |
I planned to add the |
This PR also adds a barrier when restoring model weights, consistent with other distributed backends.