Fix for dist not being initialized when constructing main config #3324
Conversation
I confirm that this change fixes the problem. Could we please merge it ASAP, since other DeepSpeed PRs are afflicted as well? Thank you! cc: @tjruwase + @jeffra to review, please, and please consider a new release to undo the breakage. P.S. The problem manifests in distributed Accelerate. Not sure how easy it'd be to write a test, since Accelerate isn't easy to set up.
Do you have a minimal example we could add to our tests to ensure we don't break this in the future?
As I'm not on the Accelerate team, I'm just its user, let me ping @pacman100, who might be able to help with contributing such a test. But since that could take days, it's far more urgent to merge your fix first. Thank you.
Thank you for the quick fix and merge, Michael! The only remaining step is a new minor release, since many users will not know to install from master.
@stas00 v0.9.1 is now live.
Amazing! Thank you, Jeff!
Hello, since this issue only comes up in a multi-GPU/multi-node setup, all the current slow tests in Accelerate are failing (see the screenshots below). I don't know if DeepSpeed runs these slow tests on their end. Commands to run the slow tests from the Accelerate main folder on a node with 2 GPUs:
Fix for #3228.
Also removes compatibility code for torch < 1.8 in deepspeed.comm, because we no longer officially support older torch versions.