Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for dist not being initialized when constructing main config #3324

Merged
merged 7 commits into from
Apr 20, 2023

Conversation

mrwyattii
Copy link
Contributor

@mrwyattii mrwyattii commented Apr 20, 2023

Fix for #3228

Also removing compatibility for torch < 1.8 in deepspeed.comm because we no longer officially support older torch versions.

deepspeed/__init__.py Outdated Show resolved Hide resolved
@stas00
Copy link
Collaborator

stas00 commented Apr 20, 2023

I confirm that this change fixes the problem. Could we please merge it asap, since other Deepspeed PRs are afflicted as well.

Thank you!

cc: @tjruwase + @jeffra to review please. and please consider a new release to undo the breakage.

p.s. The problem manifests in distributed Accelerate. Not sure how easy it'd be to write a test since Accelerate isn't easy to set up.

@mrwyattii mrwyattii marked this pull request as ready for review April 20, 2023 22:31
@mrwyattii
Copy link
Contributor Author

p.s. The problem manifests in distributed Accelerate. Not sure how easy it'd be to write a test since Accelerate isn't easy to set up.

Do you have a minimal example we could add to our tests to ensure we don't break this in the future?

@stas00
Copy link
Collaborator

stas00 commented Apr 20, 2023

As I'm not on Accelerate team, I'm just its user, let's ping @pacman100 who might be able to help with contributing such test.

But since it could take days it's far more urgent to merge your fix first.

Thank you.

deepspeed/runtime/engine.py Outdated Show resolved Hide resolved
@mrwyattii mrwyattii enabled auto-merge (squash) April 20, 2023 23:08
@mrwyattii mrwyattii disabled auto-merge April 20, 2023 23:54
@mrwyattii mrwyattii merged commit ad168a6 into master Apr 20, 2023
@stas00
Copy link
Collaborator

stas00 commented Apr 21, 2023

Thank you for the quick fix and merge, Michael!

The only remaining step is a new minor release, since many users will not know to install from master.

@jeffra
Copy link
Collaborator

jeffra commented Apr 21, 2023

Thank you for the quick fix and merge, Michael!

The only remaining step is a new minor release, since many users will not know to install from master.

@stas00 v0.9.1 is now live:
https://pypi.org/project/deepspeed/0.9.1/
https://github.com/microsoft/DeepSpeed/releases/tag/v0.9.1

@stas00
Copy link
Collaborator

stas00 commented Apr 21, 2023

Amazing! Thank you, Jeff!

@pacman100
Copy link
Contributor

As I'm not on Accelerate team, I'm just its user, let's ping @pacman100 who might be able to help with contributing such test.

Hello, as this issue only comes up when using multi-gpu/multi-node setup, all the current slow tests in accelerate are failing (see the screenshots below). I don't know if DeepSpeed runs these slow tests on their front.

Commands to run the slow tests from accelerate main folder on a node with 2 gpus

export RUN_SLOW=true
python -m pytest -s -v ./tests/deepspeed

Screenshot 2023-04-21 at 12 02 39 PM

Screenshot 2023-04-21 at 12 02 51 PM

@mrwyattii mrwyattii deleted the mrwyattii/fix-dist-init branch April 21, 2023 17:34
@conglongli conglongli added deepspeed-chat Related to DeepSpeed-Chat and removed deepspeed-chat Related to DeepSpeed-Chat labels Apr 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants