CI: update to ROCm 6.0.2 and test MI300 #30266
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
With torch==2.3 RC, of the 4 tests above, the following passed:
The third test failed with:
I will review tomorrow. @fxmarty Do we already have an MI300 runner? Also, could you trigger the workflow that builds the image whose Dockerfile is modified in this PR? (Don't forget to comment out the other jobs in the docker image build workflow.)
Thank you @fxmarty for this PR. LGTM.
I will resolve the conflict later today or tomorrow.
For transparency, the (nvidia) daily CI workflow files have changed a lot in the past few months, and I haven't applied the same changes to the AMD workflow files.
I will do that next week! But this PR is fine to be merged. I just want to know if the docker image could be built.
conflicts resolved
Hi @mht-sharma, is there anything I should wait for before I merge this PR?
jobs:
  build-docker-containers:
    # TODO: remove this
@fxmarty Before I merge, could you remove this, along with anything else that should be removed?
Thanks!
Gentle ping @fxmarty 🙏
As discussed on Slack, all good from my side!
I have removed the [...]. I will merge the PR as it currently is. They will need some updates, however, to match the recent changes to the nvidia daily CI workflow files.
Thank you for all the work you have done @fxmarty and @mht-sharma
As per title
A few tests are failing due to torch<2.2 and will be fixed once we bump to torch 2.3 + rocm6.0.
This PR requires a self-hosted runner on MI300 first.
Report callbacks are skipped in the trainer tests, as codecarbon is not supported on ROCm and report callbacks are tested independently anyway (in test_trainer_callback.py).
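
For readers wondering what skipping report callbacks on ROCm could look like, here is a minimal sketch. It is not taken from this PR's diff: `detect_hip_version` and `report_to_for_tests` are hypothetical helper names, and the sketch assumes ROCm builds of PyTorch expose a non-None `torch.version.hip` and that the returned value would be passed as `report_to` to `TrainingArguments`.

```python
def detect_hip_version():
    """Return the HIP version string of the installed PyTorch build, or None.

    Assumption: ROCm builds of PyTorch set torch.version.hip to a string,
    while CUDA and CPU builds leave it as None.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no ROCm build to detect.
        return None
    return getattr(torch.version, "hip", None)


def report_to_for_tests(hip_version):
    """Hypothetical helper: choose a `report_to` value for trainer tests.

    On ROCm, return "none" so that no reporting integration (codecarbon
    included) is attached; elsewhere, keep "all".
    """
    return "none" if hip_version is not None else "all"


if __name__ == "__main__":
    print(report_to_for_tests(detect_hip_version()))
```

The result would then be used as, for example, `TrainingArguments(output_dir=..., report_to=report_to_for_tests(detect_hip_version()))`. The PR itself may implement the skip differently.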