
CI: update to ROCm 6.0.2 and test MI300 #30266

Merged (22 commits) on May 13, 2024

Conversation

fxmarty
Contributor

@fxmarty fxmarty commented Apr 16, 2024

As per title

A few tests are failing due to torch<2.2 and will be fixed once we bump to torch 2.3 + ROCm 6.0:

FAILED tests/test_pipeline_mixin.py::FillMaskPipelineTests::test_fp16_casting - RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
FAILED tests/pipelines/test_pipelines_fill_mask.py::FillMaskPipelineTests::test_fp16_casting - RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_1_cpu - RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_cuda_kernels_tiny_1_cpu - RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
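
For context, here is a minimal, hedged sketch (not taken from the PR or the test suite) of what these failures boil down to: on the affected torch versions, the CPU softmax and addmm kernels have no float16 ("Half") implementation, so fp16 tensors on CPU raise the RuntimeErrors shown above.

```python
# Minimal sketch, not from the PR: on torch versions without CPU half-precision
# kernels, these ops raise the RuntimeErrors listed above; on newer torch they pass.
import torch

x = torch.randn(4, 8, dtype=torch.float16)      # fp16 tensors on CPU
w = torch.randn(8, 8, dtype=torch.float16)
bias = torch.zeros(4, 8, dtype=torch.float16)

checks = [
    ("softmax", lambda: torch.softmax(x, dim=-1)),  # "softmax_lastdim_kernel_impl" not implemented for 'Half'
    ("addmm", lambda: torch.addmm(bias, x, w)),     # "addmm_impl_cpu_" not implemented for 'Half'
]
for name, op in checks:
    try:
        op()
        print(f"{name}: ok (this torch build has CPU half kernels)")
    except RuntimeError as err:
        print(f"{name}: {err}")
```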

This PR first requires a self-hosted runner on MI300.

Report callbacks are skipped in the trainer tests, as codecarbon is not supported on ROCm and report callbacks are tested independently anyway (in test_trainer_callback.py).
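
As an illustration only (a generic sketch, not necessarily the mechanism used in this PR), report callbacks can be kept out of a Trainer run via the TrainingArguments.report_to option:

```python
# Generic sketch: disabling all reporting integrations (codecarbon, wandb, ...)
# for a Trainer run; not necessarily how the trainer tests are patched in this PR.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",   # hypothetical output directory for the sketch
    report_to="none",   # attach no report callbacks
)
```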

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fxmarty fxmarty requested review from ydshieh and glegendre01 April 17, 2024 08:39
@ydshieh ydshieh self-assigned this Apr 22, 2024
@mht-sharma
Contributor

> (quoting the PR description above)

With torch==2.3 RC, 3 of the 4 tests above pass:

tests/test_pipeline_mixin.py::FillMaskPipelineTests::test_fp16_casting 
tests/pipelines/test_pipelines_fill_mask.py::FillMaskPipelineTests::test_fp16_casting
tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_cuda_kernels_tiny_1_cpu

The remaining test still fails with:

FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_1_cpu - AssertionError: Tensor-likes are not close!
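
For reference, a small illustrative sketch (made-up values, not from the mamba test) of where this message comes from: torch.testing.assert_close raises "Tensor-likes are not close!" when tensors differ beyond the tolerance, i.e. a numerical drift rather than a missing kernel.

```python
# Made-up values illustrating the "Tensor-likes are not close!" failure mode:
# outputs that drift beyond rtol/atol trip torch.testing.assert_close.
import torch

expected = torch.tensor([0.1000, 0.2000, 0.3000])
actual = torch.tensor([0.1000, 0.2000, 0.3010])  # small numerical drift

torch.testing.assert_close(actual, expected)  # AssertionError: Tensor-likes are not close!
```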

@ydshieh
Collaborator

ydshieh commented Apr 24, 2024

I will review tomorrow.

@fxmarty Do we already have an MI300 runner?

Also, could you trigger the workflow that builds the image whose Dockerfile is modified in this PR?

(Don't forget to comment out the other jobs in the docker image build workflow.)

Collaborator

@ydshieh ydshieh left a comment

Thank you @fxmarty for this PR. LGTM.

I will resolve the conflict later today or tomorrow.

For transparency: the (nvidia) daily CI workflow files have had many changes in the past few months, and I haven't applied the same changes to the AMD workflow files.

I will do that next week! But this PR is fine to merge; I just want to confirm that the docker image can be built.

@ydshieh
Collaborator

ydshieh commented Apr 25, 2024

Conflicts resolved.

@ydshieh
Collaborator

ydshieh commented May 2, 2024

Hi @mht-sharma

Is there anything I should wait for before merging this PR?


jobs:
  build-docker-containers:
    # TODO: remove this
Collaborator

@fxmarty Before I merge, could you remove this TODO and clean up any other places that should be removed, if any?

Thanks!

Collaborator

gently ping @fxmarty 🙏

@mht-sharma
Contributor

mht-sharma commented May 10, 2024

> Hi @mht-sharma
>
> Is there anything I should wait for before merging this PR?

As discussed on Slack, all good from my side!

@ydshieh
Collaborator

ydshieh commented May 13, 2024

I have removed the "TODO: remove this" comment.

I will merge the PR as it currently is. The AMD workflow files will, however, need some updates to match the recent changes to the nvidia daily CI workflow files.

@ydshieh
Collaborator

ydshieh commented May 13, 2024

Thank you for all the work you have done @fxmarty and @mht-sharma

@ydshieh ydshieh merged commit 37bba2a into main May 13, 2024
21 checks passed
@ydshieh ydshieh deleted the mi300-ci branch May 13, 2024 16:14