
CI: update to ROCm 6.0.2 and test MI300 #30266

Merged (22 commits) on May 13, 2024

Conversation

fxmarty
Contributor

@fxmarty fxmarty commented Apr 16, 2024

As per title

A few tests are failing due to torch<2.2 and will be fixed once we bump to torch 2.3 + ROCm 6.0:

FAILED tests/test_pipeline_mixin.py::FillMaskPipelineTests::test_fp16_casting - RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
FAILED tests/pipelines/test_pipelines_fill_mask.py::FillMaskPipelineTests::test_fp16_casting - RuntimeError: "softmax_lastdim_kernel_impl" not implemented for 'Half'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_1_cpu - RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_cuda_kernels_tiny_1_cpu - RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
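
For context, here is a minimal, hedged sketch (not taken from the PR or the test suite) of what these failures boil down to: on the affected torch versions, the CPU softmax and addmm kernels have no float16 ("Half") implementation, so fp16 tensors on CPU raise the RuntimeErrors shown above.

```python
# Minimal sketch, not from the PR: on torch versions without CPU half-precision
# kernels, these ops raise the RuntimeErrors listed above; on newer torch they pass.
import torch

x = torch.randn(4, 8, dtype=torch.float16)      # fp16 tensors on CPU
w = torch.randn(8, 8, dtype=torch.float16)
bias = torch.zeros(4, 8, dtype=torch.float16)

checks = [
    ("softmax", lambda: torch.softmax(x, dim=-1)),  # "softmax_lastdim_kernel_impl" not implemented for 'Half'
    ("addmm", lambda: torch.addmm(bias, x, w)),     # "addmm_impl_cpu_" not implemented for 'Half'
]
for name, op in checks:
    try:
        op()
        print(f"{name}: ok (this torch build has CPU half kernels)")
    except RuntimeError as err:
        print(f"{name}: {err}")
```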

This PR first requires a self-hosted runner on MI300.

Report callbacks are skipped in the trainer tests, as codecarbon is not supported on ROCm and report callbacks are tested independently anyway (in test_trainer_callback.py).
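
As an illustration only (a generic sketch, not necessarily the mechanism used in this PR), report callbacks can be kept out of a Trainer run via the TrainingArguments.report_to option:

```python
# Generic sketch: disabling all reporting integrations (codecarbon, wandb, ...)
# for a Trainer run; not necessarily how the trainer tests are patched in this PR.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",   # hypothetical output directory for the sketch
    report_to="none",   # attach no report callbacks
)
```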

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fxmarty fxmarty requested review from ydshieh and glegendre01 April 17, 2024 08:39
@ydshieh ydshieh self-assigned this Apr 22, 2024
@mht-sharma
Contributor

> (quoting the PR description above)

With torch==2.3 RC, 3 of the 4 tests above pass:

tests/test_pipeline_mixin.py::FillMaskPipelineTests::test_fp16_casting 
tests/pipelines/test_pipelines_fill_mask.py::FillMaskPipelineTests::test_fp16_casting
tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_cuda_kernels_tiny_1_cpu

The remaining test still fails with:

FAILED tests/models/mamba/test_modeling_mamba.py::MambaIntegrationTests::test_simple_generate_1_cpu - AssertionError: Tensor-likes are not close!
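
For reference, a small illustrative sketch (made-up values, not from the mamba test) of where this message comes from: torch.testing.assert_close raises "Tensor-likes are not close!" when tensors differ beyond the tolerance, i.e. a numerical drift rather than a missing kernel.

```python
# Made-up values illustrating the "Tensor-likes are not close!" failure mode:
# outputs that drift beyond rtol/atol trip torch.testing.assert_close.
import torch

expected = torch.tensor([0.1000, 0.2000, 0.3000])
actual = torch.tensor([0.1000, 0.2000, 0.3010])  # small numerical drift

torch.testing.assert_close(actual, expected)  # AssertionError: Tensor-likes are not close!
```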

@ydshieh
Collaborator

ydshieh commented Apr 24, 2024

I will review tomorrow.

@fxmarty Do we already have an MI300 runner?

Also, could you trigger the workflow that builds the image whose Dockerfile is modified in this PR?

(Don't forget to comment out the other jobs in the docker image build workflow.)

Collaborator

@ydshieh ydshieh left a comment

Thank you @fxmarty for this PR. LGTM.

I will resolve the conflict later today or tomorrow.

For transparency: the (nvidia) daily CI workflow files have had many changes in the past few months, and I haven't applied the same changes to the AMD workflow files.

I will do that next week! But this PR is fine to merge; I just want to confirm that the docker image can be built.

@ydshieh
Collaborator

ydshieh commented Apr 25, 2024

Conflicts resolved.

@ydshieh
Collaborator

ydshieh commented May 2, 2024

Hi @mht-sharma

Is there anything I should wait for before merging this PR?


jobs:
  build-docker-containers:
    # TODO: remove this
Collaborator

@fxmarty Before I merge, could you remove this TODO and clean up any other places that should be removed, if any?

Thanks!

Collaborator

gently ping @fxmarty 🙏

@mht-sharma
Contributor

mht-sharma commented May 10, 2024

> Hi @mht-sharma
>
> Is there anything I should wait for before merging this PR?

As discussed on Slack, all good from my side!

@ydshieh
Collaborator

ydshieh commented May 13, 2024

I have removed the "TODO: remove this" comment.

I will merge the PR as it currently is. The AMD workflow files will, however, need some updates to match the recent changes to the nvidia daily CI workflow files.

@ydshieh
Collaborator

ydshieh commented May 13, 2024

Thank you for all the work you have done @fxmarty and @mht-sharma

@ydshieh ydshieh merged commit 37bba2a into main May 13, 2024
21 checks passed
@ydshieh ydshieh deleted the mi300-ci branch May 13, 2024 16:14