
Removing cuda 102 #6649

Merged 3 commits into pytorch:main on Sep 28, 2022
Conversation

@atalman (Contributor) commented on Sep 26, 2022:

In preparation for Release 1.13, removing CUDA 10.2.

Follow up work required after this is merged:
#6655

@malfet (Contributor) left a comment:

This looks like much more than just removing 10.2.

As we have no plans of yanking the manylinux-cuda102 images from Docker Hub, perhaps we can keep them as the defaults for CPU builds (and investigate, perhaps in separate PRs, why the CPU docker image could not be used).

Also, I thought you had already moved the cmake tests to cuda116 in another PR, hadn't you? If that has not been done yet, can it again be a separate PR, so that we get a clear signal that the move did not cause any regressions?

.circleci/config.yml.in (outdated; resolved)
.circleci/config.yml.in (outdated; resolved)
.circleci/regenerate.py (outdated; resolved)
@datumbox (Contributor) commented:
The failing prototype tests are unrelated and PyTorch Data is looking into it.

The problem at unittest_linux_gpu_py3.8 is related:

RuntimeError: The current installed version of g++ (9.3.1) is greater than the maximum required version by CUDA 10.2 (8.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=8.0.0).

Possibly related to @atalman's #6649 (comment)
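For context, the error above comes from CUDA 10.2's host-compiler gate, which rejects any g++ newer than the supported maximum. A minimal Python sketch of that comparison (illustrative only, not nvcc's actual implementation):

```python
# Minimal sketch (not nvcc's actual code) of the host-compiler check behind
# the error above: CUDA 10.2 supports g++ >=5.0.0 and <=8.0.0, so the CI
# image's g++ 9.3.1 fails. Versions are compared numerically, per component.

def parse_version(s):
    """Turn '9.3.1' into (9, 3, 1) for numeric comparison."""
    return tuple(int(part) for part in s.split("."))

def gxx_ok_for_cuda102(gxx_version, lo="5.0.0", hi="8.0.0"):
    """True if g++ falls in CUDA 10.2's supported range (>=5.0.0, <=8.0.0)."""
    v = parse_version(gxx_version)
    return parse_version(lo) <= v <= parse_version(hi)

print(gxx_ok_for_cuda102("9.3.1"))  # -> False (the failing CI case)
print(gxx_ok_for_cuda102("7.5.0"))  # -> True
```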

@atalman (Contributor, Author) commented on Sep 26, 2022:

> The failing prototype tests are unrelated and PyTorch Data is looking into it.
>
> The problem at unittest_linux_gpu_py3.8 is related:
>
> RuntimeError: The current installed version of g++ (9.3.1) is greater than the maximum required version by CUDA 10.2 (8.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=8.0.0).
>
> Possibly related to @atalman's #6649 (comment)

Yes, it looks like the CU_VERSION env var for unittest_linux_gpu_py3.8 is for some reason set to cu102. Trying to find out why now.
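For reference, the CI scripts derive the conda package spec from the CU_VERSION variable; a hypothetical sketch of that mapping (the helper name and spec strings are illustrative, not the actual torchvision CI code):

```python
# Hypothetical sketch of mapping the CU_VERSION env var (e.g. "cu116",
# "cu102", "cpu") to the spec that install.sh passes to conda. The helper
# name and the spec strings are illustrative, not the real CI code.
import re

def cudatoolkit_spec(cu_version):
    if cu_version == "cpu":
        return "cpuonly"
    m = re.fullmatch(r"cu(\d+)(\d)", cu_version)
    if m is None:
        raise ValueError(f"unrecognized CU_VERSION: {cu_version!r}")
    # Last digit is the minor CUDA version, the rest is the major version.
    return f"pytorch-cuda={m.group(1)}.{m.group(2)}"

print(cudatoolkit_spec("cu116"))  # -> pytorch-cuda=11.6
print(cudatoolkit_spec("cu102"))  # -> pytorch-cuda=10.2
```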

@atalman atalman requested a review from malfet September 26, 2022 22:16
.circleci/unittest/linux/scripts/install.sh (outdated; resolved)
@@ -37,7 +33,8 @@ printf "Installing PyTorch with %s\n" "${cudatoolkit}"
 if [ "${os}" == "MacOSX" ]; then
     conda install -y -c "pytorch-${UPLOAD_CHANNEL}" "pytorch-${UPLOAD_CHANNEL}"::pytorch "${cudatoolkit}"
 else
-    conda install -y -c "pytorch-${UPLOAD_CHANNEL}" -c nvidia "pytorch-${UPLOAD_CHANNEL}"::pytorch[build="*${version}*"] "${cudatoolkit}"
+    printf "conda install -y pytorch ${cudatoolkit} -c pytorch-${UPLOAD_CHANNEL} -c nvidia"
+    conda install -y pytorch "${cudatoolkit}" -c "pytorch-${UPLOAD_CHANNEL}" -c nvidia
@malfet (Contributor) commented:
This does not seem to be equivalent to the previous statement (i.e. it removes the exact pytorch version constraint).

@atalman (Contributor, Author) commented on Sep 27, 2022:

Honestly, I don't think we need it here. I think we are better off sticking with the simpler solution, the same as the get started page; it would be easier to maintain. ${version} here is just the CUDA version.

@datumbox (Contributor) commented:
Same concern here. @atalman are you saying you tested and it's unnecessary?

@atalman (Contributor, Author) commented:
Yes, I tested it. The old install command would yield:

conda install -y -c pytorch-nightly -c nvidia -c pytorch-nightly::pytorch[build="*cu117*"] pytorch-cuda=11.7

new one is the same as in our get started page:

conda install pytorch pytorch-cuda=11.7 -c pytorch-nightly -c nvidia
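For context on what the dropped build constraint did: conda matches a spec's build field as a glob against each candidate package's build string, so pytorch[build="*cu117*"] restricts the solver to CUDA builds. A rough sketch of that filtering (not conda's internals; the build strings below are hypothetical):

```python
# Rough sketch (not conda's internals) of build-string matching: a spec like
# pytorch[build="*cu117*"] only admits packages whose build string matches
# the glob, excluding CPU builds. Build strings here are hypothetical.
from fnmatch import fnmatch

candidates = ["py3.8_cpu_0", "py3.8_cu117_cudnn8_0"]

def admitted(build_glob, builds):
    """Return the candidate build strings matching the spec's build glob."""
    return [b for b in builds if fnmatch(b, build_glob)]

print(admitted("*cu117*", candidates))  # -> ['py3.8_cu117_cudnn8_0']
print(admitted("*", candidates))        # without the constraint, both qualify
```

This is the crux of the disagreement below: without the build glob, the solver is free to pick a CPU build.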

@malfet (Contributor) commented:

I would insist that we keep the old command (you can submit a BE change separately). The new command opens us up to the regression we saw in torchaudio yesterday, when the CPU version was installed instead of the CUDA one.

@atalman (Contributor, Author) commented:
This is a unit test; if it fails, it means we have a real problem with the pytorch binary.

.circleci/regenerate.py (resolved)
@datumbox (Contributor) left a comment:

From my side, with my limited knowledge of the build scripts, it looks good. Just a question below.

I think we could consider merging it after #6660 and after @malfet unblocks. We should aim for all CI to be green prior to merging, to avoid accidental breakages.


@malfet (Contributor) left a comment:

LGTM

@datumbox (Contributor) commented:
@atalman @malfet Thanks for the changes.

Note that we have a couple of issues with flaky tests these days. A couple of models are throwing OOM errors on GPU. We suspect it's related to CircleCI, and it comes and goes. If you see such an issue, you can safely ignore it. We are monitoring how often and how regularly it breaks, and we will decide whether to disable the tests if necessary.

Commits:

- Display cuda info
- Address comments; try to resolve CUDA version issue
- More work
- Base debugging
- Fix cuda version passing
- Testing
- Adding config.yml
- Adding command we use for pytorch vision install
- Adding unit tests
- Modify install command
- Refactor config.in
- Move cpu tests to different PR
- Remove debug code
- Testing similar exception for linux as windows
- update test_models.py
- Revert "Testing similar exception for linux as windows" (reverts commit 4aaee0b)
- Revert "update test_models.py" (reverts commit 413651a)
@atalman atalman merged commit dc07ac2 into pytorch:main Sep 28, 2022
@github-actions (bot) commented:
Hey @atalman!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@datumbox added the "topic: build" and "other if you have no clue or if you will manually handle the PR in the release notes" labels on Sep 28, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 7, 2022
Summary:
* Removing cuda 102

Display cuda info

Address comments try to resolve CUDA version issue

More work

Base debugging

Fix cuda version passing

Testing

Adding config.yml

Adding command we use for pytorch vision install

Adding unit tests

Modify install command

Refactor config.in

Move cpu tests to different PR

Remove debug code

Testing similar exception for linux as windows

update test_models.py

Revert "Testing similar exception for linux as windows"

This reverts commit 4aaee0b.

Revert "update test_models.py"

This reverts commit 413651a.

* Removing debug statement

* Reverting to old command

Reviewed By: datumbox

Differential Revision: D40138739

fbshipit-source-id: 8ebe2de9a5fedb0da906825929599e44e9cb0207
Labels: cla signed, other if you have no clue or if you will manually handle the PR in the release notes, topic: build