Fix segfault, remove nonsymmetric ginkgo solver #548
Conversation
Force-pushed from b61e131 to d162cea
PNNL Pipelines seem to be giving a false negative here. Tests failed initially, but are passing upon a retry. I would suggest adding a new commit with https://github.com/LLNL/hiop/blob/develop/.gitlab-ci.yml#L159 changed to not include ... Thankfully, this doesn't require a rebuild, so this should be a quick check.
Tests are passing on ...
Building with cuda_arch 60,70,75 and 80 on all Marianas GPUs
It seems like CUDA_ARCHITECTURES cannot readily be passed from the CLI in the platform variables script into the CMake config. As a workaround, I have specified all CUDA architectures for Marianas in ...
I was able to get tests passing when running manually on the previously failing platform. I am not able to reproduce the old issue with the updated module set. I assume that since Ginkgo is built with the full array of CUDA architectures, the CUDA runtime is smart enough to pick up the right spec, despite HiOp being built with the "wrong" CUDA arch. Unless people have qualms with how I hacked this together, I think this should be good to merge. As a side note, ExaGO now has updated modules ready to go with a fresh build of the latest hiop@develop.
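As an aside on the fat-binary reasoning above, the compute capability that the CUDA runtime actually sees can be checked at run time. The following is a minimal sketch, not part of this PR; it assumes device index 0 and uses only standard CUDA runtime API calls.

```cpp
// Print the compute capability of device 0, i.e. the "cuda_arch" that the
// embedded device code in a fat binary must cover (or be JIT-compiled
// from PTX to match). Device index 0 is an assumption for simplicity.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaDeviceProp prop;
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
            cudaGetErrorString(err));
    return 1;
  }
  printf("device 0: %s, compute capability %d.%d\n",
         prop.name, prop.major, prop.minor);
  return 0;
}
```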
Thanks @cameronrutherford!
Approving; just pointing out a few points of confusion.
set(CMAKE_CUDA_ARCHITECTURES 60 70 75 80 CACHE STRING "")
message(STATUS "Setting default cuda architecture to ${CMAKE_CUDA_ARCHITECTURES}")
This doesn't actually seem necessary to resolve the issue. It seems like it is sufficient to just have the updated Ginkgo module. Not sure if we want to keep this.
I don't think there is anything wrong with this. It will just build a binary that can run on different NVIDIA cards.
# ginkgo@1.5.0.glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60,70,75,80 arch=linux-centos7-zen2
module load ginkgo-1.5.0.glu_experimental-gcc-10.2.0-x73b7k3
Example of just the updated Ginkgo module
This is not a big issue. If you want to distribute a HiOp binary, for example, this is exactly how you would build it.
All tests are passing in CI.
Spack builds are passing in GitHub Actions!
This PR should fix issues #540 and #542. It also removes the nonsymmetric ginkgo solver as previously suggested by @pelesh.
The problems were a missing host-side copy of the RHS vector for #540, and direct access to matrix and vector values located in GPU memory in update_matrix for #542. The latter is fine as long as the values are in unified memory, which they apparently are when built with Debug. I added a host-side copy of the matrix that gets updated and then moved back to the GPU instead.
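To illustrate the pattern described above, here is a minimal sketch of copying device-resident values to the host, updating them there, and copying the result back, rather than dereferencing device pointers on the host. This is not the actual HiOp/Ginkgo interface code; the buffer name, sizes, and the update function are all hypothetical.

```cpp
// Minimal sketch of the host-side-copy pattern described above.
// NOT the actual HiOp/Ginkgo code: values_dev, values_host, and
// update_values_on_host() are hypothetical stand-ins.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical update that only touches host memory, so it is safe
// regardless of whether the device allocation is unified or not.
static void update_values_on_host(std::vector<double>& values) {
  for (double& v : values) v *= 2.0;  // placeholder for the real update
}

int main() {
  const size_t n = 4;
  std::vector<double> values_host(n, 1.0);

  // Device buffer standing in for matrix/vector values living in GPU memory.
  double* values_dev = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&values_dev), n * sizeof(double));
  cudaMemcpy(values_dev, values_host.data(), n * sizeof(double),
             cudaMemcpyHostToDevice);

  // Dereferencing values_dev on the host only works if it happens to be
  // unified (managed) memory; with plain device memory it segfaults.
  // Instead: copy to the host, update there, then copy the result back.
  cudaMemcpy(values_host.data(), values_dev, n * sizeof(double),
             cudaMemcpyDeviceToHost);
  update_values_on_host(values_host);
  cudaMemcpy(values_dev, values_host.data(), n * sizeof(double),
             cudaMemcpyHostToDevice);

  printf("first value after update: %f\n", values_host[0]);
  cudaFree(values_dev);
  return 0;
}
```

With unified (managed) memory the direct host access happens to work, which is apparently why Debug builds did not crash; the explicit host-side copy makes the behavior independent of how the memory was allocated.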