[SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension #7397
Conversation
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
I realized that all the signed integer clang __ldg builtins use the same intrinsic as the unsigned integer types: this means that the existing upstream signed integer clang __ldg builtins lower to the wrong PTX instruction. I checked that the PTX instructions are different (and correct) for the signed integer cases in the CUDA runtime. Since a really important use case for __ldg is the double type, and we don't need to delay adding this until the fixes described above land, I have made the initial extension support only float/double types. @zjin-lcf do you have any feedback for this PR?
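To make the scope concrete, here is a minimal sketch (all names hypothetical, not the actual extension header) of how an ldg-style wrapper could be restricted to float/double only, mirroring the initial extension scope, with a plain dereference fallback on non-CUDA builds:

```cpp
#include <type_traits>

// Hypothetical sketch: constrain an ldg-style wrapper to float/double via
// SFINAE. On a CUDA device compilation this would forward to the __ldg
// clang builtin; on any other build it degrades to an ordinary load.
template <typename T,
          typename std::enable_if<std::is_same<T, float>::value ||
                                      std::is_same<T, double>::value,
                                  int>::type = 0>
T ldg(const T *ptr) {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  return __ldg(ptr); // read via the texture cache path on CUDA devices
#else
  return *ptr; // host / non-CUDA fallback: plain global load
#endif
}
```

With this shape, `ldg(&some_int)` fails to compile, which keeps the signed-integer builtin problem out of the extension until it is fixed upstream.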
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
I read your well-written doc, and made the author who posted the issue in hipSYCL aware of it. Thanks
Thanks. It looks like at the moment in SYCL it is always needed: I'm guessing the way kernels are currently submitted in SYCL makes it hard for the compiler to know it can apply the optimization without it being labelled explicitly via the __ldg instruction.
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
When users in universities, labs, and companies migrate CUDA programs to SYCL, optional usage of
This is exactly correct; I cover this explicitly in this "Important" note: https://github.com/intel/llvm/pull/7397/files#diff-a5636eb0545d0b578041e503b2f07470466c766cb3d599b2c8233c84e8ed393aR109 Do you think it is clear?
I can explicitly state that the .nc instruction will not be used, if you think that is clearer?
I am not clear how a write to a value stored in a register would affect a read from the cache. In other words, I suppose the compiler would still generate ld.global.nc. Does a write to a register cause some incoherence?
I've checked this by examining the PTX generated in this case, and the compiler does not still generate ld.global.nc.
Okay. I will try to look at the PTX code for CUDA programs more carefully. Thanks.
OK, it would be useful to know to what degree it works on CUDA. Note that we do still have another internal issue to investigate improving the ability to use
/verify with intel/llvm-test-suite#1417
Yeah, this could be a good idea. I think we will merge this now as it is and consider supporting the accessor case as a future feature.
@intel/llvm-reviewers-runtime could we get a review for this please?
//CHECK-OPAQUE: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0(ptr %23, i32 4)
auto cached_c4 = ldg(&in_c4[0]);
//CHECK: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* %{{.*}}, i32 8)
//CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %24, i32 8)
Could you please change the hard-coded pointer values (for example `%24`) to the regex (`%{{.*}}`)? The hard-coded values will break if we modify the test.
Thanks, I've corrected this now. I've marked level_zero and opencl as unsupported for now; I can't reproduce the Jenkins failure using my OpenCL CPU.
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
/verify with intel/llvm-test-suite#1417
Hi, @JackAKirk
Hi. See this documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability-of-the-model
Do you recommend adding conditional compilation (e.g. `#ifdef CUDA`) around the use of `ldg()` in a SYCL program? Thanks.
@zjin-lcf, I'll let @JackAKirk confirm it, but I'd think it's not required; it should be transparent from the user's perspective. For non-CUDA targets it just returns the value at the pointer, see here: https://github.com/intel/llvm/pull/7397/files#diff-d7b46fbdc024084037be6515a3b58667036c6aaeeae19499ea6763b614f9bd8fR211
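To illustrate why no `#ifdef` is needed at the call site, here is a hedged sketch (the `ldg` stand-in below is hypothetical, not the extension's actual implementation) showing portable user code: the extension itself supplies the non-CUDA fallback, so the same loop body compiles everywhere.

```cpp
// Hypothetical stand-in for sycl::ext::oneapi::experimental::cuda::ldg:
// on non-CUDA targets the extension simply dereferences the pointer,
// so user code needs no conditional compilation around the call.
template <typename T> T ldg(const T *ptr) { return *ptr; }

// A kernel-style loop body written once, portable across backends.
double sum_readonly(const double *data, int n) {
  double acc = 0.0;
  for (int i = 0; i < n; ++i)
    acc += ldg(&data[i]); // texture-cache hint on CUDA, plain load elsewhere
  return acc;
}
```

On a CUDA device build the same call would lower to the `__ldg` builtin; the user's source is unchanged either way.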
@jchlanda is right, and note that HIP does the same thing; see, for example, the HIP documentation on this: https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/hip_porting_guide.md#textures-and-cache-control Note that since Volta the texture cache and shared memory physically share the same unit. This is interesting because it suggests that texture memory is essentially reduced to a particular caching strategy (plus the read-only condition) compared to shared memory. It still appears that there can be a significant advantage to using the texture cache post-Volta (see e.g. #8050, although in that forward-prop example I don't understand why it wouldn't be better to use CUDA's static constant cache; still, there is a lot of usage of __ldg in PyTorch, for example, and the fact that the texture cache apparently improves the performance of #8836 a lot makes sense), although it would be interesting to investigate how the performance of the texture cache differs from equivalent shared memory usage in such cases. Thanks
@jchlanda @JackAKirk After reading the code in your link, I understand the implementation. SYCL might add
…_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU
… buffers to USM and update memory accesses in the map kernel with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU
I find that char3/uchar3 are not included.
…educe global memory accesses explicitly in the kernels; improve SYCL kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU
…the kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU; fix warnings for newer compiler versions
Exposes the __ldg* clang builtins to SYCL as a CUDA-only extension via a new function, sycl::ext::oneapi::experimental::cuda::ldg. This feature does not translate to HIP AMD (HIP introduces the caching function as a no-op for AMD backends). AFAIK it doesn't translate to anything in the current Level Zero spec.
This extension allows GPGPU applications to make use of the texture cache. This is notably used in Molecular Dynamics, as in LAMMPS (https://github.com/kokkos/kokkos/blob/61d7db55fceac3318c987a291f77b844fd94c165/core/src/Cuda/Kokkos_Cuda_View.hpp) and HOOMD-blue (see https://github.com/glotzerlab/hoomd-blue/pull/406/files for a good synopsis of how MD can make full use of this feature).
More generally, see the extension document for when usage of ldg is advantageous. It is also used in PyTorch: pytorch/pytorch#19165
This PR also resolves #7232