
[SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension #7397

Merged: 30 commits merged into intel:sycl on Mar 9, 2023

Conversation

@JackAKirk (Contributor) commented Nov 15, 2022

Exposes the __ldg* clang builtins to SYCL as a CUDA-only extension via a new function, "sycl::ext::oneapi::experimental::cuda::ldg". This feature does not translate to HIP AMD (HIP introduces the caching function as a no-op for AMD backends). AFAIK it does not translate to anything in the current Level Zero spec.

This extension allows GPGPU applications to make use of the texture cache. This is notably used in Molecular Dynamics, as in LAMMPS (https://github.com/kokkos/kokkos/blob/61d7db55fceac3318c987a291f77b844fd94c165/core/src/Cuda/Kokkos_Cuda_View.hpp) and HOOMD-BLUE (see https://github.com/glotzerlab/hoomd-blue/pull/406/files for a good synopsis of how MD can make full use of this feature).

More generally, see the extension document for when usage of "ldg" is advantageous. It is also used in PyTorch: pytorch/pytorch#19165

This PR also resolves #7232
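
For context, a minimal usage sketch (the queue setup, kernel shape, and variable names here are illustrative, not taken from the PR):

    #include <sycl/sycl.hpp>

    namespace exp_cuda = sycl::ext::oneapi::experimental::cuda;

    void scale(sycl::queue &q, const double *in, double *out, size_t n) {
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        // On the CUDA backend this read may be served through the texture
        // cache (ld.global.nc); on other backends it is a plain load.
        out[i] = 2.0 * exp_cuda::ldg(&in[i]);
      }).wait();
    }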

JackAKirk added 3 commits November 15, 2022 15:58
Allows GPGPU applications to make use of the texture cache.
JackAKirk added 3 commits November 15, 2022 17:41
@JackAKirk JackAKirk changed the title [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_cache_read extension [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension Nov 24, 2022
JackAKirk added 2 commits November 24, 2022 11:29
@JackAKirk (Contributor, Author) commented Nov 24, 2022

I realized that the clang ldg builtins for all signed integer types use the same intrinsic as the unsigned integer types: this means the existing upstream signed integer clang ldg builtins lead to the wrong PTX instruction. I checked that the PTX instructions are different (and correct) in the CUDA runtime for the signed integer cases.
Also, the bfloat16 and half cases and their vec2 variants need to be able to use the *.nc.b16 and *.nc.b32 instructions respectively; codegen does not currently consider these cases.
So fully supporting the signed integer, bfloat16, and half cases requires upstream patches.

Since a really important use case for __ldg is the double type, and we do not need to delay adding this until the fixes described above land, I have made the initial extension support only float/double types.

@zjin-lcf do you have any feedback for this PR?

@JackAKirk JackAKirk marked this pull request as ready for review November 24, 2022 13:02
@JackAKirk JackAKirk requested a review from jchlanda November 24, 2022 13:03
@zjin-lcf (Contributor) commented:

I read your well-written doc and made the author who posted the hipSYCL issue aware of it.
I have a question: when a kernel argument is const double *__restrict p, could the usage of __ldg be optional?

Thanks

@JackAKirk (Contributor, Author) commented Nov 24, 2022


It looks like at the moment in SYCL it is always needed: I'm guessing the way kernels are currently submitted in SYCL makes it hard for the compiler to know it can do the optimization without it being labelled explicitly via __ldg.
We could work on improving this, but I don't think it is high priority, so it would not be worked on any time soon. Note that even though it is sometimes possible for nvcc to use the .nc instruction without __ldg, the CUDA documentation only mentions using the const and __restrict__ qualifiers as an aid in addition to calling __ldg. I don't see why we would ever recommend that users take the chance and hope it will work without calling __ldg, even if the compiler might sometimes be smart enough to use the CUDA texture cache without the explicit instruction.

JackAKirk added 3 commits November 24, 2022 14:44
@zjin-lcf (Contributor) commented:

When users in universities, labs, and companies migrate CUDA programs to SYCL, optional usage of __ldg may reduce the migration effort. I understand your explanations.
I have a question about "cacheA" in the doc. After a value is read using __ldg in a kernel, I suppose that the value ("cacheA") is stored in a register. Will any writes to it cause the compiler not to generate "ld.global.nc"?

auto cacheA = __ldg(&addr[i]);

@JackAKirk (Contributor, Author) commented Nov 24, 2022

> Will any writes to it cause the compiler not to generate "ld.global.nc"?

This is exactly correct; I cover this explicitly in the "Important" note here: https://github.com/intel/llvm/pull/7397/files#diff-a5636eb0545d0b578041e503b2f07470466c766cb3d599b2c8233c84e8ed393aR109

Do you think it is clear?

@JackAKirk (Contributor, Author) commented:

I can explicitly state that the .nc instruction will not be used if you think that is clearer?

@zjin-lcf (Contributor) commented:

I am not clear why a write to a value stored in a register would affect the read from a cache. In other words, I suppose that the compiler would still generate ld.global.nc. Does a write to a register cause some incoherence?

@JackAKirk (Contributor, Author) commented:


I've checked this by examining the PTX generated in this case, and the compiler does not generate ld.global.nc if the register returned from __ldg is written to. I expressed this in the "Important:" note.
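
To illustrate the two cases (a hedged sketch; addr, out, and the namespace alias are illustrative names, and the comments reflect the behaviour described above as checked by inspecting the generated PTX):

    #include <sycl/sycl.hpp>

    namespace exp_cuda = sycl::ext::oneapi::experimental::cuda;

    SYCL_EXTERNAL void ldg_cases(const double *addr, double *out, size_t i) {
      // Case 1: the returned value is only read, so the load can be
      // emitted as ld.global.nc.
      double cacheA = exp_cuda::ldg(&addr[i]);
      out[i] = 2.0 * cacheA;

      // Case 2: the returned value is written to; per the check above,
      // the compiler then no longer emits ld.global.nc for the load.
      double cacheB = exp_cuda::ldg(&addr[i]);
      cacheB += 1.0;
      out[i + 1] = cacheB;
    }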

@zjin-lcf (Contributor) commented:


> I think that if someone didn't use __ldg with nvcc, but did use const __restrict__ qualifiers, there is a very good chance that the .nc instruction wasn't used with nvcc: basically I think the pointer would have to be declared in the same scope as the read-only condition. As you see in the reported hipSYCL issue, even for the nvcc compiler it only worked without __ldg in special cases.

Okay. I will try to look at the PTX code for CUDA programs more carefully. Thanks.

@JackAKirk (Contributor, Author) commented Nov 24, 2022


OK, it would be useful to know to what degree it works on CUDA. Note that we do still have another internal issue to investigate improving the ability of the const __restrict__ qualifiers to switch on the .nc instruction, but that should not block this PR, and even if we did eventually improve it, it would have no effect on the contents of this PR. As in the CUDA runtime docs, we will not recommend that users rely on the compiler to use the texture cache without explicitly calling __ldg.
We need to choose our priorities sensibly, in order to expose the missing functionality users most commonly need ASAP.

@JackAKirk (Contributor, Author) commented:

/verify with intel/llvm-test-suite#1417

@JackAKirk (Contributor, Author) commented:

> Would it make sense to also add an overload taking a reference when the type is not a pointer? It could be nice to have this also as an accessor property (obviously, it requires more work, perhaps too much for a niche feature).

Yeah this could be a good idea. I think we will try to merge this now as it is and consider supporting the accessor case for a future feature.

@JackAKirk (Contributor, Author) commented:

@intel/llvm-reviewers-runtime could we get a review for this please?

//CHECK-OPAQUE: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0(ptr %23, i32 4)
auto cached_c4 = ldg(&in_c4[0]);
//CHECK: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* %{{.*}}, i32 8)
//CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %24, i32 8)
A contributor commented on the test:

Could you please change the hard-coded pointer values (for example %24) to the regex (%{{.*}})? The hard-coded values will break if we modify the test.
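
For example, applying the suggested regex to the last CHECK-OPAQUE line above would give:

    //CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %{{.*}}, i32 8)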

@JackAKirk (Contributor, Author) replied:

Thanks, I've corrected this now. I've marked Level Zero and OpenCL as unsupported for now; I can't reproduce the Jenkins failure using my OpenCL CPU.

@JackAKirk (Contributor, Author) commented:

/verify with intel/llvm-test-suite#1417

@bader bader merged commit 5360825 into intel:sycl Mar 9, 2023
@jinge90 (Contributor) commented Mar 9, 2023

Hi @JackAKirk,
For the CUDA __ldg* and __st* load/store intrinsics with cache hints, are these loads and stores guaranteed to be atomic?
Thanks very much.

@JackAKirk (Contributor, Author) commented Mar 9, 2023


Hi. See this documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability-of-the-model
ld.global.nc is __ldg, which is used to read through the L1 texture cache, so clearly __ldg breaks the CUDA memory model. However, the cache-hint loads __ld* (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#load-functions-using-cache-hints), including the ones that tell the compiler to opt out of using the L1 cache, and the cache-hint stores __st* (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#store-functions-using-cache-hints) do not seem to be mentioned with respect to the memory model, so I do not know the answer to your question concretely. We have not implemented them, nor investigated them in much detail yet. The only suggested usage of them we have come across so far would currently serve no purpose until we improve alias analysis in the CUDA backend.

@zjin-lcf (Contributor) commented:

@JackAKirk

Do you recommend adding a conditional compile for the use of "ldg()" in a SYCL program? Thanks.

#ifdef CUDA
V = ldg(&arr[0]);
#else
V = arr[0];
#endif

@jchlanda (Contributor) commented Mar 30, 2023


@zjin-lcf, I'll let @JackAKirk confirm it, but I'd think it's not required; it should be transparent from the user's perspective. For non-CUDA targets it just returns the value at the pointer, see here: https://github.com/intel/llvm/pull/7397/files#diff-d7b46fbdc024084037be6515a3b58667036c6aaeeae19499ea6763b614f9bd8fR211
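
A minimal sketch of how such a transparent fallback can work (a hypothetical re-implementation for illustration only; the macro checks and builtin shown are assumptions, and the actual implementation is in the header linked above):

    // Hypothetical sketch, not the extension's actual source.
    inline double ldg_sketch(const double *ptr) {
    #if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
      // CUDA device compilation: use the clang NVPTX builtin behind CUDA's
      // __ldg, which lowers to an ld.global.nc (texture cache) load.
      return __nvvm_ldg_d(ptr);
    #else
      // All other targets: a plain dereference, so callers need no #ifdef.
      return *ptr;
    #endif
    }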

@JackAKirk (Contributor, Author) commented Mar 30, 2023


@zjin-lcf

@jchlanda is right, and note that HIP does the same thing; see for example the HIP documentation on this: https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/hip_porting_guide.md#textures-and-cache-control

Note that since Volta the texture cache and shared memory physically share the same unit. This is interesting because it suggests that, compared to shared memory, texture memory is essentially reduced to a particular caching strategy (plus the read-only condition). It still appears that there can be a significant advantage to using the texture cache post-Volta: see e.g. #8050 (although in that forward-prop example I don't understand why it wouldn't be better to use CUDA's static constant cache), there is plenty of usage of __ldg in PyTorch for example, and the fact that the texture cache apparently improves the performance of #8836 a lot makes a lot of sense. It would be interesting to investigate how the performance of the texture cache differs from equivalent shared memory usage in such cases.
One main unknown is whether ldg should use Intel dynamically allocated constant memory (as discussed here: #5827) instead of just returning the pointer, i.e. how does CUDA dynamically allocated constant memory (texture cache) map to Intel dynamically allocated constant memory?
If the texture caching strategy remains fundamentally important for image analysis, Molecular Dynamics, and deep learning applications into the future (at least accompanied by appropriate z-ordering as appropriate: https://en.wikipedia.org/wiki/Z-order_curve), it seems likely that there will eventually be a oneAPI extension that exposes such caching functionality across hardware in a portable way.
We have an internship position to investigate all these questions further: https://uk.indeed.com/viewjob?jk=86bdb5c9617a7c47
If you know of any suitable candidates then please let them know!

Thanks

@zjin-lcf (Contributor) commented:

@jchlanda @JackAKirk After reading the code in your link, I understand the implementation. SYCL might add sycl::detail::vector_type_list. The internship post has a paragraph about the study area. I will share the post with other developers/researchers. Thank you!

    sycl::detail::type_list<ldg_vector_types,
                            sycl::detail::gtl::scalar_signed_basic_list,
                            sycl::detail::gtl::scalar_unsigned_basic_list>
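
A generic sketch of the type-list gating technique shown above (plain C++17 with illustrative names, not the extension's actual helpers): an overload participates in resolution only when the type is in the supported list, matching the initial float/double support mentioned earlier.

    #include <type_traits>

    template <typename... Ts> struct type_list {};

    // True when T appears in a type_list.
    template <typename T, typename List> struct is_contained;
    template <typename T, typename... Ts>
    struct is_contained<T, type_list<Ts...>>
        : std::bool_constant<(std::is_same_v<T, Ts> || ...)> {};

    using ldg_supported = type_list<float, double>;

    template <typename T,
              typename = std::enable_if_t<is_contained<T, ldg_supported>::value>>
    T ldg_checked(const T *ptr) {
      return *ptr; // placeholder body for the sketch
    }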

zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Mar 30, 2023
…_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU
zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 6, 2023
… buffers to USM and update memory accesses in the map kernel with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU
@zjin-lcf (Contributor) commented:

@JackAKirk

I find that char3/uchar3 are not included.

zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 13, 2023
…educe global memory accesses explicitly in the kernels; improve SYCL kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU
zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 21, 2023
…the kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU; fix warnings for newer compiler versions
@mmoadeli mmoadeli changed the title [SYCL] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension Jun 12, 2023