
[SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension #7397

Merged: 30 commits merged into intel:sycl on Mar 9, 2023

Conversation

@JackAKirk (Contributor) commented Nov 15, 2022

Exposes the __ldg* clang builtins to SYCL as a CUDA-only extension via a new function, "sycl::ext::oneapi::experimental::cuda::ldg". This feature does not translate to HIP AMD (HIP introduces the caching function as a no-op for AMD backends). AFAIK it does not translate to anything in the current Level Zero spec.

This extension allows GPGPU applications to make use of the texture cache. This is notably used in Molecular Dynamics, as in LAMMPS (https://github.com/kokkos/kokkos/blob/61d7db55fceac3318c987a291f77b844fd94c165/core/src/Cuda/Kokkos_Cuda_View.hpp) and HOOMD-BLUE (see https://github.com/glotzerlab/hoomd-blue/pull/406/files for a good synopsis of how MD can make full use of this feature).

More generally, see the extension document for when usage of "ldg" is advantageous. It is also used in PyTorch: pytorch/pytorch#19165

This PR also resolves #7232
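
For context, a minimal usage sketch (the queue setup, kernel shape, and variable names here are illustrative, not taken from the PR):

    #include <sycl/sycl.hpp>

    namespace exp_cuda = sycl::ext::oneapi::experimental::cuda;

    void scale(sycl::queue &q, const double *in, double *out, size_t n) {
      q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        // On the CUDA backend this read may be served through the texture
        // cache (ld.global.nc); on other backends it is a plain load.
        out[i] = 2.0 * exp_cuda::ldg(&in[i]);
      }).wait();
    }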

JackAKirk added 3 commits November 15, 2022 15:58
Allows GPGPU applications to make use of the texture cache.
JackAKirk added 3 commits November 15, 2022 17:41
@JackAKirk JackAKirk changed the title [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_cache_read extension [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension Nov 24, 2022
JackAKirk added 2 commits November 24, 2022 11:29
@JackAKirk (Contributor, Author) commented Nov 24, 2022

I realized that the clang ldg builtins for all signed integer types use the same intrinsic as the unsigned integer types: this means the existing upstream signed integer clang ldg builtins lead to the wrong PTX instruction. I checked that the PTX instructions are different (and correct) in the CUDA runtime for the signed integer cases.
Also, the bfloat16 and half cases and their vec2 variants need to be able to use the *.nc.b16 and *.nc.b32 instructions respectively; codegen does not currently consider these cases.
So fully supporting the signed integer, bfloat16, and half cases requires upstream patches.

Since a really important use case for __ldg is the double type, and we do not need to delay adding this until the fixes described above land, I have made the initial extension support only float/double types.

@zjin-lcf do you have any feedback for this PR?

@JackAKirk JackAKirk marked this pull request as ready for review November 24, 2022 13:02
@JackAKirk JackAKirk requested a review from jchlanda November 24, 2022 13:03
@zjin-lcf (Contributor) commented:

I read your well-written doc and made the author who posted the hipSYCL issue aware of it.
I have a question: when a kernel argument is const double *__restrict p, could the usage of __ldg be optional?

Thanks

@JackAKirk (Contributor, Author) commented Nov 24, 2022


It looks like at the moment in SYCL it is always needed: I'm guessing the way kernels are currently submitted in SYCL makes it hard for the compiler to know it can do the optimization without it being labelled explicitly via __ldg.
We could work on improving this, but I don't think it is high priority, so it would not be worked on any time soon. Note that even though it is sometimes possible for nvcc to use the .nc instruction without __ldg, the CUDA documentation only mentions using the const and __restrict__ qualifiers as an aid in addition to calling __ldg. I don't see why we would ever recommend that users take the chance and hope it will work without calling __ldg, even if the compiler might sometimes be smart enough to use the CUDA texture cache without the explicit instruction.

JackAKirk added 3 commits November 24, 2022 14:44
@zjin-lcf (Contributor) commented:

When users in universities, labs, and companies migrate CUDA programs to SYCL, optional usage of __ldg may reduce the migration effort. I understand your explanations.
I have a question about "cacheA" in the doc. After a value is read using __ldg in a kernel, I suppose that the value ("cacheA") is stored in a register. Will any writes to it cause the compiler not to generate "ld.global.nc"?

auto cacheA = __ldg(&addr[i]);

@JackAKirk (Contributor, Author) commented Nov 24, 2022

> Will any writes to it cause the compiler not to generate "ld.global.nc"?

This is exactly correct; I cover this explicitly in the "Important" note here: https://github.com/intel/llvm/pull/7397/files#diff-a5636eb0545d0b578041e503b2f07470466c766cb3d599b2c8233c84e8ed393aR109

Do you think it is clear?

@JackAKirk (Contributor, Author) commented:

I can explicitly state that the .nc instruction will not be used if you think that is clearer?

@zjin-lcf (Contributor) commented:

I am not clear why a write to a value stored in a register would affect the read from a cache. In other words, I suppose that the compiler would still generate ld.global.nc. Does a write to a register cause some incoherence?

@JackAKirk (Contributor, Author) commented:


I've checked this by examining the PTX generated in this case, and the compiler does not generate ld.global.nc if the register returned from __ldg is written to. I expressed this in the "Important:" note.
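
To illustrate the two cases (a hedged sketch; addr, out, and the namespace alias are illustrative names, and the comments reflect the behaviour described above as checked by inspecting the generated PTX):

    #include <sycl/sycl.hpp>

    namespace exp_cuda = sycl::ext::oneapi::experimental::cuda;

    SYCL_EXTERNAL void ldg_cases(const double *addr, double *out, size_t i) {
      // Case 1: the returned value is only read, so the load can be
      // emitted as ld.global.nc.
      double cacheA = exp_cuda::ldg(&addr[i]);
      out[i] = 2.0 * cacheA;

      // Case 2: the returned value is written to; per the check above,
      // the compiler then no longer emits ld.global.nc for the load.
      double cacheB = exp_cuda::ldg(&addr[i]);
      cacheB += 1.0;
      out[i + 1] = cacheB;
    }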

@zjin-lcf (Contributor) commented:


> I think that if someone didn't use __ldg with nvcc, but did use const __restrict__ qualifiers, there is a very good chance that the .nc instruction wasn't used with nvcc: basically I think the pointer would have to be declared in the same scope as the read-only condition. As you see in the reported hipSYCL issue, even for the nvcc compiler it only worked without __ldg in special cases.

Okay. I will try to look at the PTX code for CUDA programs more carefully. Thanks.

@JackAKirk (Contributor, Author) commented Nov 24, 2022


OK, it would be useful to know to what degree it works on CUDA. Note that we do still have another internal issue to investigate improving the ability of the const __restrict__ qualifiers to switch on the .nc instruction, but that should not block this PR, and even if we did eventually improve it, it would have no effect on the contents of this PR. As in the CUDA runtime docs, we will not recommend that users rely on the compiler to use the texture cache without explicitly calling __ldg.
We need to choose our priorities sensibly, in order to expose the missing functionality users most commonly need ASAP.

@JackAKirk (Contributor, Author) commented:

/verify with intel/llvm-test-suite#1417

@JackAKirk (Contributor, Author) commented:

> Would it make sense to also add an overload taking a reference when the type is not a pointer? It could be nice to have this also as an accessor property (obviously, it requires more work, perhaps too much for a niche feature).

Yeah this could be a good idea. I think we will try to merge this now as it is and consider supporting the accessor case for a future feature.

@JackAKirk (Contributor, Author) commented:

@intel/llvm-reviewers-runtime could we get a review for this please?

//CHECK-OPAQUE: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0(ptr %23, i32 4)
auto cached_c4 = ldg(&in_c4[0]);
//CHECK: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* %{{.*}}, i32 8)
//CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %24, i32 8)
A contributor commented on the test:

Could you please change the hard-coded pointer values (for example %24) to the regex (%{{.*}})? The hard-coded values will break if we modify the test.
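
For example, applying the suggested regex to the last CHECK-OPAQUE line above would give:

    //CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %{{.*}}, i32 8)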

@JackAKirk (Contributor, Author) replied:

Thanks, I've corrected this now. I've marked Level Zero and OpenCL as unsupported for now; I can't reproduce the Jenkins failure using my OpenCL CPU.

@JackAKirk (Contributor, Author) commented:

/verify with intel/llvm-test-suite#1417

@bader bader merged commit 5360825 into intel:sycl Mar 9, 2023
@jinge90 (Contributor) commented Mar 9, 2023

Hi @JackAKirk,
For the CUDA __ldg* and __st* load/store intrinsics with cache hints, are these loads and stores guaranteed to be atomic?
Thanks very much.

@JackAKirk (Contributor, Author) commented Mar 9, 2023


Hi. See this documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability-of-the-model
ld.global.nc is __ldg, which is used to read through the L1 texture cache, so clearly __ldg breaks the CUDA memory model. However, the cache-hint loads __ld* (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#load-functions-using-cache-hints), including the ones that tell the compiler to opt out of using the L1 cache, and the cache-hint stores __st* (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#store-functions-using-cache-hints) do not seem to be mentioned with respect to the memory model, so I do not know the answer to your question concretely. We have not implemented them, nor investigated them in much detail yet. The only suggested usage of them we have come across so far would currently serve no purpose until we improve alias analysis in the CUDA backend.

@zjin-lcf (Contributor) commented:

@JackAKirk

Do you recommend adding a conditional compile for the use of "ldg()" in a SYCL program? Thanks.

#ifdef CUDA
V = ldg(&arr[0]);
#else
V = arr[0];
#endif

@jchlanda (Contributor) commented Mar 30, 2023


@zjin-lcf, I'll let @JackAKirk confirm it, but I'd think it's not required; it should be transparent from the user's perspective. For non-CUDA targets it just returns the value at the pointer, see here: https://github.com/intel/llvm/pull/7397/files#diff-d7b46fbdc024084037be6515a3b58667036c6aaeeae19499ea6763b614f9bd8fR211
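
A minimal sketch of how such a transparent fallback can work (a hypothetical re-implementation for illustration only; the macro checks and builtin shown are assumptions, and the actual implementation is in the header linked above):

    // Hypothetical sketch, not the extension's actual source.
    inline double ldg_sketch(const double *ptr) {
    #if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
      // CUDA device compilation: use the clang NVPTX builtin behind CUDA's
      // __ldg, which lowers to an ld.global.nc (texture cache) load.
      return __nvvm_ldg_d(ptr);
    #else
      // All other targets: a plain dereference, so callers need no #ifdef.
      return *ptr;
    #endif
    }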

@JackAKirk (Contributor, Author) commented Mar 30, 2023


@zjin-lcf

@jchlanda is right, and note that HIP does the same thing; see for example the HIP documentation on this: https://github.com/ROCm-Developer-Tools/HIP/blob/develop/docs/markdown/hip_porting_guide.md#textures-and-cache-control

Note that since Volta the texture cache and shared memory physically share the same unit. This is interesting because it suggests that, compared to shared memory, texture memory is essentially reduced to a particular caching strategy (plus the read-only condition). It still appears that there can be a significant advantage to using the texture cache post-Volta: see e.g. #8050 (although in that forward-prop example I don't understand why it wouldn't be better to use CUDA's static constant cache), there is plenty of usage of __ldg in PyTorch for example, and the fact that the texture cache apparently improves the performance of #8836 a lot makes a lot of sense. It would be interesting to investigate how the performance of the texture cache differs from equivalent shared memory usage in such cases.
One main unknown is whether ldg should use Intel dynamically allocated constant memory (as discussed here: #5827) instead of just returning the pointer, i.e. how does CUDA dynamically allocated constant memory (texture cache) map to Intel dynamically allocated constant memory?
If the texture caching strategy remains fundamentally important for image analysis, Molecular Dynamics, and deep learning applications into the future (at least accompanied by appropriate z-ordering as appropriate: https://en.wikipedia.org/wiki/Z-order_curve), it seems likely that there will eventually be a oneAPI extension that exposes such caching functionality across hardware in a portable way.
We have an internship position to investigate all these questions further: https://uk.indeed.com/viewjob?jk=86bdb5c9617a7c47
If you know of any suitable candidates then please let them know!

Thanks

@zjin-lcf (Contributor) commented:

@jchlanda @JackAKirk After reading the code in your link, I understand the implementation. SYCL might add sycl::detail::vector_type_list. The internship post has a paragraph about the study area. I will share the post with other developers/researchers. Thank you!

    sycl::detail::type_list<ldg_vector_types,
                            sycl::detail::gtl::scalar_signed_basic_list,
                            sycl::detail::gtl::scalar_unsigned_basic_list>
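
A generic sketch of the type-list gating technique shown above (plain C++17 with illustrative names, not the extension's actual helpers): an overload participates in resolution only when the type is in the supported list, matching the initial float/double support mentioned earlier.

    #include <type_traits>

    template <typename... Ts> struct type_list {};

    // True when T appears in a type_list.
    template <typename T, typename List> struct is_contained;
    template <typename T, typename... Ts>
    struct is_contained<T, type_list<Ts...>>
        : std::bool_constant<(std::is_same_v<T, Ts> || ...)> {};

    using ldg_supported = type_list<float, double>;

    template <typename T,
              typename = std::enable_if_t<is_contained<T, ldg_supported>::value>>
    T ldg_checked(const T *ptr) {
      return *ptr; // placeholder body for the sketch
    }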

zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Mar 30, 2023
…_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU
zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 6, 2023
… buffers to USM and update memory accesses in the map kernel with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU
@zjin-lcf (Contributor) commented:

@JackAKirk

I find that char3/uchar3 are not included.

zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 13, 2023
…educe global memory accesses explicitly in the kernels; improve SYCL kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on an NVIDIA GPU
zjin-lcf pushed a commit to zjin-lcf/HeCBench that referenced this pull request Apr 21, 2023
…the kernel performance with the sycl_ext_oneapi_cuda_tex_cache_read extension (intel/llvm#7397) on a NVIDIA GPU; fix warnings for newer compiler versions
@mmoadeli mmoadeli changed the title [SYCL] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension Jun 12, 2023