diff --git a/sycl/ReleaseNotes.md b/sycl/ReleaseNotes.md index 3476099ea4a09..16147005a12c0 100644 --- a/sycl/ReleaseNotes.md +++ b/sycl/ReleaseNotes.md @@ -293,7 +293,7 @@ extension. [1d993446] [4f7787c8] - Implemented `ext::oneapi::experimental::radix_sorter` from the [`sycl_ext_oneapi_group_sort`](doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc) extension proposal. [86ba1809] -- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) +- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/7dab76e1d33341b1e6bf339ab933552281abb3e2/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) extension for CUDA. [166bbc36] - Added support for sorting over sub-groups. [168767c6] - Added C++ API wrappers for the Intel math functions `ceil`, `floor`, `rint`, @@ -407,7 +407,7 @@ extension proposal to allow the compiler to determine the initiation interval. - Updated the [`sycl_ext_intel_usm_address_spaces`](doc/extensions/supported/sycl_ext_intel_usm_address_spaces.asciidoc) extension to adhere to SYCL 2020 `multi_ptr`. [4a9e9a0e] - Added a new matrix use parameter to `joint_matrix` from the -[`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) +[`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) extension specification. [52f34fd5] - Removed `queue::size` and `queue::get_wait_list` functions from the `sycl_ext_oneapi_queue_status_query` extension due to performance overhead @@ -654,7 +654,7 @@ Release notes for commit range [`4043dda3..0f579bae`](https://github.com/intel/l to mark `has_property` API as `noexcept`. [7805aa3f] - Updated [`sycl_ext_intel_device_info`](doc/extensions/supported/sycl_ext_intel_device_info.md) to support querying free device memory. [0eeef2b3] -- Updated [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) +- Updated [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/770f540d8b600c8c16df12dfccbf38fa780cf77a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) with description of new matrix features. 
[770f540d] - Moved [`sycl_ext_oneapi_invoke_simd`](doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc) extensions specification from `proposed` to `experimental` because @@ -1300,7 +1300,7 @@ Release notes for commit range 23ca0c2..27f59d8 Level Zero, ESIMD emulator, HIP [2b0ebab376dc] - Added support for `sycl::ext::intel::experimental::esimd_ballot` function [0bbb091c1baa] - - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Added initial support for [Tensor Cores matrix extension](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [711ba58c30a8] ### Documentation @@ -1692,7 +1692,7 @@ Release notes for commit range 4fc5ebe..bd68232 - Added [sRGBA support](doc/extensions/supported/sycl_ext_oneapi_srgb.asciidoc) [e488327][191efdd] - Added a preview feature implementation for the DPC++ experimental - [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + [matrix extension](https://github.com/intel/llvm/blob/467ef25a309ec882027052f3d4c3df58c11ee2ac/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) [7f218531] [a95f46d] - Added support for SYCL 2020 exceptions [5c0f748][eef07606][5af8c43d] - Added support for [sycl_ext_intel_bf16_conversion extension](doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc) @@ -1956,7 +1956,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 for querying of free device memory in LevelZero backend extension [fa428bf] - Added [InvokeSIMD](doc/extensions/proposed/sycl_ext_oneapi_invoke_simd.asciidoc) and [Uniform](doc/extensions/proposed/sycl_ext_oneapi_uniform.asciidoc) extensions [72e1611] - - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [ace4c733] + - Added [Matrix Programming Extension for DPC++ document](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) [ace4c733] - Implemented SYCL 2020 `sycl::span` [9356d53] - Added [device-if](doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) extension [4fb95fc] @@ -2102,7 +2102,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 - Fixed build issue when CUDA 11 is used [f7224f1] - Fixed caching of sub-devices in Level Zero backend[4c34f93] - Fixed requesting of USM memory allocation info on CUDA [691f842] - - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Fixed [`joint_matrix_mad`](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) behaviour to return `A*B+C` instead of assigning the result to `C` [ea59c2b] - Workaround an issue in Level Zero backend when event isn't waited upon its completion but is queried for its status in an infinite loop [bfef316] diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc new file mode 100644 index 0000000000000..d6de22bdae391 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc @@ -0,0 +1,327 @@ += sycl_ext_intel_matrix + +:source-highlighter: coderay +:coderay-linenums-mode: table + 
+// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2022-2023 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +This extension also depends on the following other SYCL extensions: + +* link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix] + +== Status +This is an experimental extension specification, intended to provide early +access to features and gather community feedback. Interfaces defined in this +specification are implemented in {dpcpp}, but they are not finalized and may +change incompatibly in future versions of {dpcpp} without prior notice. +*Shipping software products should not rely on APIs defined in this +specification.* + +== Backend support status +This document describes the extra features and details for the +implementation of `joint_matrix` extension on Intel AMX and Intel +XMX. + +The APIs in this extension may be used only on a device that has +`aspect::ext_intel_matrix`. The application must check that the device +has this aspect before submitting a kernel using any of the APIs in +this extension. If the application fails to do this, the +implementation throws a synchronous exception with the +`errc::kernel_not_supported` error code when the kernel is submitted to +the queue. + +== Overview +This extension provides additional APIs related to the `joint_matrix` +type that can be used only on Intel devices that have Intel AMX or +Intel XMX technology. These Intel devices also support all of the +generic matrix APIs specified in `sycl_ext_oneapi_matrix`, but +applications can make use of the extended Intel specific APIs in this +extension to gain additional performance and capabilities. + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must +predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values +defined in the table below. Applications can test for the existence of +this macro to determine if the implementation supports this feature, +or applications can test the macro's value to determine which of the +extension's APIs the implementation supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|The APIs of this experimental extension are not versioned, so the + feature-test macro always has this value. +|=== + +=== New Aspect for Intel-Specific Matrix APIs +This extension adds a new device aspect: +```c++ +namespace sycl { + +enum class aspect : /*unspecified*/ { + ext_intel_matrix +}; + +} // namespace sycl +``` +The `ext_intel_matrix` aspect indicates that the device is capable of +using the extended joint matrix APIs that are defined in the sections +that follow. 
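For illustration, a minimal host-side sketch of the aspect check described in the backend support section above, assuming an implementation that defines `aspect::ext_intel_matrix` as specified; the queue construction and the fallback path are placeholders, not part of this extension:

```c++
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  // Only take the joint_matrix code path when the device reports the
  // Intel-specific matrix aspect; otherwise fall back to a generic path,
  // so that kernel submission does not fail with errc::kernel_not_supported.
  if (q.get_device().has(sycl::aspect::ext_intel_matrix)) {
    // submit kernels that use the sycl_ext_intel_matrix APIs here
  } else {
    // generic sycl_ext_oneapi_matrix path or host fallback
  }
  return 0;
}
```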
+ +=== New Layout Type +This extension adds a new layout type named `ext_intel_packed` which +an application can use to indicate that the matrix data is loaded or +stored in VNNI "packed" format. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix::layout { + +enum class layout { + ext_intel_packed +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +Consequently, the layout argument `layout` in `joint_matrix_load` can +take `ext_intel_packed` as argument to specify that the data has +already been transformed into VNNI format. In this case, the `stride` +argument of `joint_matrix_load` describes the number of elements +between consecutive rows for packed layouts. + +In order to get maximum performance on Intel AMX and Intel XMX, +prepacking data in the memory is necessary. If users did not specify +the packed layouts, transforms done by the implementation may be slow +due to extra scatter/gather operations. Hence, we expose the +`ext_intel_packed` layout to the user to specify that A or B have +already been VNNIed. The packed or VNNI layout is introduced in the +`VNNI layout` section below. + +=== Additional Store Operations +Besides store of matrix `accumulator`, the Intel implementation allows +store on matrix `a` and `b` as well. + +```c++ +namespace sycl::ext::intel::experimental::matrix { + +template +void joint_matrix_store(Group g, + const joint_matrix &res, + multi_ptr src, size_t stride); + +template +void joint_matrix_store(Group g, + const joint_matrix &res, + multi_ptr src, size_t stride); + +} // namespace sycl::ext::intel::experimental::matrix +``` + +=== Per-element Access with Coordinates +The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides +a way for the application to apply the same operation on every element +of the matrix. However, some algorithms require the application to +know the coordinates of each element as it operates on them. In this +case, the joint matrix index must be known in order to reason about +the matrix view and extract the relevant piece such as a sum of all +elements in a row for example. For instance, quantization that is +needed for conversion between low precision types like `int8_t` and `fp32` +uses such logic. + +This extension adds a new form of the `joint_matrix_apply` function in +the `sycl::ext::intel::matrix` namespace that allows the application +to perform an operation on each element of the matrix. This function +is similar to the form in `sycl_ext_oneapi_joint_matrix`, but it also +provides the matrix coordinates of each element to the callback +function: + +```c++ +namespace sycl::ext::intel::experimental::matrix { + +template +void joint_matrix_apply(Group g, joint_matrix& C, F&& func); + +} // namespace sycl::ext::intel::experimental::matrix +``` +The `func` callback is invoked with three parameters `(T& element, +size_t row, size_t col)`, where `row` and `col` tell the coordinates +of element in the joint matrix. To illustrate, the following example +shows how you can use this API to sum the rows of a matrix: + +```c++ +joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) { + sum_local_rows[row] += val; +}); +``` +=== New Device Information Descriptor +Besides the query we provide in +link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +some device descriptors are Intel hardware specific. 
These are +provided as part of `ext::intel::experimental::info::device::matrix` +namespace: + +[frame="none",options="header"] +|====================== +| Device descriptors | Return type| Description +|`ext::intel::experimental::info::device::matrix::numtiles`| `int` +|If the matrix hardware in the device has separate storage (register +files or tiles) from the rest of the processing units (e.g. Intel +AMX), returns the number of tiles. For other devices, returns 0. +|====================== + +=== Packed Layout Format +The `ext_intel_packed` layout (aka VNNI) is a special layout for +matrix data that allows Intel AMX and Intel XMX devices to load +matrices more efficiently (packing in 32 bits). This layout applies +only to the A and B matrices, and may not be used with the accumulator +matrix. The layout is different depending on whether the matrix +element type is 8 bits or 16 bits, which are the only two element +sizes supported for the A and B matrices on Intel AMX and Intel XMX +devices. + +For an 8-bit element, the first four elements of column 0 are stored +contiguously in memory, followed by the first four elements of column +1, etc. This continues until the end of the row. After all the +elements for rows 0 - 3 have been stored this way, the process +repeats, starting with the next four elements of column 0. The diagram +below illustrates this layout for a 8 x 4 matrix. + +==== Example 1: 8-bit elements + + // Example of a 8 row x 4 column matrix using a 8-bit data + // element, in row-major layout, rows are shown horizontally. + // Element a1 is contiguous in memory with element b1, etc. + // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // a5, b5, c5, d5 + // a6, b6, c6, d6 + // a7, b7, c7, d7 + // a8, b8, c8, d8 + // --------------------------------- + // The same matrix reformatted in packed layout. + // Here, packing of 4 elements is needed to form 32 bits. + // Elements a1, a2, a3, a4 are contiguous in memory, etc. + // --------------------------------- + // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 + // a5, a6, a7, a8, b5, b6, b7, b8, c5, c6, c7, c8, d5, d6, d7, d8 + +For a 16-bit element, the first two elements of column 0 are stored +contiguously in memory, followed by the first two elements of column +1, etc. This continues until the end of the row. After all the +elements for rows 0 - 1 have been stored this way, the process +repeats, starting with the next two elements of column 0. The diagram +below illustrates this layout for a 4 x 4 matrix. + +==== Example 2: 16-bit elements + // Example of a 4 row x 4 column matrix using a 16-bit data + // element, in row-major layout. + // Element a1 is contiguous in memory with element b1, etc. + // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // --------------------------------- + // The same matrix reformatted in packed layout. + // Here, packing of 2 elements is needed to form 32 bits. + // Element a1 is contiguous in memory with element a2, etc. 
+ // --------------------------------- + // a1, a2, b1, b2, c1, c2, d1, d2 + // a3, a4, b3, b4, c3, c4, d3, d4 + +=== Example using int8_t type +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; +range<2> G = {M/tM, N}; +range<2> L = {1, SG_SIZE}; +auto bufA = sycl::buffer{memA, sycl::range{M*K}}; +auto bufB = sycl::buffer{memB, sycl::range{K*N}}; +auto bufC = sycl::buffer{memC, sycl::range{M*N}}; +q.submit([&](sycl::handler& cgh) { + auto accA = sycl::accessor{bufA, cgh, sycl::read_only}; + auto accB = sycl::accessor{bufB, cgh, sycl::read_only}; + auto accC = sycl::accessor{bufC, cgh, sycl::read_write}; + cgh.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix tA; + joint_matrix tB; + joint_matrix tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, accA + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, accB + k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC); + for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; + joint_matrix_store(sg, tC, + accC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); + }); +}); +q.wait(); +``` +== Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API, +layout information, and `joint_matrix_apply` with coordinates API +|====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc new file mode 100644 index 0000000000000..94c2bebe04906 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -0,0 +1,976 @@ += sycl_ext_oneapi_matrix + +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (c) 2021-2023 Intel Corporation. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +== Status +This is an experimental extension specification, intended to provide early +access to features and gather community feedback. 
Interfaces defined in this +specification are implemented in {dpcpp}, but they are not finalized and may +change incompatibly in future versions of {dpcpp} without prior notice. +*Shipping software products should not rely on APIs defined in this +specification.* + +== Backend support status +This extension is currently implemented in {dpcpp} only for devices +that contain matrix hardware, specifically Intel(R) Advanced Matrix +Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) +XMX) and Nvidia(R) Tensor Cores. + +The `joint_matrix` type and the `joint_matrix_mad` function are +optional kernel features as defined in section 5.7 of the core SYCL +specification. Each device supports only certain values for the `M`, +`N`, and `K` template parameters and only certain types for the `Ta`, +`Tb`, and `Tc` template parameters. Applications can use the query API +in `matrix_params` or `get_info` +to determine the set of legal parameters for each device. If the +application submits a kernel using an unsupported `joint_matrix` type +or calls `joint_matrix_mad` with an unsupported combination, the +implementation throws a synchronous exception with the +`errc::kernel_not_supported` error code as described in section 5.7. + +== Overview +Joint matrix is a SYCL extension for matrix hardware programming. It +unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and +Nvidia Tensor Cores. This provides a portable and performant API for +users who want to build their own neural network applications, +perform custom optimizations, or experiment with new operations in a +timely and performant manner. + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine +the macro `SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the +table below. Applications can test for the existence of this macro to +determine if the implementation supports this feature, or applications +can test the macro's value to determine which of the extension's +features the implementation supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|The APIs of this experimental extension are not versioned, so the + feature-test macro always has this value. +|=== + +=== New `joint_matrix` class +This extension adds a new class named `joint_matrix`, which represents +a small 2-dimensional matrix that supports native operations in +hardware. There are a number of template parameters, namely the group +scope, the type of the elements, the matrix use, the shape, and the +memory layout of the matrix. This results in the following description: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, + layout Layout = layout::dynamic> +struct joint_matrix { + joint_matrix(); + joint_matrix(const joint_matrix &) = delete; + joint_matrix &operator=(const joint_matrix &) = delete; +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +The constructor for the `joint_matrix` type is a group function as +defined in section 4.17.3 of the core SYCL specification. It must be +encountered in converged control flow by all work-items in the +`Group`. + +==== Group Memory Scope +Most operations on the joint_matrix are group functions, meaning that +all work items in a group collectively perform an operation on the +same matrix. The `Group` template parameter specifies the execution +scope of the work-items in the group.
The `joint_matrix` is shared among the +work items in the group and is not private to each work item. This +extension currently supports only the sub-group scope, so the `Group` +template parameter must be `sycl::sub_group`, and group operations for +the joint matrix must be done collectively by the work-items in a +single sub-group. In this case, a matrix is declared as follows: + +```c++ +joint_matrix tA; +``` + +==== Element Type +The `T` template parameter specifies the type of each element in the +matrix. Each device supports only certain element types, so the +application must ensure that the element type is supported on the +device where the kernel using this joint_matrix runs. The query +functions (defined below) may be used to determine the set of element +types that are supported on a device. + +==== Matrix Use +The main operation performed by the matrix hardware is `D=C+A*B`. The +`Use` template parameter specifies which of these terms (A, B, C, or D) +corresponds to the `joint_matrix` object. The `use` enumeration defines +the set of legal values. The A matrix must have the value `use::a`. The +B matrix must have the value `use::b`. The C and D matrices must both +have the value `use::accumulator`. This is used by backend +implementations to reason about the layout of the matrix in +registers. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +enum class use { + a, + b, + accumulator +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +==== Matrix Shape +The `Rows` and `Cols` template parameters provide the number of rows +and columns in the joint matrix. Each device supports only certain +combinations of row and column sizes, so the application must ensure +that the combination is supported on the device where the kernel using +this `joint_matrix` runs. The query functions (defined below) may be +used to determine the set of combinations that are supported on a +device. + +==== Matrix Layout +The `Layout` template parameter specifies the memory layout of the +matrix, using one of the values in the layout enumeration. The A and B +matrices can be either `layout::row_major` or `layout::col_major` (but not +`layout::dynamic`). The C and D matrices must be `layout::dynamic`. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +enum class layout { + row_major, + col_major, + dynamic +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +Note that the `Layout` template parameter defaults to `layout::dynamic` +when `Use` is `use::accumulator`, so applications need not specify this +template parameter for the C or D matrices, and it is invalid to +specify any other value for `Layout`. When `Use` has any other value, +there is no default for `Layout`, and the application must specify one +explicitly. + +=== Collective matrix operations +The following operations (load, store, multiply-and-add, fill, and +element-wise operations) are group functions as defined in section +4.17.3 of the core SYCL specification. As such, they must be +encountered in convergent control flow by the work-items in the group +that performs the group operation.
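Before calling these group functions, a kernel declares the matrices they operate on. The following is a non-normative sketch of such declarations for an 8-bit integer multiply with a 32-bit accumulator; the tile sizes `tM`, `tN`, and `tK` are placeholder constants that must be values supported by the device, as reported by the query interface described below:

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// Inside a kernel, with tM/tN/tK placeholder shapes supported by the device.
joint_matrix<sycl::sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
joint_matrix<sycl::sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;
// Accumulator: Layout defaults to layout::dynamic and is not specified.
joint_matrix<sycl::sub_group, int32_t, use::accumulator, tM, tN> tC;
```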
+ +==== Load +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +// Only available when std::is_same_v> +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + +// Only available when Layout != layout::dynamic +// and when std::is_same_v> +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +`joint_matrix_load` loads data from memory to the registers of the +matrix hardware. +We define two overloads of the load function depending on whether the +memory layout was declared as part of the `joint_matrix` type or not. +The first overload that takes memory layout as an argument is only +available for a `joint_matrix` type that used the default value +`layout::dynamic`. +The second overload without a memory layout must not be used with a +`joint_matrix` type that has `layout::dynamic`. + +The base pointer `src` of type `T` here determines the starting address of the +matrix to be loaded from. `Layout` determines whether the data is +being read in a row (`row_major`), column major (`col_major`) +fashion. `stride` describes the number of elements between consecutive +rows for the row major layout, or between columns for the column major +layout. + +==== Store +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_store(Group g, + const joint_matrix &res, + multi_ptr dest, size_t stride, layout Layout); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +This function stores the data in the accumulator matrix from the +registers back to memory. + +The base pointer `dest` here determines the starting address of the +matrix to be stored. `Layout` determines whether the data is being +written in a row (`row_major`), column major (`col_major`) +fashion. `stride` describes the number of elements between consecutive +rows for the row major layout, or between columns for the column major layout. + + +==== Multiply and Add + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +joint_matrix +joint_matrix_mad(Group g, + const joint_matrix &A, + const joint_matrix &B, + const joint_matrix &C); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +The matrix multiply and add function performs the multiply operation +on the matrices `A` and `B`, accumulates the result with `C` and returns +the result. + +Each device supports only certain combinations of types for the `A`, +`B`, and `C` matrices. The application must use the query operations +(defined below) to ensure that the combination of types is supported +on the device where the kernel calling `joint_matrix_mad` runs. + +==== Fill (Initialization) +Unlike `joint_matrix_load` that assumes that all the matrices are +directly loaded from memory, `joint_matrix_fill` makes it possible to +multiply a matrix which is not directly loaded from memory but rather +initialized directly in the register. Note that the value type `Tv` +must be convertible to the matrix elements type `T`. 
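For example, an accumulator is typically zero-initialized before the multiply loop. A non-normative sketch, reusing the accumulator declared in the earlier sketch, with `sg` the calling sub-group:

```c++
// tC is a joint_matrix<sycl::sub_group, int32_t, use::accumulator, tM, tN>;
// the integer literal is converted to the element type int32_t.
joint_matrix_fill(sg, tC, 0);
```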
+ +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_fill(Group g, joint_matrix &m, Tv v); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +==== Copy +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_copy(Group g, + joint_matrix &src, + joint_matrix &dest); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +This function copies `Rows x Cols` elements of type `T1` from joint +matrix `src` to `Rows x Cols` elements of type `T2` of joint matrix +`dest`. The two matrices must have the same scope, shape, and +layout. Use and type can be different so this function converts +between different `use` of matrices. + +==== Element-Wise Operations +Besides matrix multiply and add, this extension aims to make it +possible to perform element-wise operations on matrices in a SPMD +manner. `joint_matrix_apply` function performs an element-wise +operation where the same operation is performed on every element of +the joint matrix, such that the operation can be performed without knowledge +of the position of the element within the matrix. Activation functions +or adding a constant value to every element of the matrix are two +examples of this usage. When the operation depends on the element +index of the matrix, an Intel-specific extension is available as part +of the link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix] + +Besides the `Group` and the `joint_matrix` arguments, +`joint_matrix_apply` takes a C++ Callable object which is invoked once +for each element of the matrix. This callable object must be invocable +with a single parameter of type `T&`. Commonly, applications pass a +lambda expression. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_apply(Group g, joint_matrix& C, F&& func); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +In the following example, every element of the matrix `C` is +multiplied by `alpha`. Then, an activation function, `relu` in this +example, is applied on each of the elements of `C`. + +```c++ +joint_matrix_apply(sg, C, [=](T &x) { + x *= alpha; + relu(x); +}); +``` + +=== Support for the TF32 Data Type +Some devices support the TF32 floating point type for matrix +elements. This type has a 19 bit format with one sign bit, 8 exponent +bits (offering the same range as float), and 10 mantissa bits +(offering the same precision as sycl::half). Use of this type can +accelerate the joint_matrix_mad operation by reducing its +precision. In order to declare a `joint_matrix` object with this +element type, use `matrix::precision::tf32` in place of the `T` +template parameter. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix::precision { + +class tf32; + +} // namespace sycl::ext::oneapi::experimental::matrix::precision +``` + +For example: + +```c++ +joint_matrix tA; +``` + +Whenever the application loads, stores, fills, or accesses the +elements of a TF32 matrix, the application sees the elements as +float. There are special overloads of these functions for TF32 for +this purpose. + +==== TF32 load +These overloads of `joint_matrix_load` load float values into a TF32 +matrix. It is unspecified whether the implementation loads all 32 bits +into the joint matrix or if it only loads the relevant 19 bits. 
+ +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + +// Only available when Layout != layout::dynamic +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); + +// Only available when Layout != layout::dynamic +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +==== TF32 store +This overload of joint_matrix_store stores float values from a TF32 +matrix. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +void joint_matrix_store(Group g, + const joint_matrix &res, + multi_ptr dest, size_t stride, layout Layout); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +==== TF32 fill +When `joint_matrix_fill` is called for a TF32 matrix, the type `Tv` +(the type of the fill value) must be implicitly convertible to +`float`. It is unspecified whether the implementation writes all 32 +bits of the value into the joint matrix or if it only writes the +relevant 19 bits. + +==== TF32 element-wise operations +When `joint_matrix_apply` is called for a TF32 matrix, the Callable +object func is called with a single argument of type `float &`. When the +application changes this value, it is unspecified whether the +implementation writes back all 32 bits of the element into the joint +matrix or if it only write the relevant 19 bits. + +In the example below, `C` is a joint matrix of type `precision::tf32`. + +```c++ +joint_matrix_apply(sg, C, [=](float &x) { + x *= alpha; +}); +``` +==== Rounding TF32 values +The functions `joint_matrix_load`, `joint_matrix_fill`, and +`joint_matrix_apply` do not define any rounding mode when the float +values are converted to TF32, and the implementation may either round +or truncate these conversions. If an application wants more control +over this rounding, it can use the `round_to_tf32` function. This +performs the round to nearest even (RTE) rounding mode. 
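As a non-normative sketch (the declaration of `round_to_tf32` is shown just below), an application that wants explicit RTE rounding for an element-wise update of a TF32 matrix `C`, with `alpha` a float scaling factor as in the earlier examples, can call `round_to_tf32` inside the callback:

```c++
// C has element type precision::tf32; the callback sees each element as
// float, and round_to_tf32 forces RTE rounding of the scaled value
// before it is written back into the matrix.
joint_matrix_apply(sg, C, [=](float &x) {
  x = round_to_tf32(x * alpha);
});
```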
+ +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +float round_to_tf32(float elem); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +=== Example using `int8_t` type +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; +range<2> G = {M/tM, N}; +range<2> L = {1, SG_SIZE}; +int8_t *memA = malloc_shared<int8_t>(M*K, q); +int8_t *memB = malloc_shared<int8_t>(K*N, q); +int32_t *memC = malloc_shared<int32_t>(M*N, q); +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA; + joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB; + joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, + multi_ptr(memA) + + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, + multi_ptr(memB) + + k * N + sg_starty/SG_SIZE*tN, N); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + joint_matrix_apply(sg, tC, [=](int32_t &x) { + x *= alpha; + }); + joint_matrix_store(sg, tC, + multi_ptr(memC) + + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); +}).wait(); +``` + +=== Query Interface +Most devices support only certain values for the `Rows` and `Cols` +template parameters and only certain types for the `T` template +parameter. Moreover, most devices support only certain combinations of +these template parameters for the A, B, C, and D matrices in the +`joint_matrix_mad` function (see Appendix: Supported Combinations Per +Hardware). This extension adds two query APIs that can be used to +determine the set of legal parameters for a particular device. One +form provides `constexpr` values for these parameters, which can be +used when the application knows the specific device architecture on +which it will run. The other form uses the standard information +descriptor queries for the device object. + +The description below uses the terms `M`, `N`, and `K` to identify the +matrix dimensions of a multiply and add operation `D = C + A*B`. The +`D` and `C` matrices are `M` rows by `N` columns. The `A` matrix is +`M` rows by `K` columns, and the `B` matrix is `K` rows by `N` columns. + +==== Compile-Time Query +This returns `constexpr` values to use in `joint_matrix` template +arguments but depends on an enumeration of the matrix hardware (see +`sycl::ext::oneapi::experimental::architecture`) in the +link:../sycl_ext_oneapi_device_architecture.asciidoc[sycl_ext_oneapi_device_architecture] +extension that can be tested. +The compile-time query interface proposed here consists of two +functionalities: + +- Validation: at compile time, the validation functionality informs + the user whether a specific combination is valid or not. This takes + place when the user specifies all template parameters. + +- Default values: this provides a default shape if the user does not + provide a specific combination. In this case, aliases to the + `joint_matrix` type can be used, namely + `joint_matrix_a/b/c/d` where no additional argument is needed. This + form happens when the user specifies all template parameters except + the sizes of the matrices M, N, and K. + +The table below provides a description for each of the member +variables in the `matrix_params` class and the forms in which they are +defined.
+ +[frame="none",options="header",cols="40%,60%"] +|====================== +| Member/type alias in `matrix_params` | Description +a| +[source] +---- +static constexpr size_t M +---- +|when no sizes are provided by the user, indicates the suggested +default size for M; usually this corresponds to the maximum size the +implementation supports. In validation mode, where the user does +provide sizes, this is the same value M that the user provides if M is +supported by the implementation + +a| +[source] +---- +static constexpr size_t N +---- +|when no sizes are provided by the user, indicates the suggested +default size for N; usually this corresponds to the maximum size the +implementation supports. In validation mode, where the user does +provide sizes, this is the same value N that the user provides if N is +supported by the implementation + +a| +[source] +---- +static constexpr size_t K +---- +|when no sizes are provided by the user, indicates the suggested +default size for K; usually this corresponds to the maximum size the +implementation supports. In validation mode, where the user does +provide sizes, this is the same value K that the user provides if K is +supported by the implementation + +a| +[source] +---- +template +using joint_matrix_a +---- +|type alias for `joint_matrix` for matrix A + +a| +[source] +---- +template +using joint_matrix_b +---- +|type alias for `joint_matrix` for matrix B + +a| +[source] +---- +template +using joint_matrix_c +---- +|type alias for `joint_matrix` for the input matrix accumulator + +a| +[source] +---- +template +using joint_matrix_d +---- +|type alias for `joint_matrix` for the output matrix accumulator +|====================== + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template +struct matrix_params; + +// This is the validation form, when all template parameters are +// specified. +template +struct matrix_params { + // An implementation typically uses static_assert here to trigger a + // compilation error when the matrix types or shapes are not + // supported by the device identified by the architecture "Arch". + + static constexpr size_t M = sM; + static constexpr size_t N = sN; + static constexpr size_t K = sK; + + template + using joint_matrix_a = joint_matrix; + + template + using joint_matrix_b = joint_matrix; + + template + using joint_matrix_c = joint_matrix; + + template + using joint_matrix_d = joint_matrix; +}; + +// This is the default values form, where the matrix dimensions are +// omitted. +template +struct matrix_params { + // An implementation typically uses static_assert here to trigger a + // compilation error when the matrix types are not supported by the + // device identified by the architecture "Arch". 
+ + static constexpr size_t M = /* implementation defined */; + static constexpr size_t N = /* implementation defined */; + static constexpr size_t K = /* implementation defined */; + + template + using joint_matrix_a = joint_matrix; + + template + using joint_matrix_b = joint_matrix; + + template + using joint_matrix_c = joint_matrix; + + template + using joint_matrix_d = joint_matrix; +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +===== Validation Example: +```c++ +// User can provide sizes besides the types and matrix_params can assert +// if they are supported or not +// in this case, an assertion will happen as 16 is not a supported size for M +using myparams = matrix_params; +size_t NDRangeM = M / myparams::M; //Assertion would happen at this line +size_t NDRangeN = N / myparams::N; +``` + +===== Default Values Example: +```c++ +using myparams = matrix_params; +// use this to construct the ranges on the host side +size_t NDRangeM = M / myparams::M; +size_t NDRangeN = N / myparams::N; +// if M, N, K are not multiples of the default sizes, padding has to be done +// device code: the matrices are constructed using the default dimensions +myparams::joint_matrix_a sub_a; +myparams::joint_matrix_b sub_b; +myparams::joint_matrix_c sub_c; + +``` +==== Runtime Query +The runtime query does not require the application to hard-code a +specific device type, but it also returns values that are not +`constexpr`. It provides similar information to the compile-time query +API via an extended device information descriptor. + +The table below provides a description for each of the device matrix +descriptors that can be queried using the `get_info` API. + +[frame="none",options="header"] +|====================== +| Device descriptors | Return type| Description +|`ext::oneapi::experimental::info::device::matrix::combinations` | +`std::vector<combination>`| tells the set of supported matrix sizes +and types on this device +|====================== + +The runtime query descriptor `combinations` returns a vector of +`combination` type. Each combination includes the sizes and the types for the +matrices A, B, C, and D. Note that for each matrix hardware, +the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, +ksize` exclusively, depending on whether the implementation supports a +continuous or discrete number of sizes. If a device supports a +continuous number of sizes, the `max_*` variant is applied and only +the maximum number is returned. However, if a device supports a +discrete list of sizes, the `msize, nsize, ksize` variant is applied. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +enum class matrix_type { + bf16, + fp16, + tf32, + fp32, + fp64, + sint8, + sint16, + sint32, + sint64, + uint8, + uint16, + uint32, + uint64 +}; +struct combination { + size_t max_msize; + size_t max_nsize; + size_t max_ksize; + size_t msize; + size_t nsize; + size_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; + matrix_type dtype; +}; + +} // namespace sycl::ext::oneapi::experimental::matrix +``` + +Each element of the `combinations` vector describes the types and +sizes of the A, B, C, and D matrices supported by the device +implementation. The table below provides a description of each member +of the `combination` struct.
+ +[frame="none",options="header"] +|====================== +| Member of `combination` | Description +|`max_msize`, `max_nsize`, `max_ksize`| if the matrix implementation +supports a continuous number of element sizes, each of these members +is non-zero, and the matrix implementation supports all element sizes +from 1 up to (and including) that number. By contrast, if the matrix +hardware implementation supports a discrete number of element sizes, +each of these members has the value zero +|`msize`, `nsize`, `ksize`| if the matrix implementation supports a +discrete number of element sizes, each of these members is non-zero, +and the value tells one of the supported element sizes. By contrast, +if the matrix hardware supports a continuous number of element sizes, +each of these members has the value zero +|`atype`, `btype`, `ctype`, `dtype`| indicates the types supported in +the combination. These are of type `matrix_type`, which identifies the +type used for the `T` template parameter of the A, B, C, and D matrices +as follows: + +`bf16`: `sycl::bfloat16` + +`fp16`: `sycl::half` + +`tf32`: `sycl::ext::oneapi::experimental::matrix::precision::tf32` + +`fp32`: `float` + +`fp64`: `double` + +`sint8`: `int8_t` + +`sint16`: `int16_t` + +`sint32`: `int32_t` + +`sint64`: `int64_t` + +`uint8`: `uint8_t` + +`uint16`: `uint16_t` + +`uint32`: `uint32_t` + +`uint64`: `uint64_t` +|====================== + +===== Runtime Query Example: +```c++ +// Ta, Tb, Tc, and Td are the types used in applications +std::vector<combination> combinations = + device.get_info<ext::oneapi::experimental::info::device::matrix::combinations>(); +for (int i = 0; i < combinations.size(); i++) { + if (Ta == combinations[i].atype && + Tb == combinations[i].btype && + Tc == combinations[i].ctype && + Td == combinations[i].dtype) { + // joint matrix GEMM kernel can be called using these sizes + joint_matrix_gemm(combinations[i].msize, + combinations[i].nsize, combinations[i].ksize); + } +} +``` + +=== Appendix: Supported Combinations Per Hardware +The table below provides a list of the combinations that +`joint_matrix` implementations support on each of Intel AMX and Intel +XMX hardware. Note that these can be returned using +`ext::oneapi::experimental::info::device::matrix::combinations`. + +==== Intel AMX Supported Combinations +This is currently available in devices with the architecture +`architecture::intel_cpu_spr`. In this architecture's implementation, +the type of the C matrix must be the same as the type of the D +matrix. Therefore, that common type is shown in a single column in the +table below. + +[frame="none",options="header"] +|====================== +| A type | B type | C and D type | M | N | K +| `matrix_type::uint8` | `matrix_type::uint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::uint8` | `matrix_type::sint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::sint8` | `matrix_type::uint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::sint8` | `matrix_type::sint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::bf16` | `matrix_type::bf16` | +`matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 +|====================== + +==== Intel XMX Supported Combinations +This is currently available in devices with the architecture +`architecture::intel_gpu_pvc` and `architecture::intel_gpu_dg2`. In +these architectures' implementation, the type of the C matrix must be +the same as the type of the D matrix. Therefore, that common type is +shown in a single column in the table below.
+ +[frame="none",options="header"] +|====================== +| A type | B type | C and D type | M | N | K | device +| `matrix_type::uint8` | `matrix_type::uint8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::uint8` | `matrix_type::int8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::int8` | `matrix_type::uint8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::int8` | `matrix_type::int8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::fp16` | `matrix_type::fp16` | +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | architecture::intel_gpu_pvc +| | | | |8|| architecture::intel_gpu_dg2 +| `matrix_type::bf16` | `matrix_type::bf16` | +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | architecture::intel_gpu_pvc +| | | | |8|| architecture::intel_gpu_dg2 +|====================== + +==== Nvidia Tensor Cores Supported Combinations +The complete set of matrix data types and shapes that are supported by +the `ext_oneapi_cuda` backend are represented in the following +table. In this architecture's implementation, +the type of the A matrix must be the same as the type of the B +matrix. Also, the type of the C matrix must be the same as the type of the D +matrix. + +IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target +arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must +be used, where `sm_xx` must be a Compute Capability that is equal to +or greater than the appropriate Minimum Compute Capability. When an +executable has been compiled for `sm_xx`, if the executable is run on +a device with compute capability less than `sm_xx` then an error will +be thrown. The mapping to Minimum Compute Capability from each +supported parameter combination is specified in the following table. + + +[frame="none",options="header"] +|====================== +| A and B type | C and D type | M | N | K | Minimum Compute Capability +.3+| `matrix_type::fp16` .3+| `matrix_type::fp32` +|16 |16 |16 .6+| sm_70 +|8 |32 |16 +|32 |8 |16 +.3+| `matrix_type::fp16` .3+| `matrix_type::fp16` +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +.3+| `matrix_type::int8` .3+| `matrix_type::int32` +|16 |16 |16 .6+| sm_72 +|8 |32 |16 +|32 |8 |16 +.3+|`matrix_type::uint8` .3+|`matrix_type::int32` +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +| `matrix_type::tf32` | `matrix_type::fp32` |16 |16 |8 .5+| sm_80 +.3+|`matrix_type::bf16` .3+| `matrix_type::fp32` +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +| `matrix_type::fp64` | `matrix_type::fp64` |8 |8 |4 +|====================== + +IMPORTANT: The `stride` argument to `joint_matrix_load` and +`joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a +multiple of 4 when `T` is `float`; where `T` is the type of the +`joint_matrix` elements. When `T` is not `half` or `float` there are +no restrictions to `stride`. + +=== Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
+|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS +|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise +operations support +|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the +new matrix use parameter and remove reference to the AOT AMX initial +implementation +|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it +portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and move +the Intel-specifics to a separate extension document +|6 |2023-01-09 |Dounia Khaldi |Add `joint_matrix_apply` API, tf32 +type, runtime query, and supported combinations appendix for Intel AMX +and Intel XMX +|7 |2023-04-11 |Jack Kirk |Add Nvidia Tensor Cores supported combinations +|====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc deleted file mode 100644 index 883c73c655217..0000000000000 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ /dev/null @@ -1,155 +0,0 @@ -# Additional Intel-only specifics about matrix extension for DPC++ - -:source-highlighter: coderay -:coderay-linenums-mode: table -:dpcpp: pass:[DPC++] - -// This section needs to be after the document title. -:doctype: book -:toc2: -:toc: left -:encoding: utf-8 -:lang: en - -:blank: pass:[ +] - -// Set the default source code type in this document to C++, -// for syntax highlighting purposes. This is needed because -// docbook uses c++ and html5 uses cpp. -:language: {basebackend@docbook:c++:cpp} - - -== Notice - -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. - -NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are -trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. -used by permission by Khronos. - -This extension is written against the SYCL 2020 revision 5 specification. All -references below to the "core SYCL specification" or to section numbers in the -SYCL specification refer to that revision. - -**_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. - This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. - -## Introduction -The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. -This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. - -## Feature test macro - -This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. 
-Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's APIs the implementation supports. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. -|====================== - - -## Extra Functionality - -### Layout -Besides row major and column major layouts, `layout` introduces the custom layout packed layout that refers to the VNNI format descibed in the following section. - -```c++ -namespace sycl::ext::intel::experimental::matrix { -enum class layout { - packed -}; -} -``` - - -### Layout argument in `joint_matrix_load` -`layout` in `joint_matrix_load` can take `packed` as argument to specify that the data has already been transformed into VNNI format (`packed`). in this case, `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. - -In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. - -IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. - -### Store Operation -Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well. - -#### Store -```c++ -namespace sycl::ext::intel::experimental::matrix { - template - void joint_matrix_store(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride); -} -``` - - -## VNNI/Packed Layout -Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. - -#### Example 1: 16-bit elements - // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. - // Here, packing of 2 elements is needed to form 32 bits. - // Element a1 is contiguous in memory with element a2, etc. 
- // --------------------------------- - // a1, a2, b1, b2, c1, c2, d1, d2 - // a3, a4, b3, b4, c3, c4, d3, d4 - -#### Example 2: 8-bit elements - - // Example of a 4 row x 4 column matrix using a 8-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. - // Here, packing of 4 elements is needed to form 32 bits. - // Elements a1, a2, a3, a4 are contiguous in memory, etc. - // --------------------------------- - // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 - -## Supported Combinations Per Hardware - -The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. - -### Intel AMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 -|====================== - -### Intel XMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 -| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 -| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 -|====================== - -## Open Questions -- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes.