[SYCL][Spec] Update the matrix spec based on new use argument #6662
Conversation
…til implementation of the new one is stable
```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc,
```
Does Intel hardware support `Ta != Tb`? I'm surprised if they would ever be different types?

Also, in the CUDA and HIP AMD backends it is not necessary to provide the `matrix_layout` (doc should be updated to "layout" at some point) template parameter to `joint_matrix_mad`: it looks like it is always necessary for Intel hardware? If so, I can add it to the CUDA implementation so that the interfaces match.
```c++
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_store(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
```
Ah, I forgot that Intel hardware supports store for non-accumulator matrices. I'll update #6657, adding the `use` template parameter and throwing a runtime error if it is not `use::accumulator` in the CUDA backend.
We remove L from the template arguments of store and replace it with "unused" in joint matrix.
Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If users do not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
Suggested change:

```diff
-IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
+IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
```
I guess if you are removing `packed_b` and `packed_a` this sentence will change like this?
```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
```
There is a potential issue here: `Rows` and `Cols` have default values, but there isn't a default value for the `use` parameter. Assuming the Rows/Cols default value is necessary long term (it isn't in the CUDA backend), should we switch the order of the parameters here?
@yubingex007-a11y, the "use" implementation seems to override the default values for rows and cols. Can you please check that?

I think `use` should come second in order so we don't have to set a default value for it.
```c++
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
Can you clarify the distinction between the `matrix_layout L` template parameter and the `matrix_layout memL` parameter here?
Specifically, do you plan on keeping the `matrix_layout L` template parameter in `joint_matrix_load` and `joint_matrix_store`? Perhaps you have only kept this in because you think it is necessary for compatibility with the CUDA backend, since I notice that in the latest implementation, #5835, L is set as `layout::unused` by default?

In the CUDA implementation I took the approach of overloading `joint_matrix_load`: in the first overload, `void joint_matrix_load(...)`, `layout` is provided as a normal runtime parameter. This is used in the CUDA backend to load the accumulator matrices. The second overload, `void joint_matrix_load(...)`, does not take a `layout` parameter; instead the layouts of the matrices are inferred from the `joint_matrix` arguments.
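For reference, here is a rough sketch of the two overload shapes described above, written against the old-style signatures quoted in this thread; it is an illustration, not a verbatim copy of the CUDA implementation:

```c++
// Sketch only: the two overload shapes described above.

// Overload 1: the memory layout is a runtime argument. In the CUDA backend
// this is the form used to load the accumulator matrices.
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_use U, matrix_layout L, access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);

// Overload 2: no runtime layout argument; the layouts are inferred from the
// joint_matrix template arguments. Used for the A and B matrices.
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_use U, matrix_layout L, access::address_space Space>
void joint_matrix_load(Group sg,
                       joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
                       multi_ptr<T, Space> src, size_t stride);
```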
In the Intel backend can the layouts of all matrices be specified at runtime? Or is it the case that, as with the CUDA backend, some of them require that the layout is known at compile time?
Same here, we remove L from the template arguments of store and replace it with "unused" in joint matrix.
```c++
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout La, matrix_layout Lb,
          matrix_layout Lc>
joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
```
Suggested change:

```diff
-joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
+void joint_matrix_mad(Group sg, joint_matrix<Tc, M, N, matrix_use::accumulator, Lc, Group> &D, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
```
Is it acceptable to make the above change to fully support the CUDA functionality as we discussed and as I implemented here: 4949464
Also I think that the A, B, C matrices should also be passed by reference.
What happens if we just add a new type for D matrix but keep it as a returned argument of joint_matrix_mad?
The issue is that the data type held by the D matrix can differ from the data type held by the C matrix, for example if type C is fp16 and type D is fp32; does the Intel backend not have such cases? The only way to deal with this is using information provided by the user via arguments to joint_matrix_mad. The cleanest way to do this is the above change, where the data type of D can be inferred from the joint_matrix D argument. This change allows us to implement two missing mma operations in the CUDA backend that have been added in this commit: e55e5f0.
Another option is an optional output "precision" fourth argument to `joint_matrix_mad` that specifies the type of D: in any case I don't see how a fourth argument to `joint_matrix_mad` can be avoided if we want to enable the missing cases introduced in e55e5f0.
The only other option I see is to not allow the type of D to differ from the type of C. I don't see how adding a new type for D matrix and keeping it as a returned argument can affect this in any way.
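To make the problem concrete, here is a hypothetical sketch of the by-reference form in which the type of D is inferred from an explicit `joint_matrix` argument; the extra `Td`/`Ld` parameters and the exact signature are illustrative, not agreed spec wording:

```c++
// Hypothetical sketch: a separate Td lets D (e.g. fp32) hold a different type
// than C (e.g. fp16), and the implementation can infer it from the D argument.
template <typename Group, typename Td, typename Ta, typename Tb, typename Tc,
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout Ld, matrix_layout La, matrix_layout Lb, matrix_layout Lc>
void joint_matrix_mad(Group sg,
                      joint_matrix<Td, M, N, matrix_use::accumulator, Ld, Group> &D,
                      joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
                      joint_matrix<Tb, K, N, matrix_use::matrix_b, Lb, Group> &B,
                      joint_matrix<Tc, M, N, matrix_use::accumulator, Lc, Group> &C);
```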
We agreed the first resolution here is to add a new template type Td.
Also, replace Lc, La, Lb by unused
@JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?
> @JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?

Yes that's right.
I meant we should NOT replace them, sorry.
Yeah I meant to keep the template parameters but set them as default unused since I think that in the Intel backend you don't use them, and always use joint_matrix with layout::unused? In the CUDA backend they can be inferred from the joint_matrix so also don't have to be explicitly specified.
That's right, good
By the way, I corrected a lot of typos that are still relevant for the new document here: https://github.com/intel/llvm/pull/6525/files Basically if you ignore the new sections I added there (#### Binary Multiply and Add and #####

Yes, I will add these corrections here. Thanks.

I have a question about the old specification, which is now named "sycl_ext_oneapi_deprecated_matrix_no_use". Is that version of the API still supported? Does (or will) the implementation print a warning message for applications that use the old API? Or, is the old API no longer supported, and it is (or will be) removed? Both options are OK for an experimental extension like this. However, we should treat the documentation differently in the two cases.
IMPORTANT: In the current implementation, only the static extent is supported
#### Use
Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
Suggested change:

```diff
-Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
+Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator (C) is required by backend implementations to reason about the layout of the matrix in registers.
```
Also (C) is rendered as the copyright sign so this needs to be corrected somehow.
```c++
enum class use {
  a,
  b,
  accumulator,
```
Suggested change:

```diff
-  accumulator,
+  accumulator
```
```c++
  row_major,
  col_major,
  packed,
  unused,
```
Suggested change:

```diff
-  unused,
+  unused
```
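As a usage illustration only (sizes and element types are made up; this assumes the `joint_matrix` template shown later in this thread, `<T, use, Rows, Cols, layout, Group>`), the two enums appear as template arguments of `joint_matrix`:

```c++
// Illustrative declarations; not taken from the spec examples.
using namespace sycl::ext::oneapi::experimental::matrix;

joint_matrix<int8_t, use::a, 8, 32, layout::row_major> tA;
joint_matrix<int8_t, use::b, 32, 16, layout::packed> tB;
joint_matrix<int32_t, use::accumulator, 8, 16> tC; // layout defaults to layout::dynamic
```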
#### Layout
Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
It isn't clear here as a reader how "symmetric" or "tiled" layouts relate to the values of the enum class "layout" below.
I will remove them
```c++
}
```

IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
Suggested change:

```diff
-IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
+IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified
```
```c++
}
```

IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
Suggested change:

```diff
-IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
+IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required
```
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS_
## Introduction
This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
Suggested change:

```diff
-This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
+This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different matrix hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
```
I personally think it is better to avoid using "tensor" apart from where absolutely necessary, i.e. for Tensor Cores (c), because the word "tensor" means something different from "matrix": a matrix is a tensor if there exist transformational rules for the matrix elements with respect to a coordinate system, and this property that distinguishes tensors is actually not relevant at all for the hardware operations and this extension more generally. A (any-dimensional) matrix is not necessarily a tensor although a matrix can also be a tensor: in situations where all matrices operated on are either all tensors or all not tensors it doesn't make much difference whether we refer to "matrices" as "tensors", but for applications where both (non-tensor) "matrices" and "tensors" are used, proper use of language becomes important.

I'm not sure where the trend of replacing (any-dimensional) "matrix" with "tensor" originated, perhaps it was when "tensorflow" sounded better than "matrixflow", but I don't think there is really a reason for adopting this trend in this extension, particularly since we have adopted the naming conventions of `joint_matrix` etc. rather than e.g. `joint_tensor`.
A matrix is technically a 2d tensor and they are both mathematical objects.
tensorflow supports more than 2 dimensions. That's why I believe it is not called matrixflow.
I tend to refer to the hardware as tensor hardware because even though they are matrix hardware (supporting only 2 dimensions), they enable n-dimensional tensor computing (if the code is inside loops and such). Also, they are called TPUs in general in the literature. That's why the Nvidia tensor cores are called as such, for instance. Habana uses tensor cores terminology as well.
Of course, we should refer to this extension as matrix because it enables only 2d computations.
A matrix is only a (representation of a) 2-tensor (in a given basis) if it has a basis-independent meaning, likewise for a 3d matrix (3-array) etc. I personally think that Tensor Processing Unit is a terrible name for the literature to have adopted for Processing Units that perform basis-dependent computations on 2-dimensional arrays (matrices) that may be part of n-dimensional arrays (n-dimensional matrices) for n>=3, which may or may not be part of a program that grants the n-matrix a tensorial nature. Here is an article that mentions the question of over-use of tensor in ML, https://www.linkedin.com/pulse/tensorsand-applications-ml-demystified-andriy-strepetilov/, with a mostly reasonable description of the meaning of tensor. It's not worth worrying about too much but I guess... for an ML algorithm you can probably define a conservation law (conserving the output) and then define a geometry (set of transformations that satisfy the conservation law) for the inputs such that the inputs are invariant with respect to those transformations (then they are tensors). So maybe "tensor"flow is justified that way. But for the TPUs I don't see how using "tensor" leads to anything but confusion personally. If we adopt "tensor" to mean n-matrix (n-array), then what will we call something that has tensorial properties...
Probably it doesn't matter at this point because, as you say, TPU appears to have become accepted as the general term in the literature. But when a word is misappropriated in society, meaning tends to be lost, and society can become poorer for it.
IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
Suggested change:

```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: Since the current Tensor Cores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores AOT backend.
```

Actually there are some aspects of the Tensor Cores backend that are not AOT, since there is JIT compilation from the ptx level to SASS level when the code is run. I think this would be more accurate and simpler personally:

```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: It is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.
```
AOT is used here to reflect the SYCL compilation flow: it does not use SPIR-V and the target is specified at compile time, so this is basically AOT compilation for SYCL. But I will change the sentence to one that does not talk about the AOT detail.
## VNNI/Packed Layout
Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
Suggested change:

```diff
-Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
+Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
```
The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
Suggested change:

```diff
-The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
```
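To illustrate the transformation described above, here is a small hypothetical host-side helper (not part of the extension) that packs a row-major K x N matrix of 16-bit elements into the VNNI/`packed` layout with blocking factor 2:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack a row-major K x N matrix of 16-bit elements into
// the VNNI/"packed" layout with blocking factor 2, so that 32 bits of K-data
// become contiguous in memory. Assumes K is a multiple of the blocking factor.
std::vector<std::uint16_t> pack_vnni_16bit(const std::vector<std::uint16_t> &b,
                                           std::size_t K, std::size_t N) {
  constexpr std::size_t factor = 2; // would be 4 for 8-bit types
  std::vector<std::uint16_t> packed(K * N);
  for (std::size_t k = 0; k < K; k += factor)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t i = 0; i < factor; ++i)
        // element (k + i, n) of the row-major matrix lands next to its
        // K-neighbour in the packed buffer of shape (K/factor) x (N*factor)
        packed[(k / factor) * N * factor + n * factor + i] = b[(k + i) * N + n];
  return packed;
}
```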
```c++
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
Int32_t *memC = malloc_shared<int32_t>(M*N, q);
```
Suggested change:

```diff
-Int32_t *memC = malloc_shared<int32_t>(M*N, q);
+int32_t *memC = malloc_shared<int32_t>(M*N, q);
```
- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.

- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
Suggested change:

```diff
-- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
+- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
```
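As a sketch of how a library might consume such a query result, the snippet below mirrors the description above with hypothetical type and field names; it is not the actual query API:

```c++
#include <cstddef>

// Hypothetical sketch only: names mirror the prose above, not the real query API.
struct combination {
  std::size_t max_msize, max_nsize, max_ksize; // "max_*" variant (continuous sizes, e.g. AMX)
  std::size_t msize, nsize, ksize;             // discrete variant (e.g. DPAS)
  // element types for A, B and C would also be part of a combination
};

// A library could scan the reported combinations to pick a tile shape that
// requires no padding for its M, N, K.
inline bool shape_supported(const combination *combinations, std::size_t count,
                            std::size_t m, std::size_t n, std::size_t k) {
  for (std::size_t i = 0; i < count; ++i) {
    const combination &c = combinations[i];
    if ((c.msize == m && c.nsize == n && c.ksize == k) ||
        (c.max_msize != 0 && m <= c.max_msize && n <= c.max_nsize && k <= c.max_ksize))
      return true;
  }
  return false;
}
```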
|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C
Should this refer to `joint_matrix_accumulator`?
This is a good question. What do you think @gmlueck ?
You have this inconsistency throughout the spec. In many places, you refer to the three matrices as `a`, `b`, and `c`. However, the `use` enum refers to them as `a`, `b`, and `accumulator`. You should decide on one or the other and then use it consistently throughout. Since you call the first two matrices `a` and `b`, it seems natural to me to call the third matrix `c`. Is there a good reason to call the third matrix `accumulator`?
```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```
Suggested change:

```diff
-joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
+joint_matrix<Tc, use::accumulator, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```
```c++
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
    joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
```
Suggested change:

```diff
-joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
+joint_matrix<Ta, use::a, M, K, layoutA, Group> &A,
```

etc.
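Putting the pieces together, here is a rough kernel-body sketch of the load/mad/store interfaces as they appear elsewhere in this thread; the tile sizes and element types are made up, and `sg`, `pA`, `pB`, `pC` are assumed to be a `sub_group` and suitable `multi_ptr`s provided by the surrounding kernel:

```c++
// Rough sketch only; A and C are assumed row-major in memory, B pre-packed (VNNI).
constexpr size_t tM = 8, tN = 16, tK = 32;   // tile shape (made up)
constexpr size_t N = 128, K = 128;           // global matrix sizes (made up)

joint_matrix<int8_t, use::a, tM, tK, layout::row_major> tA;
joint_matrix<int8_t, use::b, tK, tN, layout::packed> tB;
joint_matrix<int32_t, use::accumulator, tM, tN> tC; // layout defaults to layout::dynamic

joint_matrix_load(sg, tC, pC, N, layout::row_major); // accumulator: layout at runtime
joint_matrix_load(sg, tA, pA, K);                    // A and B: layout taken from the type
joint_matrix_load(sg, tB, pB, N * 4);                // stride of the packed (VNNI) B buffer
tC = joint_matrix_mad(sg, tA, tB, tC);
joint_matrix_store(sg, tC, pC, N, layout::row_major);
```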
```c++
  joint_matrix<T, Use, NumRows, NumCols, layout::dynamic, Group> &res,
  multi_ptr<T, Space> src, size_t stride, layout memL);

template <typename Group, typename T, size_t NumRows, size_t NumCols,
```
Suggested change:

```diff
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
+template <typename Group, typename S, typename T, size_t NumRows, size_t NumCols,
+          use Use, layout Layout, access::address_space Space>
+void joint_matrix_load(Group sg,
+                       joint_matrix<S, Use, NumRows, NumCols, Layout, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride);
```
We need typename S as an implementation detail so we can support const T cases via `std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value>`.
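One plausible way to spell that constraint on the declaration (a sketch, not necessarily the implementation's exact form):

```c++
// Sketch: S is the joint_matrix element type, T the (possibly const) pointee
// type of the source pointer; the overload participates only when they match
// after removing const, which allows loading from multi_ptr<const T, Space>.
template <typename Group, typename S, typename T, size_t NumRows, size_t NumCols,
          use Use, layout Layout, access::address_space Space,
          std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value, bool> = true>
void joint_matrix_load(Group sg,
                       joint_matrix<S, Use, NumRows, NumCols, Layout, Group> &res,
                       multi_ptr<T, Space> src, size_t stride);
```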
@intel/dpcpp-doc-reviewers, any more reviews on this?

@intel/dpcpp-doc-reviewers, can you please help merge this if there are no more comments?

@dkhaldi, please move deprecated extension to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/deprecated

done

@intel/dpcpp-doc-reviewers, please help merge.

Ping @intel/dpcpp-specification-reviewers

Sorry for the delay. This is the next thing on my review list.
Some questions below, but also one general comment:
Since you moved the deprecated version of the API spec to "doc/extensions/deprecated", there isn't any need for the "doc/extensions/experimental/sycl_ext_oneapi_matrix" directory. Can you move the spec from:
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
to:
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc
This also means that you can undo the changes to "ReleaseNotes.md".
```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, use Use,
          size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
          layout Layout = layout::dynamic, typename Group = sub_group>
```
Can you remind me of the rules for using `Layout` if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

We want the default template parameters to result in portable code, and I see that the template parameter here defaults to `dynamic`. Is that portable?

Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.
> Can you remind me of the rules for using `Layout` if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

Yes the A/B matrices cannot be dynamic for cuda: we use partial specializations for all valid joint_matrix declarations in the cuda backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the cuda backend.

> We want the default template parameters to result in portable code, and I see that the template parameter here defaults to `dynamic`. Is that portable? Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.

FYI for the cuda backend it is also useful for layout::dynamic to be set as default because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix: as here in the test for the portable interfaces:
> Yes the A/B matrices cannot be dynamic for cuda: we use partial specializations for all valid joint_matrix declarations in the cuda backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the cuda backend.

This restriction isn't conveyed by this spec. Do we expect this specification to describe a common API that includes CUDA?

> FYI for the cuda backend it is also useful for layout::dynamic to be set as default because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix

I agree that this is useful. This could be achieved by defaulting the `Layout` parameter conditionally, depending on the value of `Use`.
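For illustration, one way such a conditional default could be written (a sketch only; the non-accumulator default shown here, `layout::row_major`, is just a placeholder choice, not a proposal):

```c++
// Sketch: the Layout default depends on Use, so only accumulator matrices
// default to layout::dynamic.
template <typename T, use Use,
          size_t Rows = sycl::dynamic_extent, size_t Cols = sycl::dynamic_extent,
          layout Layout =
              (Use == use::accumulator ? layout::dynamic : layout::row_major),
          typename Group = sub_group>
struct joint_matrix;
```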
#6662 (comment) is relevant here, but I'm sure @dkhaldi will explain the details.
Please see under "Layout" subsection:
"Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."
I only talk about what is allowed for AMX and DPAS. This is not conveying the restriction about the CUDA backend because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in joint matrix type (dynamic is the default for only matrix C) and in the joint_matrix_load (load that takes layout as argument is only for C matrix).
So the restrictions that apply for the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, this means we are restricting Intel backend as well.
@JackAKirk, @gmlueck do you want to proceed differently?
You are right about prioritizing the portable case. But one of the outcomes of that meeting as well is that the restrictions will still be made in the text of the spec, not in the syntax. Not just yet.
If we make the restriction in the syntax, we are excluding the Intel flexibility.
In the text, we should add that for the CUDA backend, dynamic as default applies only to the accumulator matrix.
One of the reasons for that outcome is that I did not want to make too many changes at once.
The change you are suggesting (in the syntax) should be made in a separate PR after we discuss that specific issue more.
Can you please add this as an open issue in the list at the bottom of this spec, then? I just want to make sure we do not forget about issues like this.
Here is a full implementation of the agreed interfaces: #7077 with a working cuda implementation. See that we still use the layout::dynamic default parameter here:

```c++
struct joint_matrix;
```

but as I mentioned the cuda implementation won't allow a joint_matrix to be constructed with (use::a || use::b) && layout::dynamic.
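For illustration, one way a backend could surface that restriction at compile time is with a `static_assert` (a sketch only; as noted above, the actual CUDA implementation relies on providing partial specializations only for the valid combinations):

```c++
// Sketch: reject (use::a || use::b) combined with layout::dynamic at compile time.
template <typename T, use Use, size_t Rows, size_t Cols,
          layout Layout = layout::dynamic, typename Group = sub_group>
struct joint_matrix {
  static_assert(Use == use::accumulator || Layout != layout::dynamic,
                "A and B matrices require a static (non-dynamic) layout");
  // storage and members omitted
};
```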
I will add it to open issues.
Note that this is not the only non-portable code that is included in this document. We still need to sort out things like the "packed" layout.
Once we have the full support of this new API (in a month or so) mentioned by Jack (#7077), I will open up discussion for the non-portable scenarios that are still in this document.
Please see under "Layout" subsection: "Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."
I only talk about what is allowed for AMX and DPAS. This is not conveying the restriction about the CUDA backend because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in joint matrix type (dynamic is the default for only matrix C) and in the joint_matrix_load (load that takes layout as argument is only for C matrix).
So the restrictions that apply for the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, this means we are restricting Intel backend as well.
@JackAKirk, @gmlueck do you want to proceed differently?
I think this is fine as you describe.
If you think that some information is more appropriate in a cuda only extension or intel only extension or the main matrix extension, then I don't mind moving it, especially since these extensions are all experimental.
I imagine that it is more traditional that extensions are additive rather that subtractive, but I don't think this is a big issue at the moment: so long as the interfaces (note that I don't say "code" here since DPAS and Tensor Cores don't ever have cases where parameters provided to the interfaces are portable) written by a user are fully portable across cuda and intel for the cases where the hardware allows them to (all the practically useful cases) then this is the main thing.
```c++
void joint_matrix_store(Group sg,
                        joint_matrix<T, Use, NumRows, NumCols, Layout, Group> &res,
                        multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```
Why aren't there two versions of `joint_matrix_store`, as you have for `joint_matrix_load`?
No need for two versions. In both Intel and CUDA backends, store must take layout as a dynamic argument.
@JackAKirk please confirm
What if the template parameter of the matrix is something other than `dynamic`, though? For example, consider the case when `Layout` is `row_major`. Does it still make sense to require the user to specify a layout parameter at runtime to `joint_matrix_store`? What if the layout they specify at runtime is different from the `Layout` template parameter? Is that OK?
In the CUDA case, you can only store the accumulator matrix.
For the Intel backend, besides C, you can store to matrix A or B. In this case, if users decide to do that, they assume it is not portable code, so they would use the version of joint_matrix where the A and B layouts are dynamic.
This is not something we needed to consider in the Nvidia backend because the Nvidia backend doesn't support native implementations of `joint_matrix_store` for anything other than accumulator (and we don't plan on implementing it at all: actually we shouldn't be able to implement it as a software solution because it requires knowledge of the mapping of the joint_matrix registers to the original matrix, and this information is not released in Nvidia documentation). My understanding is that the Intel backend may want to support `joint_matrix_store` for use::a and use::b and that it can choose whatever interface it wants for this overload of `joint_matrix_store`: for the accumulator case the interface that the user uses for `joint_matrix_store` is identical for Intel and CUDA in the agreed version; I think this is all that matters. I thought that the `use::a`/`use::b` cases might be added to an intel-specific extension document, similar to what we did for cuda in #6968? If you want to add it in the main document instead and just note that it isn't supported for cuda, I think that is fine too.
|Value |Description
|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The exact API used for this implementation is detailed in [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc)
|2 |joint matrix type has a new `use` parameter. `layout` on matrix is optional for `C` matrix. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
|3 |Implementation on Nvidia Tensor Cores
Does this specification also apply to Nvidia, or is that still a separate specification? If it is still separate, can there be a link here to that other spec?
Here is the PR for the sycl_ext_oneapi_matrix_cuda extension that is fully compatible with this extension (sycl_ext_oneapi_matrix) but built on top of it as an optional (CUDA-only) extension: #6968
This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that only apply to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?
> This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that only apply to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?

Yeah sure I can move it to that folder.
The folder is still necessary to add the CUDA backend's additional specific API (like marray-based element-wise operations). See #6968.
```c++
}
```

IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator`
What is the default for `use::a` and `use::b`? If there is no default, you should say that.

Note that this does not restrict the use of `layout::dynamic`, it only changes the default. Your commit comment seems to imply that you want to restrict the use of `layout::dynamic`. Was that your intention?
There is no default for use::a and use::b. The default value `dynamic` applies only to matrix::accumulator.
In the commit comment, by "restrict the API in this document to only the portable API", I meant you can only write portable code with this documented API.
Is the note in the text good enough?
"IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator`"
> In the commit comment, by "restrict the API in this document to only the portable API", I meant you can only write portable code with this documented API.

If you still want to support `layout::dynamic` for `use::a` and `use::b`, then you have a bug in the definition of `joint_matrix_load` because that function only allows loading a matrix with dynamic layout for `use::accumulator`.
It is the portable API, so it should not allow `layout::dynamic` for `use::a` and `use::b`.
To support this case, it will be added to the intel-specific extension.
```c++
                        multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```

This function stores the data from the 2d tiles back to memory.
With your latest changes, it is only possible to store the Accumulator, not A or B. It would be good to say that here.
Okay
@intel/dpcpp-specification-reviewers, please help merge
…ce (#7077) CUDA backend implementation using the "unified" matrix extension interface. The same interface will be used for a future Intel backend implementation of the matrix extension.

- New "unified" interface uses SYCL_EXT_ONEAPI_MATRIX_VERSION=4
- `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad` and `joint_matrix` interfaces match the new spec from #6662
- Separated `joint_matrix_*` functions into new header matrix-unified.hpp: Intel backend implementations can be called from the same functions in the future.
- C++17 everywhere in line with #6678
- Updated device code tests to use new interfaces
- Completely removed uint16 implementations that are replaced by bfloat16 that is being moved out of the experimental namespace
- Updated all CUDA runtime matrix tests here: intel/llvm-test-suite#1183

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>