
[SYCL][Spec] Update the matrix spec based on new use argument #6662

Merged: 28 commits, Oct 31, 2022

Conversation

dkhaldi (Contributor Author) commented Aug 29, 2022:

No description provided.


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc,
```

Contributor:

Does Intel hardware support Ta != Tb? I'm surprised if they would ever be different types?

Also, in the CUDA + HIP AMD backends it is not necessary to provide the matrix_layout (doc should be updated to "layout" at some point) template parameter to joint_matrix_mad: it looks like it is always necessary for Intel hardware? If so, I can add it to the CUDA implementation so that the interfaces match.

```c++
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_store(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
```
Contributor:

Ah I forgot that Intel hardware supports store for non accumulator matrices, I'll update #6657 adding the use template parameter and throw a runtime error if it is not use::accumulator in the CUDA backend.

dkhaldi (Contributor Author):

We remove L from the template arguments of store and replace it with `unused` in `joint_matrix`.


Note that to get maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If the user does not specify the packed layout, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B has already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.

IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
Contributor:

Suggested change:
```diff
-IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
+IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
```

I guess if you are removing packed_b and packed_a this sentence will change like this?
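For illustration, a hedged sketch of what loading a pre-packed B tile could look like under that wording, assuming the rename to `packed` goes through (`tB`, `pB`, and the `N * 2` stride for a 16-bit element type are placeholder assumptions, not spec text):

```c++
// B was pre-packed (VNNIed) by the user; the layout argument tells the
// implementation so, avoiding a slow gather/scatter transform.
joint_matrix_load(sg, tB, pB, N * 2, matrix_layout::packed);
```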


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
```

Contributor:

There is a potential issue here: Rows and Cols have default values, but there isn't a default value for the use parameter. Assuming the Rows/Cols default value is necessary long term (it isn't in the CUDA backend), should we switch the order of the parameters here?

dkhaldi (Contributor Author):

@yubingex007-a11y, "use" implementation seems to override the default values for rows and cols. Can you please check that?
I think `use` should come second in order so we don't have to set a default value for it.

```c++
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
Contributor:

Can you clarify the distinction between the matrix_layout L template parameter and matrix_layout memL parameter here?

JackAKirk (Contributor) commented Sep 1, 2022:

Specifically, do you plan on keeping the matrix_layout L template parameter in joint_matrix_load and joint_matrix_store?
Perhaps you have only kept this in because you think it is necessary for compatibility with the CUDA backend, since I notice that in the latest implementation, #5835, L is set as layout::unused by default?
In the CUDA implementation I took the approach of overloading joint_matrix_load: the first overload was intended to be compatible with the Intel backend, because layout is provided as a normal runtime parameter. This is used in the CUDA backend to load the accumulator matrices.
The second overload was intended to only be used in the CUDA and HIP AMD backends, because it is required for the A/B "multiplicand" matrices, for which the layout is required at compile time. In this overload there is no runtime layout parameter; instead the layouts of the matrices are inferred from the joint_matrix arguments.

In the Intel backend can the layouts of all matrices be specified at runtime? Or is it the case that, as with the CUDA backend, some of them require that the layout is known at compile time?

dkhaldi (Contributor Author):

Same here: we remove L from the template arguments of store and replace it with `unused` in `joint_matrix`.

```c++
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout La, matrix_layout Lb,
          matrix_layout Lc>
joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
+void joint_matrix_mad(Group sg, joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> &D, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
```

Is it acceptable to make the above change to fully support the CUDA functionality as we discussed and as I implemented here: 4949464

Also I think that the A, B, C matrices should also be passed by reference.

dkhaldi (Contributor Author):

What happens if we just add a new type for the D matrix but keep it as the return value of joint_matrix_mad?

JackAKirk (Contributor) commented Sep 1, 2022:

The issue is that the data type held by the D matrix can differ from the data type held by the C matrix: for example, type C may be fp16 while type D is fp32. Does the Intel backend not have such cases? The only way to deal with this is using information provided by the user via arguments to joint_matrix_mad. The cleanest way to do this is the above change, where the data type of D can be inferred from the joint_matrix D argument. This change allows us to implement two missing mma operations in the CUDA backend that have been added in this commit: e55e5f0.

Another option is an optional output "precision" fourth argument to joint_matrix_mad that specifies the type of D; in any case I don't see how a fourth argument to joint_matrix_mad can be avoided if we want to enable the missing cases introduced in e55e5f0.

The only other option I see is to not allow the type of D to differ from the type of C. I don't see how adding a new type for D matrix and keeping it as a returned argument can affect this in any way.
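For concreteness, a minimal sketch of the by-reference form being discussed, assuming a separate Td template parameter so the D type can be deduced from the argument (names and ordering are illustrative, not the final spec wording):

```c++
// Sketch: Td may differ from Tc (e.g. Tc = half, Td = float); passing D
// by reference lets the implementation deduce Td without a 4th argument.
template <typename Group, typename Ta, typename Tb, typename Tc, typename Td,
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout La, matrix_layout Lb, matrix_layout Lc>
void joint_matrix_mad(Group sg,
                      joint_matrix<Td, M, N, matrix_use::accumulator, Lc, Group> &D,
                      joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
                      joint_matrix<Tb, K, N, matrix_use::matrix_b, Lb, Group> &B,
                      joint_matrix<Tc, M, N, matrix_use::accumulator, Lc, Group> &C);
```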

dkhaldi (Contributor Author):

We agreed the first resolution here is to add a new template type Td.
Also, Lc, La, and Lb are replaced by `unused`.

dkhaldi (Contributor Author) commented Sep 9, 2022:

@JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?

Contributor:

> @JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?

Yes, that's right.

dkhaldi (Contributor Author):

I meant we should NOT replace them, sorry.

Contributor:

Yeah, I meant to keep the template parameters but set them to `unused` by default, since I think that in the Intel backend you don't use them and always use joint_matrix with layout::unused? In the CUDA backend they can be inferred from the joint_matrix, so they also don't have to be explicitly specified.

dkhaldi (Contributor Author):

That's right, good

JackAKirk (Contributor) commented Sep 6, 2022:

By the way, I corrected a lot of typos that are still relevant for the new document here: https://github.com/intel/llvm/pull/6525/files

Basically if you ignore the new sections I added there (#### Binary Multiply and Add and ##### wi_data as an marray for Nvidia® Tensor Cores) everything else is just small corrections that you could consider adding here.

dkhaldi (Contributor Author) commented Sep 9, 2022:

> By the way, I corrected a lot of typos that are still relevant for the new document here: https://github.com/intel/llvm/pull/6525/files
>
> Basically if you ignore the new sections I added there (#### Binary Multiply and Add and ##### wi_data as an marray for Nvidia® Tensor Cores) everything else is just small corrections that you could consider adding here.

Yes, I will add these corrections here. Thanks.

@dkhaldi dkhaldi marked this pull request as ready for review September 13, 2022 14:34
@dkhaldi dkhaldi requested a review from a team as a code owner September 13, 2022 14:34
gmlueck (Contributor) commented Sep 13, 2022:

I have a question about the old specification, which is now named "sycl_ext_oneapi_deprecated_matrix_no_use". Is that version of the API still supported? Does (or will) the implementation print a warning message for applications that use the old API?

Or, is the old API no longer supported, and it is (or will be) removed?

Both options are OK for an experimental extension like this. However, we should treat the documentation differently in the two cases.

IMPORTANT: In the current implementation, only the static extent is supported

#### Use
Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
Contributor:

Suggested change:
```diff
-Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
+Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator (C) is required by backend implementations to reason about the layout of the matrix in registers.
```

Also (C) is rendered as the copyright sign so this needs to be corrected somehow.

```c++
enum class use {
  a,
  b,
  accumulator,
```
Contributor:

Suggested change:
```diff
-accumulator,
+accumulator
```

```c++
  row_major,
  col_major,
  packed,
  unused,
```
Contributor:

Suggested change:
```diff
-unused,
+unused
```



#### Layout
Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
Contributor:

It isn't clear here as a reader how "symmetric" or "tiled" layouts relate to the values of the enum class "layout" below.

dkhaldi (Contributor Author):

I will remove them

```c++
}
```

IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
JackAKirk (Contributor) commented Sep 13, 2022:

Suggested change:
```diff
-IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
+IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified
```

```c++
}
```

IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
Contributor:

Suggested change:
```diff
-IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
+IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required
```

and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS_

## Introduction
This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
JackAKirk (Contributor) commented Sep 13, 2022:

Suggested change:
```diff
-This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
+This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different matrix hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
```

I personally think it is better to avoid using "tensor" apart from where absolutely necessary (i.e., for Tensor Cores), because the word "tensor" means something different from "matrix": a matrix is a tensor if there exist transformational rules for the matrix elements with respect to a coordinate system, and this property that distinguishes tensors is actually not relevant at all for the hardware operations and this extension more generally. A (any-dimensional) matrix is not necessarily a tensor, although a matrix can also be a tensor. In situations where the matrices operated on are either all tensors or all not tensors, it doesn't make much difference whether we refer to "matrices" as "tensors", but for applications where both (non-tensor) "matrices" and "tensors" are used, proper use of language becomes important.

I'm not sure where the trend of replacing (any-dimensional) "matrix" with "tensor" originated, perhaps it was when "tensorflow" sounded better than "matrixflow", but I don't think there is really a reason for adopting this trend in this extension, particularly since we have adopted the naming conventions of joint_matrix etc. rather than e.g. joint_tensor.

dkhaldi (Contributor Author):

A matrix is technically a 2d tensor, and they are both mathematical objects.
tensorflow supports more than 2 dimensions; that's why I believe it is not called matrixflow.

I tend to refer to the hardware as tensor hardware because even though they are matrix hardware (supporting only 2 dimensions), they enable n-dimensional tensor computing (if the code is inside loops and such). Also, they are generally called TPUs in the literature; that's why the Nvidia tensor cores are called as such, for instance. Habana uses tensor cores terminology as well.

Of course, we should refer to this extension as matrix because it enables only 2d computations.

JackAKirk (Contributor) commented Sep 15, 2022:

A matrix is only a (representation of a) 2-tensor (in a given basis) if it has a basis-independent meaning, likewise for a 3d matrix (3-array) etc. I personally think that Tensor Processing Unit is a terrible name for the literature to have adopted for processing units that perform basis-dependent computations on 2-dimensional arrays (matrices) that may be part of n-dimensional arrays (n-dimensional matrices) for n >= 3, which may or may not be part of a program that grants the n-matrix a tensorial nature. Here is an article that mentions the question of over-use of tensor in ML, https://www.linkedin.com/pulse/tensorsand-applications-ml-demystified-andriy-strepetilov/, with a mostly reasonable description of the meaning of tensor. It's not worth worrying about too much, but I guess for a ML algorithm you can probably define a conservation law (conserving the output) and then define a geometry (a set of transformations that satisfy the conservation law) for the inputs such that the inputs are invariant with respect to those transformations (then they are tensors). So maybe "tensor"flow is justified that way. But for the TPUs I don't see how using "tensor" leads to anything but confusion personally. If we adopt "tensor" to mean n-matrix (n-array), then what will we call something that has tensorial properties?
Probably it doesn't matter at this point because, as you say, TPU appears to have become accepted as the general term in the literature. But when a word is misappropriated in society, meaning tends to be lost, and society can become poorer for it.


IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
Contributor:

Suggested change:
```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: Since the current Tensor Cores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores AOT backend.
```

Actually there are some aspects of the Tensor Cores backend that are not AOT, since there is JIT compilation from the ptx level to SASS level when the code is run. I think this would be more accurate and simpler personally:

Suggested change:
```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: It is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.
```

dkhaldi (Contributor Author):

AOT is used here to reflect the SYCL compilation flow: if it does not use SPIR-V and the target is specified at compile time, this is basically AOT compilation for SYCL. But I will change the sentence to the one that does not talk about the AOT detail.
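As a rough illustration of the wi_data idea discussed above (a sketch assuming the `get_wi_data()`/`length()` element-wise API described in this spec; `sub_c` and `alpha` are placeholder names):

```c++
// Each work-item scales only the matrix elements it owns; the mapping of
// elements to work-items is implementation-defined.
auto wi_data_c = sub_c.get_wi_data();
for (int i = 0; i < wi_data_c.length(); ++i)
  wi_data_c[i] *= alpha;
```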



## VNNI/Packed Layout
Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
Contributor:

Suggested change:
```diff
-Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
+Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
```


The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
Contributor:

Suggested change:
```diff
-The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
```
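To make the transform concrete, here is a hedged host-side sketch of the row_major-to-packed repacking for a 16-bit type (the function name and loop structure are illustrative, not part of the extension):

```c++
#include <cstddef>
#include <cstdint>

// Pack a K x N row-major B matrix into VNNI ("packed") form for 16-bit
// types: vnniFactor consecutive K-rows are interleaved so that 32 bits of
// K-data become contiguous in memory. vnniFactor would be 4 for 8-bit types.
void pack_b_vnni(const std::uint16_t *src, std::uint16_t *dst,
                 std::size_t K, std::size_t N) {
  const std::size_t vnniFactor = 2; // 16-bit elements
  for (std::size_t k = 0; k < K; k += vnniFactor)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t v = 0; v < vnniFactor; ++v)
        dst[(k / vnniFactor) * (N * vnniFactor) + n * vnniFactor + v] =
            src[(k + v) * N + n];
}
```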

```c++
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
Int32_t *memC = malloc_shared<int32_t>(M*N, q);
```
Contributor:

Suggested change:
```diff
-Int32_t *memC = malloc_shared<int32_t>(M*N, q);
+int32_t *memC = malloc_shared<int32_t>(M*N, q);
```


- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.

- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
Contributor:

Suggested change:
```diff
-- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
+- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
```
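As a rough usage sketch of the two query forms (the `tpu_params`/`tpu` spellings are assumed from the query section of this extension and may not match the final names; the `combinations` member is the array described above):

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// Default-values form: the element types are given and the sizes are left
// to the implementation; the joint_matrix_a/b/c aliases pick the defaults.
using myparams = tpu_params<tpu::amx, int8_t, int8_t, int>;
myparams::joint_matrix_a<sub_group> tA; // alias usage illustrative

// General-query form: only the TPU is named; the supported size/type
// combinations can then be inspected, e.g. via
// generic_params.combinations[0].max_msize (AMX reports the max_* variant).
constexpr tpu_params<tpu::amx> generic_params;
```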

|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C
Contributor:

Should this refer to joint_matrix_accumulator?

dkhaldi (Contributor Author):

This is a good question. What do you think @gmlueck ?

Contributor:

You have this inconsistency throughout the spec. In many places, you refer to the three matrices as a, b, and c. However, the use enum refers to them as a, b, and accumulator. You should decide on one or the other and then use it consistently throughout. Since you call the first two matrices a and b, it seems natural to me to call the third matrix c. Is there a good reason to call the third matrix accumulator?

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
+joint_matrix<Tc, use::accumulator, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```

```c++
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
    joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
+joint_matrix<Ta, use::a, M, K, layoutA, Group> &A,
```

etc

```c++
    joint_matrix<T, Use, NumRows, NumCols, layout::dynamic, Group> &res,
    multi_ptr<T, Space> src, size_t stride, layout memL);

template <typename Group, typename T, size_t NumRows, size_t NumCols,
```
Contributor:

Suggested change:
```diff
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
+template <typename Group, typename S, typename T, size_t NumRows, size_t NumCols,
+          use Use, layout Layout, access::address_space Space>
+void joint_matrix_load(Group sg,
+                       joint_matrix<S, Use, NumRows, NumCols, Layout, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride);
```

We need typename S as an implementation detail so we can support const T cases via `std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value>`.
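A minimal standalone sketch of that SFINAE pattern (illustrative only, outside of the actual headers):

```c++
#include <type_traits>

// Accept a source element type T that may be a const-qualified version
// of the destination type S, and nothing else.
template <typename S, typename T,
          typename = std::enable_if_t<
              std::is_same<S, std::remove_const_t<T>>::value>>
void load_sketch(S *dst, T *src) { *dst = *src; }

// load_sketch(&d, &s) compiles for float *dst with const float *src,
// but substitution fails for unrelated types.
```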

dkhaldi (Contributor Author) commented Oct 17, 2022:

@intel/dpcpp-doc-reviewers, any more reviews on this?
It will be good if we can merge this ASAP.

@JackAKirk JackAKirk self-requested a review October 17, 2022 19:25
dkhaldi (Contributor Author) commented Oct 21, 2022:

@intel/dpcpp-doc-reviewers, can you please help merge this if there are no more comments?

pvchupin (Contributor):

@dkhaldi, please move deprecated extension to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/deprecated

dkhaldi (Contributor Author) commented Oct 21, 2022:

> @dkhaldi, please move deprecated extension to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/deprecated

done

dkhaldi (Contributor Author) commented Oct 25, 2022:

@intel/dpcpp-doc-reviewers, please help merge.

pvchupin (Contributor):

Ping @intel/dpcpp-specification-reviewers

gmlueck (Contributor) commented Oct 25, 2022:

> @intel/dpcpp-doc-reviewers, please help merge.

Sorry for the delay. This is the next thing on my review list.

gmlueck (Contributor) left a comment:

Some questions below, but also one general comment:

Since you moved the deprecated version of the API spec to "doc/extensions/deprecated", there isn't any need for the "doc/extensions/experimental/sycl_ext_oneapi_matrix" directory. Can you move the spec from:

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

to:

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc

This also means that you can undo the changes to "ReleaseNotes.md".

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, use Use,
          size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
          layout Layout = layout::dynamic, typename Group = sub_group>
```
Contributor:

Can you remind me of the rules for using Layout if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

We want the default template parameters to result in portable code, and I see that the template parameter here defaults to dynamic. Is that portable?

Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.

Contributor:

> Can you remind me of the rules for using Layout if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

Yes, the A/B matrices cannot be dynamic for CUDA: we use partial specializations for all valid joint_matrix declarations in the CUDA backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the CUDA backend.

> We want the default template parameters to result in portable code, and I see that the template parameter here defaults to dynamic. Is that portable?
>
> Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.

FYI, for the CUDA backend it is also useful for layout::dynamic to be set as default, because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix, as here in the test for the portable interfaces:

https://github.com/intel/llvm-test-suite/blob/b8a6c135726072ab3e8ab07cb73b2c1d81fbafb0/SYCL/Matrix/joint_matrix_tensorcores.cpp#L151
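Illustratively, a backend can combine the layout::dynamic default with a compile-time rejection of dynamic A/B matrices; here is a hedged standalone toy (not the actual implementation, which uses partial specializations):

```c++
#include <cstddef>

enum class use { a, b, accumulator };
enum class layout { row_major, col_major, dynamic };

template <typename T, use Use, std::size_t Rows, std::size_t Cols,
          layout Layout = layout::dynamic>
struct joint_matrix_sketch {
  static_assert(Use == use::accumulator || Layout != layout::dynamic,
                "use::a / use::b require a compile-time layout");
};

joint_matrix_sketch<float, use::accumulator, 16, 16> acc; // default is fine
// joint_matrix_sketch<float, use::a, 16, 16> a_mat;      // would not compile
```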

Contributor:

> Yes, the A/B matrices cannot be dynamic for CUDA: we use partial specializations for all valid joint_matrix declarations in the CUDA backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the CUDA backend.

This restriction isn't conveyed by this spec. Do we expect this specification to describe a common API that includes CUDA?

> FYI, for the CUDA backend it is also useful for layout::dynamic to be set as default, because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix

I agree that this is useful. This could be achieved by defaulting the Layout parameter conditionally, depending on the value of Use.

Contributor:

#6662 (comment) is relevant here, but I'm sure @dkhaldi will explain the details.

dkhaldi (Contributor Author) commented Oct 26, 2022:

Please see under "Layout" subsection:
"Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."

I only talk about what is allowed for AMX and DPAS. This does not convey the restriction about the CUDA backend, because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in the joint_matrix type (dynamic is the default only for matrix C) and in joint_matrix_load (the load that takes layout as an argument is only for the C matrix).

So the restrictions that apply to the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, we would be restricting the Intel backend as well.

@JackAKirk, @gmlueck do you want to proceed differently?

dkhaldi (Contributor Author):

You are right about prioritizing the portable case. But one of the outcomes of that meeting as well is that the restrictions will still be made in the text of the spec, not in the syntax. Not just yet.

If we make the restriction in the syntax, we are excluding the Intel flexibility.
In the text, we should add that for the CUDA backend, dynamic as a default applies only to the accumulator matrix.

One of the reasons for that outcome is that I did not want to make too many changes at a time.
The change you are suggesting (in the syntax) should be made in a separate PR after we discuss that specific issue more.

Contributor:

Can you please add this as an open issue in the list at the bottom of this spec, then? I just want to make sure we do not forget about issues like this.

Contributor:

Here is a full implementation of the agreed interfaces: #7077, with a working CUDA implementation. Note that we still use the layout::dynamic default parameter there, but as I mentioned, the CUDA implementation won't allow a joint_matrix to be constructed with (use::a || use::b) && layout::dynamic.

dkhaldi (Contributor Author):

I will add it to the open issues.
Note that this is not the only non-portable code that is included in this document; we still need to sort out things like the "packed" layout.
Once we have full support of this new API (in a month or so), as mentioned by Jack (#7077), I will open up discussion for the non-portable scenarios that are still in this document.

JackAKirk (Contributor) commented Oct 26, 2022:

> Please see under "Layout" subsection: "Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."
>
> I only talk about what is allowed for AMX and DPAS. This does not convey the restriction about the CUDA backend, because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in the joint_matrix type (dynamic is the default only for matrix C) and in joint_matrix_load (the load that takes layout as an argument is only for the C matrix).
>
> So the restrictions that apply to the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, we would be restricting the Intel backend as well.
>
> @JackAKirk, @gmlueck do you want to proceed differently?

I think this is fine as you describe.
If you think that some information is more appropriate in a CUDA-only extension, an Intel-only extension, or the main matrix extension, then I don't mind moving it, especially since these extensions are all experimental.

I imagine that it is more traditional for extensions to be additive rather than subtractive, but I don't think this is a big issue at the moment: so long as the interfaces written by a user are fully portable across CUDA and Intel for the cases where the hardware allows it (all the practically useful cases), then this is the main thing (note that I don't say "code" here, since DPAS and Tensor Cores don't ever have cases where the parameters provided to the interfaces are portable).

```c++
void joint_matrix_store(Group sg,
    joint_matrix<T, Use, NumRows, NumCols, Layout, Group> &res,
    multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```
Contributor:

Why aren't there two versions of joint_matrix_store, as you have for joint_matrix_load?

dkhaldi (Contributor Author):

No need for two versions. In both Intel and CUDA backends, store must take layout as a dynamic argument.
@JackAKirk please confirm

Contributor:

What if the template parameter of the matrix is something other than dynamic, though? For example, consider the case when Layout is row_major. Does it still make sense to require the user to specify a layout parameter at runtime to joint_matrix_store? What if the layout they specify at runtime is different from the Layout template parameter? Is that OK?

dkhaldi (Contributor Author):

In the CUDA case, you can only store the accumulator matrix.
For the Intel backend, besides C, you can store matrix A or B. In that case, users accept that it is not portable code, so they would use the version of joint_matrix where the A and B layouts are dynamic.

JackAKirk (Contributor) commented Oct 26, 2022:

This is not something we needed to consider in the Nvidia backend, because the Nvidia backend doesn't support native implementations of joint_matrix_store for anything other than the accumulator (and we don't plan on implementing it at all; actually, we couldn't implement it even as a software solution, because it requires knowledge of the mapping of the joint_matrix registers to the original matrix, and this information is not released in Nvidia documentation). My understanding is that the Intel backend may want to support joint_matrix_store for use::a and use::b, and that it can choose whatever interface it wants for this overload of joint_matrix_store. For the accumulator case, the interface that users use for joint_matrix_store is identical for Intel and CUDA in the agreed version; I think this is all that matters. I thought that the use::a/use::b cases might be added to an Intel-specific extension document, similar to what we did for CUDA in #6968? If you want to add it in the main document instead and just note that it isn't supported for CUDA, I think that is fine too.

|Value |Description
|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The exact API used for this implementation is detailed in [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc)
|2 |joint matrix type has a new `use` parameter. `layout` on matrix is optional for `C` matrix. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
|3 |Implementation on Nvidia Tensor Cores
Contributor:

Does this specification also apply to Nvidia, or is that still a separate specification? If it is still separate, can there be a link here to that other spec?

Contributor:

Here is the PR for the sycl_ext_oneapi_matrix_cuda extension that is fully compatible with this extension (sycl_ext_oneapi_matrix) but built on top of it as an optional (cuda only extension) #6968

dkhaldi (Contributor Author):

This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that apply only to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?

Contributor:

> This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that apply only to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?

Yeah sure I can move it to that folder.

dkhaldi (Contributor Author) commented Oct 26, 2022:

> Since you moved the deprecated version of the API spec to "doc/extensions/deprecated", there isn't any need for the "doc/extensions/experimental/sycl_ext_oneapi_matrix" directory. Can you move the spec from:
>
> sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
>
> to:
>
> sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc

The folder is still necessary to add CUDA backend additional specific API (like marray-based element wise operations). see #6968.

```c++
}
```

IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator`
Contributor:

What is the default for use::a and use::b? If there is no default, you should say that.

Note that this does not restrict the use of layout::dynamic, it only changes the default. Your commit comment seems to imply that you want to restrict the use of layout::dynamic. Was that your intention?

dkhaldi (Contributor Author):

There is no default for use::a and use::b; the default value `layout::dynamic` applies only to `use::accumulator`.
In the commit comment, by "restrict the API in this document to only the portable API", I meant that you can only write portable code with this documented API.

Is the note in the text good enough?
"IMPORTANT: Matrix layout defaulting to layout::dynamic applies only to matrix with use::accumulator"

Contributor:

> In the commit comment, by "restrict the API in this document to only the portable API", I meant that you can only write portable code with this documented API.

If you still want to support layout::dynamic for use::a and use::b, then you have a bug in the definition of joint_matrix_load because that function only allows loading a matrix with dynamic layout for use::accumulator.

dkhaldi (Contributor Author) commented Oct 29, 2022:

It is the portable API, so it should not allow layout::dynamic for use::a and use::b.
To support this case, it will be added to the intel-specific extension
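To spell that out, a short sketch of the resulting portable pattern (variable names assumed: `tA`/`tC` are joint_matrix objects declared with use::a and use::accumulator respectively, `pA`/`pC` are multi_ptrs):

```c++
// A and B: the layout comes from the compile-time Layout template
// parameter, so no runtime layout argument is accepted.
joint_matrix_load(sg, tA, pA, K);
// Accumulator: declared with the layout::dynamic default, so the actual
// memory layout is passed at runtime.
joint_matrix_load(sg, tC, pC, N, layout::row_major);
```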

```c++
    multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```
This function stores the data from the 2d tiles back to memory.
Contributor:

With your latest changes, it is only possible to store the Accumulator, not A or B. It would be good to say that here.

dkhaldi (Contributor Author):

Okay
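Putting the pieces together, a hedged sketch of the portable inner loop under the agreed interfaces (names assumed; sub-tile pointer offsets elided):

```c++
joint_matrix_load(sg, tC, pC, N, layout::row_major);
for (std::size_t k = 0; k < K; k += TK) {
  joint_matrix_load(sg, tA, pA /* + tile offset */, K);
  joint_matrix_load(sg, tB, pB /* + tile offset */, N);
  tC = joint_matrix_mad(sg, tA, tB, tC);
}
// Only the accumulator can be stored portably, with a runtime layout.
joint_matrix_store(sg, tC, pC, N, layout::row_major);
```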

@bader bader changed the title update the matrix spec based on new use argument [SYCL][Spec] Update the matrix spec based on new use argument Oct 30, 2022
dkhaldi (Contributor Author) commented Oct 31, 2022:

@intel/dpcpp-specification-reviewers, please help merge

@steffenlarsen steffenlarsen merged commit f2983fc into intel:sycl Oct 31, 2022
steffenlarsen pushed a commit that referenced this pull request Dec 12, 2022
…ce (#7077)

CUDA backend implementation using the "unified" matrix extension
interface. The same interface will be used for a future Intel backend
implementation of the matrix extension.

- New "unified" interface uses SYCL_EXT_ONEAPI_MATRIX_VERSION=4
- `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad` and
`joint_matrix` interfaces match the new spec from
#6662
- Separated `joint_matrix_*` functions into new header
matrix-unified.hpp: Intel backend implementations can be called from the
same functions in the future.
- C++17 everywhere in line with #6678
- Updated device code tests to use new interfaces
- Completely removed uint16 implementations that are replaced by
bfloat16 that is being moved out of the experimental namespace
- Updated all CUDA runtime matrix tests here:
intel/llvm-test-suite#1183

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>