
[SYCL][Spec] Update the matrix spec based on new use argument #6662

Merged: 28 commits, Oct 31, 2022

Conversation

dkhaldi (Contributor Author) commented Aug 29, 2022:

No description provided.


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc,
```

Contributor:

Does Intel hardware support Ta != Tb? I'm surprised if they would ever be different types?

Also, in the CUDA + HIP AMD backends it is not necessary to provide the matrix_layout (doc should be updated to "layout" at some point) template parameter to joint_matrix_mad: it looks like it is always necessary for Intel hardware? If so, I can add it to the CUDA implementation so that the interfaces match.

```c++
template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_store(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
```
Contributor:

Ah I forgot that Intel hardware supports store for non accumulator matrices, I'll update #6657 adding the use template parameter and throw a runtime error if it is not use::accumulator in the CUDA backend.

dkhaldi (Contributor Author):

We remove L from the template arguments of store and replace it with `unused` in `joint_matrix`.


Note that to get maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If the user does not specify the packed layout, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B has already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.

IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
Contributor:

Suggested change:
```diff
-IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
+IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
```

I guess if you are removing packed_b and packed_a this sentence will change like this?
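For illustration, a hedged sketch of what loading a pre-packed B tile could look like under that wording, assuming the rename to `packed` goes through (`tB`, `pB`, and the `N * 2` stride for a 16-bit element type are placeholder assumptions, not spec text):

```c++
// B was pre-packed (VNNIed) by the user; the layout argument tells the
// implementation so, avoiding a slow gather/scatter transform.
joint_matrix_load(sg, tB, pB, N * 2, matrix_layout::packed);
```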


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
```

Contributor:

There is a potential issue here: Rows and Cols have default values, but there isn't a default value for the use parameter. Assuming the Rows/Cols default value is necessary long term (it isn't in the CUDA backend), should we switch the order of the parameters here?

dkhaldi (Contributor Author):

@yubingex007-a11y, "use" implementation seems to override the default values for rows and cols. Can you please check that?
I think `use` should come second in order so we don't have to set a default value for it.

```c++
          matrix_use U, matrix_layout L,
          access::address_space Space>
void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, U, L, Group> &res,
                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
Contributor:

Can you clarify the distinction between the matrix_layout L template parameter and matrix_layout memL parameter here?

JackAKirk (Contributor) commented Sep 1, 2022:

Specifically, do you plan on keeping the matrix_layout L template parameter in joint_matrix_load and joint_matrix_store?
Perhaps you have only kept this in because you think it is necessary for compatibility with the CUDA backend, since I notice that in the latest implementation, #5835, L is set as layout::unused by default?
In the CUDA implementation I took the approach of overloading joint_matrix_load: the first overload was intended to be compatible with the Intel backend, because layout is provided as a normal runtime parameter. This is used in the CUDA backend to load the accumulator matrices.
The second overload was intended to only be used in the CUDA and HIP AMD backends, because it is required for the A/B "multiplicand" matrices, for which the layout is required at compile time. In this overload there is no runtime layout parameter; instead the layouts of the matrices are inferred from the joint_matrix arguments.

In the Intel backend can the layouts of all matrices be specified at runtime? Or is it the case that, as with the CUDA backend, some of them require that the layout is known at compile time?

dkhaldi (Contributor Author):

Same here: we remove L from the template arguments of store and replace it with `unused` in `joint_matrix`.

```c++
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout La, matrix_layout Lb,
          matrix_layout Lc>
joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> A,
+void joint_matrix_mad(Group sg, joint_matrix<Group, Tc, M, N, matrix_use::accumulator, Lc, Group> &D, joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
```

Is it acceptable to make the above change to fully support the CUDA functionality as we discussed and as I implemented here: 4949464

Also I think that the A, B, C matrices should also be passed by reference.

dkhaldi (Contributor Author):

What happens if we just add a new type for the D matrix but keep it as the return value of joint_matrix_mad?

JackAKirk (Contributor) commented Sep 1, 2022:

The issue is that the data type held by the D matrix can differ from the data type held by the C matrix: for example, type C may be fp16 while type D is fp32. Does the Intel backend not have such cases? The only way to deal with this is using information provided by the user via arguments to joint_matrix_mad. The cleanest way to do this is the above change, where the data type of D can be inferred from the joint_matrix D argument. This change allows us to implement two missing mma operations in the CUDA backend that have been added in this commit: e55e5f0.

Another option is an optional output "precision" fourth argument to joint_matrix_mad that specifies the type of D; in any case I don't see how a fourth argument to joint_matrix_mad can be avoided if we want to enable the missing cases introduced in e55e5f0.

The only other option I see is to not allow the type of D to differ from the type of C. I don't see how adding a new type for D matrix and keeping it as a returned argument can affect this in any way.
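For concreteness, a minimal sketch of the by-reference form being discussed, assuming a separate Td template parameter so the D type can be deduced from the argument (names and ordering are illustrative, not the final spec wording):

```c++
// Sketch: Td may differ from Tc (e.g. Tc = half, Td = float); passing D
// by reference lets the implementation deduce Td without a 4th argument.
template <typename Group, typename Ta, typename Tb, typename Tc, typename Td,
          std::size_t M, std::size_t K, std::size_t N,
          matrix_layout La, matrix_layout Lb, matrix_layout Lc>
void joint_matrix_mad(Group sg,
                      joint_matrix<Td, M, N, matrix_use::accumulator, Lc, Group> &D,
                      joint_matrix<Ta, M, K, matrix_use::matrix_a, La, Group> &A,
                      joint_matrix<Tb, K, N, matrix_use::matrix_b, Lb, Group> &B,
                      joint_matrix<Tc, M, N, matrix_use::accumulator, Lc, Group> &C);
```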

dkhaldi (Contributor Author):

We agreed the first resolution here is to add a new template type Td.
Also, Lc, La, and Lb are replaced by `unused`.

dkhaldi (Contributor Author) commented Sep 9, 2022:

@JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?

Contributor:

> @JackAKirk, I thought for the CUDA backend, LA and LB are needed so we should replace them by unused here, right?

Yes, that's right.

dkhaldi (Contributor Author):

I meant we should NOT replace them, sorry.

Contributor:

Yeah, I meant to keep the template parameters but set them to `unused` by default, since I think that in the Intel backend you don't use them and always use joint_matrix with layout::unused? In the CUDA backend they can be inferred from the joint_matrix, so they also don't have to be explicitly specified.

dkhaldi (Contributor Author):

That's right, good

JackAKirk (Contributor) commented Sep 6, 2022:

By the way, I corrected a lot of typos that are still relevant for the new document here: https://github.com/intel/llvm/pull/6525/files

Basically if you ignore the new sections I added there (#### Binary Multiply and Add and ##### wi_data as an marray for Nvidia® Tensor Cores) everything else is just small corrections that you could consider adding here.

dkhaldi (Contributor Author) commented Sep 9, 2022:

> By the way, I corrected a lot of typos that are still relevant for the new document here: https://github.com/intel/llvm/pull/6525/files
>
> Basically if you ignore the new sections I added there (#### Binary Multiply and Add and ##### wi_data as an marray for Nvidia® Tensor Cores) everything else is just small corrections that you could consider adding here.

Yes, I will add these corrections here. Thanks.

@dkhaldi dkhaldi marked this pull request as ready for review September 13, 2022 14:34
@dkhaldi dkhaldi requested a review from a team as a code owner September 13, 2022 14:34
gmlueck (Contributor) commented Sep 13, 2022:

I have a question about the old specification, which is now named "sycl_ext_oneapi_deprecated_matrix_no_use". Is that version of the API still supported? Does (or will) the implementation print a warning message for applications that use the old API?

Or, is the old API no longer supported, and it is (or will be) removed?

Both options are OK for an experimental extension like this. However, we should treat the documentation differently in the two cases.

IMPORTANT: In the current implementation, only the static extent is supported

#### Use
Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
Contributor:

Suggested change:
```diff
-Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters.
+Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator (C) is required by backend implementations to reason about the layout of the matrix in registers.
```

Also (C) is rendered as the copyright sign so this needs to be corrected somehow.

```c++
enum class use {
  a,
  b,
  accumulator,
```
Contributor:

Suggested change:
```diff
-accumulator,
+accumulator
```

```c++
  row_major,
  col_major,
  packed,
  unused,
```
Contributor:

Suggested change:
```diff
-unused,
+unused
```



#### Layout
Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
Contributor:

It isn't clear here as a reader how "symmetric" or "tiled" layouts relate to the values of the enum class "layout" below.

dkhaldi (Contributor Author):

I will remove them

```c++
}
```

IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
JackAKirk (Contributor) commented Sep 13, 2022:

Suggested change:
```diff
-IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified
+IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified
```

```c++
}
```

IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
Contributor:

Suggested change:
```diff
-IMPORTANT: In both AMX and DPAS support, `use` template parameter is required
+IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required
```

and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS_

## Introduction
This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
JackAKirk (Contributor) commented Sep 13, 2022:

Suggested change:
```diff
-This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
+This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different matrix hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
```

I personally think it is better to avoid using "tensor" apart from where absolutely necessary (i.e., for Tensor Cores), because the word "tensor" means something different from "matrix": a matrix is a tensor if there exist transformational rules for the matrix elements with respect to a coordinate system, and this property that distinguishes tensors is actually not relevant at all for the hardware operations and this extension more generally. A (any-dimensional) matrix is not necessarily a tensor, although a matrix can also be a tensor. In situations where the matrices operated on are either all tensors or all not tensors, it doesn't make much difference whether we refer to "matrices" as "tensors", but for applications where both (non-tensor) "matrices" and "tensors" are used, proper use of language becomes important.

I'm not sure where the trend of replacing (any-dimensional) "matrix" with "tensor" originated, perhaps it was when "tensorflow" sounded better than "matrixflow", but I don't think there is really a reason for adopting this trend in this extension, particularly since we have adopted the naming conventions of joint_matrix etc. rather than e.g. joint_tensor.

dkhaldi (Contributor Author):

A matrix is technically a 2d tensor, and they are both mathematical objects.
tensorflow supports more than 2 dimensions; that's why I believe it is not called matrixflow.

I tend to refer to the hardware as tensor hardware because even though they are matrix hardware (supporting only 2 dimensions), they enable n-dimensional tensor computing (if the code is inside loops and such). Also, they are generally called TPUs in the literature; that's why the Nvidia tensor cores are called as such, for instance. Habana uses tensor cores terminology as well.

Of course, we should refer to this extension as matrix because it enables only 2d computations.

JackAKirk (Contributor) commented Sep 15, 2022:

A matrix is only a (representation of a) 2-tensor (in a given basis) if it has a basis-independent meaning, likewise for a 3d matrix (3-array) etc. I personally think that Tensor Processing Unit is a terrible name for the literature to have adopted for processing units that perform basis-dependent computations on 2-dimensional arrays (matrices) that may be part of n-dimensional arrays (n-dimensional matrices) for n >= 3, which may or may not be part of a program that grants the n-matrix a tensorial nature. Here is an article that mentions the question of over-use of tensor in ML, https://www.linkedin.com/pulse/tensorsand-applications-ml-demystified-andriy-strepetilov/, with a mostly reasonable description of the meaning of tensor. It's not worth worrying about too much, but I guess for a ML algorithm you can probably define a conservation law (conserving the output) and then define a geometry (a set of transformations that satisfy the conservation law) for the inputs such that the inputs are invariant with respect to those transformations (then they are tensors). So maybe "tensor"flow is justified that way. But for the TPUs I don't see how using "tensor" leads to anything but confusion personally. If we adopt "tensor" to mean n-matrix (n-array), then what will we call something that has tensorial properties?
Probably it doesn't matter at this point because, as you say, TPU appears to have become accepted as the general term in the literature. But when a word is misappropriated in society, meaning tends to be lost, and society can become poorer for it.


IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
Contributor:

Suggested change:
```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: Since the current Tensor Cores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores AOT backend.
```

Actually there are some aspects of the Tensor Cores backend that are not AOT, since there is JIT compilation from the ptx level to SASS level when the code is run. I think this would be more accurate and simpler personally:

Suggested change:
```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: It is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.
```

dkhaldi (Contributor Author):

AOT is used here to reflect the SYCL compilation flow: if it does not use SPIR-V and the target is specified at compile time, this is basically AOT compilation for SYCL. But I will change the sentence to the one that does not talk about the AOT detail.
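As a rough illustration of the wi_data idea discussed above (a sketch assuming the `get_wi_data()`/`length()` element-wise API described in this spec; `sub_c` and `alpha` are placeholder names):

```c++
// Each work-item scales only the matrix elements it owns; the mapping of
// elements to work-items is implementation-defined.
auto wi_data_c = sub_c.get_wi_data();
for (int i = 0; i < wi_data_c.length(); ++i)
  wi_data_c[i] *= alpha;
```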



## VNNI/Packed Layout
Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
Contributor:

Suggested change:
```diff
-Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
+Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
```


The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
Contributor:

Suggested change:
```diff
-The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
```
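To make the transform concrete, here is a hedged host-side sketch of the row_major-to-packed repacking for a 16-bit type (the function name and loop structure are illustrative, not part of the extension):

```c++
#include <cstddef>
#include <cstdint>

// Pack a K x N row-major B matrix into VNNI ("packed") form for 16-bit
// types: vnniFactor consecutive K-rows are interleaved so that 32 bits of
// K-data become contiguous in memory. vnniFactor would be 4 for 8-bit types.
void pack_b_vnni(const std::uint16_t *src, std::uint16_t *dst,
                 std::size_t K, std::size_t N) {
  const std::size_t vnniFactor = 2; // 16-bit elements
  for (std::size_t k = 0; k < K; k += vnniFactor)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t v = 0; v < vnniFactor; ++v)
        dst[(k / vnniFactor) * (N * vnniFactor) + n * vnniFactor + v] =
            src[(k + v) * N + n];
}
```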

```c++
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
Int32_t *memC = malloc_shared<int32_t>(M*N, q);
```
Contributor:

Suggested change:
```diff
-Int32_t *memC = malloc_shared<int32_t>(M*N, q);
+int32_t *memC = malloc_shared<int32_t>(M*N, q);
```


- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.

- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
Contributor:

Suggested change:
```diff
-- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
+- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
```
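As a rough usage sketch of the two query forms (the `tpu_params`/`tpu` spellings are assumed from the query section of this extension and may not match the final names; the `combinations` member is the array described above):

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// Default-values form: the element types are given and the sizes are left
// to the implementation; the joint_matrix_a/b/c aliases pick the defaults.
using myparams = tpu_params<tpu::amx, int8_t, int8_t, int>;
myparams::joint_matrix_a<sub_group> tA; // alias usage illustrative

// General-query form: only the TPU is named; the supported size/type
// combinations can then be inspected, e.g. via
// generic_params.combinations[0].max_msize (AMX reports the max_* variant).
constexpr tpu_params<tpu::amx> generic_params;
```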

|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C
Contributor:

Should this refer to joint_matrix_accumulator?

dkhaldi (Contributor Author):

This is a good question. What do you think @gmlueck ?

Contributor:

You have this inconsistency throughout the spec. In many places, you refer to the three matrices as a, b, and c. However, the use enum refers to them as a, b, and accumulator. You should decide on one or the other and then use it consistently throughout. Since you call the first two matrices a and b, it seems natural to me to call the third matrix c. Is there a good reason to call the third matrix accumulator?

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
+joint_matrix<Tc, use::accumulator, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
```

```c++
template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N,
          layout LayoutA, layout LayoutB>
joint_matrix<Group, use::accumulator, Td, M, N, layout::dynamic, Group> joint_matrix_mad(Group sg,
    joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
```
Contributor:

Suggested change:
```diff
-joint_matrix<Ta, use::a, M, K, layoutA, Group> A,
+joint_matrix<Ta, use::a, M, K, layoutA, Group> &A,
```

etc

```c++
    joint_matrix<T, Use, NumRows, NumCols, layout::dynamic, Group> &res,
    multi_ptr<T, Space> src, size_t stride, layout memL);

template <typename Group, typename T, size_t NumRows, size_t NumCols,
```
Contributor:

Suggested change:
```diff
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
+template <typename Group, typename S, typename T, size_t NumRows, size_t NumCols,
+          use Use, layout Layout, access::address_space Space>
+void joint_matrix_load(Group sg,
+                       joint_matrix<S, Use, NumRows, NumCols, Layout, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride);
```

We need typename S as an implementation detail so we can support const T cases via `std::enable_if_t<std::is_same<S, std::remove_const_t<T>>::value>`.
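A minimal standalone sketch of that SFINAE pattern (illustrative only, outside of the actual headers):

```c++
#include <type_traits>

// Accept a source element type T that may be a const-qualified version
// of the destination type S, and nothing else.
template <typename S, typename T,
          typename = std::enable_if_t<
              std::is_same<S, std::remove_const_t<T>>::value>>
void load_sketch(S *dst, T *src) { *dst = *src; }

// load_sketch(&d, &s) compiles for float *dst with const float *src,
// but substitution fails for unrelated types.
```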

dkhaldi (Contributor Author) commented Oct 17, 2022:

@intel/dpcpp-doc-reviewers, any more reviews on this?
It will be good if we can merge this ASAP.

@JackAKirk JackAKirk self-requested a review October 17, 2022 19:25
dkhaldi (Contributor Author) commented Oct 21, 2022:

@intel/dpcpp-doc-reviewers, can you please help merge this if there are no more comments?

pvchupin (Contributor):

@dkhaldi, please move deprecated extension to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/deprecated

dkhaldi (Contributor Author) commented Oct 21, 2022:

> @dkhaldi, please move deprecated extension to https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions/deprecated

done

dkhaldi (Contributor Author) commented Oct 25, 2022:

@intel/dpcpp-doc-reviewers, please help merge.

pvchupin (Contributor):

Ping @intel/dpcpp-specification-reviewers

gmlueck (Contributor) commented Oct 25, 2022:

> @intel/dpcpp-doc-reviewers, please help merge.

Sorry for the delay. This is the next thing on my review list.

gmlueck (Contributor) left a comment:

Some questions below, but also one general comment:

Since you moved the deprecated version of the API spec to "doc/extensions/deprecated", there isn't any need for the "doc/extensions/experimental/sycl_ext_oneapi_matrix" directory. Can you move the spec from:

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

to:

sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc

This also means that you can undo the changes to "ReleaseNotes.md".

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, use Use,
          size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
          layout Layout = layout::dynamic, typename Group = sub_group>
```
Contributor:

Can you remind me of the rules for using Layout if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

We want the default template parameters to result in portable code, and I see that the template parameter here defaults to dynamic. Is that portable?

Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.

Contributor:

> Can you remind me of the rules for using Layout if you want portable code that works on both Intel and CUDA? I thought the "A" and "B" matrices couldn't be dynamic if you wanted to run on CUDA, is that right?

Yes, the A/B matrices cannot be dynamic for CUDA: we use partial specializations for all valid joint_matrix declarations in the CUDA backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the CUDA backend.

> We want the default template parameters to result in portable code, and I see that the template parameter here defaults to dynamic. Is that portable?
>
> Don't we also want a description somewhere in this spec explaining how to use "Layout" if you want portable code? I don't see anything like that now.

FYI, for the CUDA backend it is also useful for layout::dynamic to be set as default, because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix, as here in the test for the portable interfaces:

https://github.com/intel/llvm-test-suite/blob/b8a6c135726072ab3e8ab07cb73b2c1d81fbafb0/SYCL/Matrix/joint_matrix_tensorcores.cpp#L151
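Illustratively, a backend can combine the layout::dynamic default with a compile-time rejection of dynamic A/B matrices; here is a hedged standalone toy (not the actual implementation, which uses partial specializations):

```c++
#include <cstddef>

enum class use { a, b, accumulator };
enum class layout { row_major, col_major, dynamic };

template <typename T, use Use, std::size_t Rows, std::size_t Cols,
          layout Layout = layout::dynamic>
struct joint_matrix_sketch {
  static_assert(Use == use::accumulator || Layout != layout::dynamic,
                "use::a / use::b require a compile-time layout");
};

joint_matrix_sketch<float, use::accumulator, 16, 16> acc; // default is fine
// joint_matrix_sketch<float, use::a, 16, 16> a_mat;      // would not compile
```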

Contributor:

> Yes, the A/B matrices cannot be dynamic for CUDA: we use partial specializations for all valid joint_matrix declarations in the CUDA backend, so it isn't possible to construct such a use::a/use::b joint_matrix with layout::dynamic in the CUDA backend.

This restriction isn't conveyed by this spec. Do we expect this specification to describe a common API that includes CUDA?

> FYI, for the CUDA backend it is also useful for layout::dynamic to be set as default, because it means users don't even need to provide a layout template argument when constructing an accumulator joint_matrix

I agree that this is useful. This could be achieved by defaulting the Layout parameter conditionally, depending on the value of Use.

Contributor:

#6662 (comment) is relevant here, but I'm sure @dkhaldi will explain the details.

dkhaldi (Contributor Author) commented Oct 26, 2022:

Please see under "Layout" subsection:
"Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."

I only talk about what is allowed for AMX and DPAS. This does not convey the restriction about the CUDA backend, because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in the joint_matrix type (dynamic is the default only for matrix C) and in joint_matrix_load (the load that takes layout as an argument is only for the C matrix).

So the restrictions that apply to the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, we would be restricting the Intel backend as well.

@JackAKirk, @gmlueck do you want to proceed differently?

dkhaldi (Contributor Author):

You are right about prioritizing the portable case. But one of the outcomes of that meeting as well is that the restrictions will still be made in the text of the spec, not in the syntax. Not just yet.

If we make the restriction in the syntax, we are excluding the Intel flexibility.
In the text, we should add that for the CUDA backend, dynamic as a default applies only to the accumulator matrix.

One of the reasons for that outcome is that I did not want to make too many changes at a time.
The change you are suggesting (in the syntax) should be made in a separate PR after we discuss that specific issue more.

Contributor:

Can you please add this as an open issue in the list at the bottom of this spec, then? I just want to make sure we do not forget about issues like this.

Contributor:

Here is a full implementation of the agreed interfaces: #7077, with a working CUDA implementation. Note that we still use the layout::dynamic default parameter there, but as I mentioned, the CUDA implementation won't allow a joint_matrix to be constructed with (use::a || use::b) && layout::dynamic.

dkhaldi (Contributor Author):

I will add it to the open issues.
Note that this is not the only non-portable code that is included in this document; we still need to sort out things like the "packed" layout.
Once we have full support of this new API (in a month or so), as mentioned by Jack (#7077), I will open up discussion for the non-portable scenarios that are still in this document.

JackAKirk (Contributor) commented Oct 26, 2022:

> Please see under "Layout" subsection: "Important | In both AMX and DPAS support, the layout template parameter can be dynamic for Matrix A and B as well."
>
> I only talk about what is allowed for AMX and DPAS. This does not convey the restriction about the CUDA backend, because I expect Jack to update this file after it is merged to add the CUDA backend restrictions in the joint_matrix type (dynamic is the default only for matrix C) and in joint_matrix_load (the load that takes layout as an argument is only for the C matrix).
>
> So the restrictions that apply to the CUDA backend will be explained in the specification text by Jack and not imposed in the syntax. If imposed in the syntax, we would be restricting the Intel backend as well.
>
> @JackAKirk, @gmlueck do you want to proceed differently?

I think this is fine as you describe.
If you think that some information is more appropriate in a CUDA-only extension, an Intel-only extension, or the main matrix extension, then I don't mind moving it, especially since these extensions are all experimental.

I imagine that it is more traditional for extensions to be additive rather than subtractive, but I don't think this is a big issue at the moment: so long as the interfaces written by a user are fully portable across CUDA and Intel for the cases where the hardware allows it (all the practically useful cases), then this is the main thing (note that I don't say "code" here, since DPAS and Tensor Cores don't ever have cases where the parameters provided to the interfaces are portable).

```c++
void joint_matrix_store(Group sg,
    joint_matrix<T, Use, NumRows, NumCols, Layout, Group> &res,
    multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```
Contributor:

Why aren't there two versions of joint_matrix_store, as you have for joint_matrix_load?

dkhaldi (Contributor Author):

No need for two versions. In both Intel and CUDA backends, store must take layout as a dynamic argument.
@JackAKirk please confirm

Contributor:

What if the template parameter of the matrix is something other than dynamic, though? For example, consider the case when Layout is row_major. Does it still make sense to require the user to specify a layout parameter at runtime to joint_matrix_store? What if the layout they specify at runtime is different from the Layout template parameter? Is that OK?

dkhaldi (Contributor Author):

In the CUDA case, you can only store the accumulator matrix.
For the Intel backend, besides C, you can store matrix A or B. In that case, users accept that it is not portable code, so they would use the version of joint_matrix where the A and B layouts are dynamic.

JackAKirk (Contributor) commented Oct 26, 2022:

This is not something we needed to consider in the Nvidia backend, because the Nvidia backend doesn't support native implementations of joint_matrix_store for anything other than the accumulator (and we don't plan on implementing it at all; actually, we couldn't implement it even as a software solution, because it requires knowledge of the mapping of the joint_matrix registers to the original matrix, and this information is not released in Nvidia documentation). My understanding is that the Intel backend may want to support joint_matrix_store for use::a and use::b, and that it can choose whatever interface it wants for this overload of joint_matrix_store. For the accumulator case, the interface that users use for joint_matrix_store is identical for Intel and CUDA in the agreed version; I think this is all that matters. I thought that the use::a/use::b cases might be added to an Intel-specific extension document, similar to what we did for CUDA in #6968? If you want to add it in the main document instead and just note that it isn't supported for CUDA, I think that is fine too.

|Value |Description
|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The exact API used for this implementation is detailed in [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc)
|2 |joint matrix type has a new `use` parameter. `layout` on matrix is optional for `C` matrix. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
|3 |Implementation on Nvidia Tensor Cores
Contributor:

Does this specification also apply to Nvidia, or is that still a separate specification? If it is still separate, can there be a link here to that other spec?

Contributor:

Here is the PR for the sycl_ext_oneapi_matrix_cuda extension that is fully compatible with this extension (sycl_ext_oneapi_matrix) but built on top of it as an optional (cuda only extension) #6968

dkhaldi (Contributor Author):

This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that apply only to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?

Contributor:

> This specification does apply to CUDA. I expect Jack to add some of the restrictions/details that are missing once this is merged. #6968 contains the additional functions that apply only to the CUDA backend. @JackAKirk, the file you add in #6968 should be included in the new matrix folder, right?

Yeah sure I can move it to that folder.

dkhaldi (Contributor Author) commented Oct 26, 2022:

> Since you moved the deprecated version of the API spec to "doc/extensions/deprecated", there isn't any need for the "doc/extensions/experimental/sycl_ext_oneapi_matrix" directory. Can you move the spec from:
>
> sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
>
> to:
>
> sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc

The folder is still necessary to add CUDA backend additional specific API (like marray-based element wise operations). see #6968.

```c++
}
```

IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator`
Contributor:

What is the default for use::a and use::b? If there is no default, you should say that.

Note that this does not restrict the use of layout::dynamic, it only changes the default. Your commit comment seems to imply that you want to restrict the use of layout::dynamic. Was that your intention?

dkhaldi (Contributor Author):

There is no default for use::a and use::b; the default value `layout::dynamic` applies only to `use::accumulator`.
In the commit comment, by "restrict the API in this document to only the portable API", I meant that you can only write portable code with this documented API.

Is the note in the text good enough?
"IMPORTANT: Matrix layout defaulting to layout::dynamic applies only to matrix with use::accumulator"

Contributor:

> In the commit comment, by "restrict the API in this document to only the portable API", I meant that you can only write portable code with this documented API.

If you still want to support layout::dynamic for use::a and use::b, then you have a bug in the definition of joint_matrix_load because that function only allows loading a matrix with dynamic layout for use::accumulator.

dkhaldi (Contributor Author) commented Oct 29, 2022:

It is the portable API, so it should not allow layout::dynamic for use::a and use::b.
To support this case, it will be added to the intel-specific extension
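To spell that out, a short sketch of the resulting portable pattern (variable names assumed: `tA`/`tC` are joint_matrix objects declared with use::a and use::accumulator respectively, `pA`/`pC` are multi_ptrs):

```c++
// A and B: the layout comes from the compile-time Layout template
// parameter, so no runtime layout argument is accepted.
joint_matrix_load(sg, tA, pA, K);
// Accumulator: declared with the layout::dynamic default, so the actual
// memory layout is passed at runtime.
joint_matrix_load(sg, tC, pC, N, layout::row_major);
```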

```c++
    multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
}
```
This function stores the data from the 2d tiles back to memory.
Contributor:

With your latest changes, it is only possible to store the Accumulator, not A or B. It would be good to say that here.

dkhaldi (Contributor Author):

Okay
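Putting the pieces together, a hedged sketch of the portable inner loop under the agreed interfaces (names assumed; sub-tile pointer offsets elided):

```c++
joint_matrix_load(sg, tC, pC, N, layout::row_major);
for (std::size_t k = 0; k < K; k += TK) {
  joint_matrix_load(sg, tA, pA /* + tile offset */, K);
  joint_matrix_load(sg, tB, pB /* + tile offset */, N);
  tC = joint_matrix_mad(sg, tA, tB, tC);
}
// Only the accumulator can be stored portably, with a runtime layout.
joint_matrix_store(sg, tC, pC, N, layout::row_major);
```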

@bader bader changed the title update the matrix spec based on new use argument [SYCL][Spec] Update the matrix spec based on new use argument Oct 30, 2022
dkhaldi (Contributor Author) commented Oct 31, 2022:

@intel/dpcpp-specification-reviewers, please help merge

@steffenlarsen steffenlarsen merged commit f2983fc into intel:sycl Oct 31, 2022
steffenlarsen pushed a commit that referenced this pull request Dec 12, 2022
…ce (#7077)

CUDA backend implementation using the "unified" matrix extension
interface. The same interface will be used for a future Intel backend
implementation of the matrix extension.

- New "unified" interface uses SYCL_EXT_ONEAPI_MATRIX_VERSION=4
- `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad` and
`joint_matrix` interfaces match the new spec from
#6662
- Separated `joint_matrix_*` functions into new header
matrix-unified.hpp: Intel backend implementations can be called from the
same functions in the future.
- C++17 everywhere in line with #6678
- Updated device code tests to use new interfaces
- Completely removed uint16 implementations that are replaced by
bfloat16 that is being moved out of the experimental namespace
- Updated all CUDA runtime matrix tests here:
intel/llvm-test-suite#1183

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>