From d2a898a8d75ec353d7af5971419b81acab4e2e5e Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 29 Aug 2022 13:02:09 -0700 Subject: [PATCH 01/27] update the matrix spec based on new use argument --- .../{ => sycl_ext_oneapi_matrix}/sycl_ext_oneapi_matrix.asciidoc | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename sycl/doc/extensions/experimental/{ => sycl_ext_oneapi_matrix}/sycl_ext_oneapi_matrix.asciidoc (100%) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc similarity index 100% rename from sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc rename to sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc From 2e98eb6d14dcb83fd57d7f28044f3a5ed7e2c816 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 29 Aug 2022 13:04:30 -0700 Subject: [PATCH 02/27] update the matrix spec based on new use argument --- .../sycl_ext_oneapi_matrix.asciidoc | 146 +++++++++--------- 1 file changed, 70 insertions(+), 76 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 71affb2cbb255..903e614721760 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -33,11 +33,10 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. 
We are going to work with the community on incrementally improving -the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support for the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS._ ## Introduction -This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. +This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, and IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware units can offer. ## Feature test macro @@ -52,39 +51,42 @@ value to determine which of the extension's APIs the implementation supports. [frame="none",options="header"] |====================== |Value |Description -|1 |Initial extension implementation on Intel AMX. Base features are supported. -|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported +|1 |Initial extension JIT implementation on Intel AMX and DPAS.
load, store, mad, fill, piece-wise operations, and the query interface are supported +|2 |joint matrix type has a new `use` parameter. `layout` on matrix is unused. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported |====================== ## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results into the following description: +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the matrix use, the memory layout, and the memory scope of the matrix. This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + matrix_use use, matrix_layout Layout = matrix_layout::unused, typename Group = sub_group> struct joint_matrix { joint_matrix(Group g) {} }; } ``` +#### Shape +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and where sizes are variable (CPU case). Note that an Intel AMX 2d tile register permits sizes up to 1024 (16 rows x 64 cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. -#### Memory Scope -In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
- -IMPORTANT: In the current implementation, only the subgroup scope is supported -When the group is a `sycl::sub_group`, a matrix is declared as follows: +#### Use +Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator (C) is required by backend implementations to reason about the layout of the matrix in registers. ```c++ -joint_matrix tA(sg); +namespace sycl::ext::oneapi::experimental::matrix { +enum class matrix_use { + a, + b, + accumulator, +}; +} ``` -#### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. - -IMPORTANT: In the current implementation, only the static extent is supported +IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required #### Layout @@ -95,29 +97,34 @@ namespace sycl::ext::oneapi::experimental::matrix { enum class matrix_layout { row_major, col_major, - packed_a, - packed_b + packed, + unused, }; } ``` -Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape.
The layout information for Intel AMX should be specified in user code as follows: -```c++ -joint_matrix tB(sg); -``` -IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. +IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified. +#### Memory Scope +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. + +IMPORTANT: In the current implementation, only the subgroup scope is supported +When the group is a `sycl::sub_group`, a matrix is declared as follows: + +```c++ +joint_matrix tA(sg); +``` ## Matrix Operations and their Execution Scope We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. -The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed`).
`stride` describes the number of elements between consecutive rows for the row major and packed layouts, and between consecutive columns for the column major layout. -Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. -IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. +IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow.
The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. @@ -129,10 +136,10 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout MemLayout); + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout memL); } ``` This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. @@ -142,9 +149,9 @@ This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group sg, joint_matrix &res, + void joint_matrix_store(Group sg, joint_matrix &res, multi_ptr src, size_t stride, matrix_layout memL); } ``` @@ -158,8 +165,8 @@ namespace sycl::ext::oneapi::experimental::matrix { std::size_t M, std::size_t K, std::size_t N, matrix_layout La, matrix_layout Lb, matrix_layout Lc> - joint_matrix joint_matrix_mad(Group sg, joint_matrix A, - joint_matrix B, joint_matrix C); + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, + joint_matrix B, joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. @@ -171,8 +178,8 @@ The current interface presented above assumes that all the matrices are directly ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + matrix_use U, matrix_layout L, typename Tv> + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` IMPORTANT: In the current implementation, only the subgroup scope is supported. @@ -211,18 +218,18 @@ The code listing below shows a synopsis of these new APIs. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template struct joint_matrix { - wi_data get_wi_data(); + wi_data get_wi_data(); }; -template +template class wi_data { size_t length(); - wi_element operator[](size_t i); + wi_element operator[](size_t i); }; template class wi_element { operator T(); @@ -304,15 +311,13 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) const auto sg_startx = global_idx - item.get_local_id(0); const auto sg_starty = global_idy - item.get_local_id(1); sub_group sg = item.get_sub_group(); - joint_matrix tA(sg); - // For B, since current implementation does not support non packed layout, - // users need to specify the packed_b layout - joint_matrix tB(sg); - joint_matrix tC(sg); + joint_matrix tA(sg); + joint_matrix tB(sg); + joint_matrix tC(sg); joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tk) { joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major); - joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed); // VNNI tC = joint_matrix_mad(sg, tA, tB, tC); } auto wi_data_c = matC.get_wi_data(); @@ -400,12 +405,12 @@ struct tpu_params< static constexpr std::size_t defaultK = (K != 0) ? K : ((sizeof(Ta) == 1) ? 
64 : 32); - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_c = joint_matrix; + template + using joint_matrix_a = joint_matrix; + template + using joint_matrix_b = joint_matrix; + template + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations // because Intel AMX hardware supports dynamic sizes @@ -450,12 +455,12 @@ struct tpu_params - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_c = joint_matrix; + template + using joint_matrix_a = joint_matrix; + template + using joint_matrix_b = joint_matrix; + template + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes @@ -562,7 +567,7 @@ size_t NDRangeN = N / myparams::defaultN; //if M,N,K do not multiply the default sizes, padding has to be done // device code: the matrices are constructed using the default dimensions myparams::joint_matrix_a sub_a(sg); -myparams::joint_matrix_b sub_b(sg); +myparams::joint_matrix_b sub_b(sg); myparams::joint_matrix_c sub_c(sg); ``` @@ -577,23 +582,12 @@ constexpr int msize_remainder = break_dimension_remainder(params, M); constexpr int nsize = params.combinations[0].nsize; constexpr int ksize = params.combinations[0].ksize; // device code: -joint_matrix sub_a(sg); -joint_matrix sub_b(sg); -joint_matrix sub_c(sg); +joint_matrix sub_a(sg); +joint_matrix sub_b(sg); +joint_matrix sub_c(sg); //Remainder handling ``` -//No don't need to provide more details in this section because the query interface can serve this. - -//## Implementation Status - -//### oneAPI 2022.0 release -//For oneAPI 2022.0 release, a JIT implementation has been made available on both Intel AMX and DPAS hardware of the specific features discussed above. 
In this case, there is no need to specify any architectural options to the command line. The static query interface can be used to guide the usage of this API. -// The DPAS and Intel AMX implementations support the logical capability support of the HW - - - - ## Future-looking API ### Memory scope @@ -601,7 +595,7 @@ The current experimental API uses `joint_` semantics to define the memory scope ```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); +multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); ``` We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. @@ -640,7 +634,6 @@ for (int i = 0; i < length; ++i) { ## TODO List - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. -- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform - Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K - Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups - Add a more realistic and complete example that shows the value of the general query @@ -654,4 +647,5 @@ for (int i = 0; i < length; ++i) { |1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS |3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support +|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation |====================== From 45564bd9c1b1f8d5161b60c3321138675ad4f567 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 29 Aug 2022 13:09:37 -0700 Subject: [PATCH 03/27] - Deprecate the old spec (no use) but keep document with no change until implementation of the new one is stable --- ...t_oneapi_deprecated_matrix_no_use.asciidoc | 657 ++++++++++++++++++ 1 file changed, 657 insertions(+) create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc new file mode 100644 index 0000000000000..71affb2cbb255 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc @@ -0,0 +1,657 @@ +# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2021 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. 
+used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 3 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +**_NOTE:_** _This document describes the current design and API for the matrix +extension to {dpcpp}. This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving +the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ + +## Introduction +This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. + +## Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification section 6.3.3 "Feature test macros". Therefore, an +implementation supporting this extension must predefine the macro +`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below. +Applications can test for the existence of this macro to determine if the +implementation supports this feature, or applications can test the macro's +value to determine which of the extension's APIs the implementation supports. 
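As a minimal sketch of how an application might branch on this macro (the helper name `matrix_api_version` is hypothetical and not part of the extension):

```c++
// Hypothetical helper: reports which version of the matrix extension the
// implementation predefines, or 0 when the extension is absent.
constexpr int matrix_api_version() {
#if defined(SYCL_EXT_ONEAPI_MATRIX)
  return SYCL_EXT_ONEAPI_MATRIX; // one of the values from the table below
#else
  return 0; // extension not available; avoid using the joint_matrix APIs
#endif
}
```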
+ +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Initial extension implementation on Intel AMX. Base features are supported. +|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported +|====================== + +## New `joint_matrix` class +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results into the following description: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +struct joint_matrix { + joint_matrix(Group g) {} +}; +} +``` + + +#### Memory Scope +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. + +IMPORTANT: In the current implementation, only the subgroup scope is supported + +When the group is a `sycl::sub_group`, a matrix is declared as follows: + +```c++ +joint_matrix tA(sg); +``` + +#### Shape +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. + +IMPORTANT: In the current implementation, only the static extent is supported + + +#### Layout +Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. 
+ +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +enum class matrix_layout { + row_major, + col_major, + packed_a, + packed_b +}; +} +``` + +Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for Intel AMX should be specified in user code as follows: + +```c++ +joint_matrix tB(sg); +``` +IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. + + + +## Matrix Operations and their Execution Scope +We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. + +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. + +Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A or B have already been VNNIed. 
The packed or VNNI layout is introduced in `VNNI layout` section below. + +IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. + +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. + +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: + +IMPORTANT: In the current implementation, only the subgroup scope is supported. + +#### Load +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout MemLayout); +} +``` +This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. + + +#### Store +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template + void joint_matrix_store(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout memL); +} +``` +This function stores the data from the 2d tiles back to memory. 
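The `stride` argument taken by the load/store functions above can be read as a plain element offset. A standalone C++ sketch of the addressing it implies (`element_offset` is a hypothetical name, for illustration only, not part of the extension):

```c++
#include <cstddef>

// Hypothetical illustration of `stride`: for the row_major and packed layouts
// it is the number of elements between consecutive rows; for col_major, the
// number of elements between consecutive columns. Offset of element
// (row, col) from the base pointer:
inline std::size_t element_offset(std::size_t row, std::size_t col,
                                  std::size_t stride, bool is_row_major) {
  return is_row_major ? row * stride + col : col * stride + row;
}
```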
+ +#### Multiply and Add + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, + joint_matrix B, joint_matrix C); +} +``` +The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. + + +#### Matrix Initialization: `joint_matrix_fill` +The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); +} +``` +IMPORTANT: In the current implementation, only the subgroup scope is supported. + +#### Element Indexing and Piece-Wise Operations +##### Background +Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: + +Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples. + +Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. 
+ +// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. + +##### Explicit conversion with mapping from SIMD to SPMD +The data elements in a joint_matrix are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. + +We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. So modifying `wi_data` means also modifying the joint matrix corresponding elements. The indexing provided inside the `wi_data` class acesses only the portion of the current WI and returns `wi_element`. 
This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. + +Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and change from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping coordinates information must be known to reason about the matrix view and extract the relevant piece. But for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. + +Therefore, this extension currently only supports class 1 of operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from `joint_matrix` domain to WI Domain (See Section "WI data to joint matrix mapping coordinates information for piece-wise operations for more information"). + +Also, note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons: + +1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. + +2- SG size is not fixed (like in the CUDA backend where warp size is always 32). + +3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`). + +In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, the additional marray capability will be provided. 
+

The code listing below shows a synopsis of these new APIs.

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout, typename Group>
struct joint_matrix {
  wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
};
template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout, typename Group>
class wi_data {
  size_t length();
  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
};
template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout, typename Group>
class wi_element {
  operator T();
  wi_element &operator=(const T &rhs);
  …
};
}
```

In the following example `wi_data_c` is a reference to the WI-owned portion of the joint matrix `matC`. As such, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
Vectorization along the subgroup dimension will be enabled automatically to vectorize the contiguous portion of the matrix.


```c++
auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
  wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C
```

IMPORTANT: In the current implementation, only the subgroup scope is supported.

IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.


## VNNI/Packed Layout
Intel AMX and DPAS compute assume the register for the B tile (src1) to be in VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type. 
+

#### Example 1: 16-bit elements
    // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
    // Element a1 is contiguous in memory with element b1, etc.
    // ---------------------------------
    // a1, b1, c1, d1
    // a2, b2, c2, d2
    // a3, b3, c3, d3
    // a4, b4, c4, d4
    // ---------------------------------
    // The same matrix reformatted in packed_b layout.
    // Here, packing of 2 elements is needed to form 32 bits.
    // Element a1 is contiguous in memory with element a2, etc.
    // ---------------------------------
    // a1, a2, b1, b2, c1, c2, d1, d2
    // a3, a4, b3, b4, c3, c4, d3, d4

#### Example 2: 8-bit elements

    // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout.
    // Element a1 is contiguous in memory with element b1, etc.
    // ---------------------------------
    // a1, b1, c1, d1
    // a2, b2, c2, d2
    // a3, b3, c3, d3
    // a4, b4, c4, d4
    // ---------------------------------
    // The same matrix reformatted in packed_b layout.
    // Here, packing of 4 elements is needed to form 32 bits.
    // Elements a1, a2, a3, a4 are contiguous in memory, etc. 
+
    // ---------------------------------
    // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4


## Example using int8_t type
```c++
using namespace sycl::ext::oneapi::experimental::matrix;

queue q;
range<2> G = {M/tM, N};
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
int32_t *memC = malloc_shared<int32_t>(M*N, q);
// Assuming memB has already been VNNIed
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
  const auto global_idx = item.get_global_id(0);
  const auto global_idy = item.get_global_id(1);
  const auto sg_startx = global_idx - item.get_local_id(0);
  const auto sg_starty = global_idy - item.get_local_id(1);
  sub_group sg = item.get_sub_group();
  joint_matrix<int8_t, tM, tK> tA(sg);
  // For B, since the current implementation does not support the non-packed layout,
  // users need to specify the packed_b layout
  joint_matrix<int8_t, tK, tN, matrix_layout::packed_b> tB(sg);
  joint_matrix<int32_t, tM, tN> tC(sg);
  joint_matrix_fill(sg, tC, 0);
  for (int k = 0; k < K; k += tK) {
    joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
    joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI
    tC = joint_matrix_mad(sg, tA, tB, tC);
  }
  auto wi_data_c = tC.get_wi_data();
  for (int i = 0; i < wi_data_c.length(); i++)
    wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
  joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
}).wait();
```

== Query Interface
Intel AMX, DPAS and Nvidia TPUs support different sizes and types.
The query interface is used to validate user code and inform the user about the types, sizes, scope, and layouts supported by the implementation.
This also improves development and tuning productivity for both scientists and library developers. 
The query interface we are proposing here is a compile-time query,
so there will be no runtime errors.
The query interface proposed here consists of three functionalities:

- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters.

- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c`, where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.

- General query: the general query interface provides information about the sizes, types, static/dynamic behavior, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and for efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.

The table below provides a description for each of the member variables and type aliases in the `tpu_params` class and the forms in which they are defined. 
+ +[frame="none",options="header"] +|====================== +| Member/type alias in `tpu_params` | Forms they are defined in |Description +|`type_a`| validation, default values|type alias for the type of matrix A +|`type_b`| validation, default values|type alias for the type of matrix B +|`type_c`| validation, default values|type alias for the type of matrix C +|`defaultM`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`defaultN`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C
|`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false)
|`numtiles`| validation, default values, general query|indicates the number of tiles in Intel AMX (does not apply to DPAS)
|`scope`| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation
|`combination` | validation, default values, general query|composes the types and sizes of A, B, C matrices allowed in one combination
|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero
|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value is one of the supported element sizes. 
By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero
|`atype`, `btype`, `ctype`| validation, default values, general query| indicates the types supported in the combination
|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes, so only this specific combination is returned in the combinations array.
|`num_combinations`| validation, default values, general query|indicates the number of combinations supported by the TPU implementation, which corresponds to the size of the `combinations` array
|======================






```c++
namespace sycl::ext::oneapi::experimental::matrix {


template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void,
          int M = 0, int N = 0, int K = 0, typename Enabled = void>
struct tpu_params;

// Validation form: Valid or not
// Specialization when both types and sizes are given
template <typename Ta, typename Tb, typename Tc, int M, int N, int K>
struct tpu_params<
    tpu::amx, Ta, Tb, Tc, M, N, K,
    typename std::enable_if<(
        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
        !std::is_same_v<Tc, void> && M != 0 && N != 0 && K != 0)>::type> {
  // Validate that parameters are supported
  static_assert(
      (M == 0 && N == 0 && K == 0) ||
          (is_combination_valid_amx<Ta, Tb, Tc>(M, N, K)),
      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
      "using: "
      "tpu_params<tpu::amx> myparams; and then check out myparams.combinations array");


  using type_a = Ta; // this type alias is not available in the current implementation
  using type_b = Tb; // this type alias is not available in the current implementation
  using type_c = Tc; // this type alias is not available in the current implementation

  // if combination is valid, construct the matrices

  static constexpr std::size_t defaultM = (M != 0) ? M : 16;
  static constexpr std::size_t defaultN = (N != 0) ? N : 16;
  static constexpr std::size_t defaultK =
      (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);

  template <typename Group, matrix_layout Layout = matrix_layout::row_major>
  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, Layout, Group>;
  template <typename Group, matrix_layout Layout = matrix_layout::packed_b>
  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, Layout, Group>;
  template <typename Group, matrix_layout Layout = matrix_layout::row_major>
  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, Layout, Group>;

  static constexpr bool dynamic_p = false; // should be true in future implementations
                                           // because Intel AMX hardware supports dynamic sizes
  static constexpr uint32_t numtiles = 8;
  static constexpr scope_t scope = scope_t::sub_group;
  struct combination {
    uint32_t max_msize;
    uint32_t max_nsize;
    uint32_t max_ksize;
    uint32_t msize;
    uint32_t nsize;
    uint32_t ksize;
    matrix_type atype;
    matrix_type btype;
    matrix_type ctype;
  };
  // In this case, the combinations array contains only the combination that the user provided
  static constexpr combination combinations[] = {
      {16, 16, (sizeof(Ta) == 1) ? 64 : 32, M, N, K}};
  static constexpr int num_combinations =
      sizeof(combinations) / sizeof(combination);
};

// Default values form: Sizes-only query
// Specialization for when only types are given, need to query only sizes
template <typename Ta, typename Tb, typename Tc>
struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
                                           !std::is_same_v<Tb, void> &&
                                           !std::is_same_v<Tc, void>)>::type> {
  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
                "Invalid types for Intel AMX, supported types are int8_t, uint8_t, "
                "and bf16 (Note that unsigned short should be used in the "
                "DPC++ code to implement bf16) ");

  using type_a = Ta; // this type alias is not available in the current implementation
  using type_b = Tb; // this type alias is not available in the current implementation
  using type_c = Tc; // this type alias is not available in the current implementation

  // construct the matrices using the default sizes
  static constexpr std::size_t defaultM = 16;
  static constexpr std::size_t defaultN = 16;
  static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 64 : 32);

  template <typename Group, matrix_layout Layout = matrix_layout::row_major>
  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, Layout, Group>;
  template <typename Group, matrix_layout Layout = matrix_layout::packed_b>
  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, Layout, Group>;
  template <typename Group, matrix_layout Layout = matrix_layout::row_major>
  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, Layout, Group>;

  static constexpr bool dynamic_p = false; // should be true in future implementations because
                                           // Intel AMX hardware supports dynamic sizes
  static constexpr uint32_t numtiles = 8;
  static constexpr scope_t scope = scope_t::sub_group;
  struct combination {
    uint32_t max_msize;
    uint32_t max_nsize;
    uint32_t max_ksize;
    uint32_t msize;
    uint32_t nsize;
    uint32_t ksize;
    matrix_type atype;
    matrix_type btype;
    matrix_type ctype;
  };
  // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc
  // types that the user provided
  static constexpr combination combinations[] = {
      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
  static constexpr int num_combinations =
      sizeof(combinations) / sizeof(combination);
};

// General query form:
// types are not given, no default sizes and no implicit matrix construction
template <>
struct tpu_params<tpu::amx> {
  static constexpr bool dynamic_p = false; // should be true in future implementations because
                                           // Intel AMX hardware supports dynamic sizes
  static constexpr uint32_t numtiles = 8;
  static constexpr scope_t scope = scope_t::sub_group;
  struct combination {
    uint32_t max_msize;
    uint32_t max_nsize;
    uint32_t max_ksize;
    uint32_t msize;
    uint32_t nsize;
    uint32_t ksize;
    matrix_type atype;
    matrix_type btype;
    matrix_type ctype;
  };

  static constexpr combination combinations[] = {
      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32},
      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32},
      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32},
      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32},
      {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}};
  static constexpr int num_combinations =
      sizeof(combinations) / sizeof(combination);
};


enum class tpu {
  dpas,
  amx
};

enum class matrix_type {
  bf16,
  fp16,
  fp19, // tfloat32
  fp32,
  fp64,
  sint2,
  sint4,
  sint8,
  sint16,
  sint32,
  sint64,
  uint2,
  uint4,
  uint8,
  uint16,
  uint32,
  uint64
};

enum class scope_t {
  sub_group,
  work_group
};
}
```


=== Validation Example:
```c++
// User can provide sizes besides the types and tpu_params can assert if they are supported or not
// in this case, an assertion will happen as 16 is not a supported size for M on DPAS
using myparams = tpu_params<tpu::dpas, int8_t, int8_t, int, 16, 8, 32>;
size_t NDRangeM = M / myparams::defaultM; //Assertion would happen at this line
size_t NDRangeN = N / myparams::defaultN;
```

=== Default Values Example:
```c++
using myparams = tpu_params<tpu::amx, int8_t, int8_t, int32_t>;
// use this to construct the ranges on the host side
size_t NDRangeM = M / myparams::defaultM;
size_t NDRangeN = N / myparams::defaultN;
// if M, N, K are not multiples of the default sizes, padding has to be done
// device code: the matrices are constructed using the default dimensions
myparams::joint_matrix_a<sub_group> sub_a(sg);
myparams::joint_matrix_b<sub_group> sub_b(sg);
myparams::joint_matrix_c<sub_group> sub_c(sg);

```

=== General Query Example:
```c++
constexpr int M = 1500; // with msize = 8 and msize = 4,
                        // M can be broken up into 125 sequences of 8-sized ops, with the remaining 500 rows handled using 125 sequences of 4-sized ops
tpu_params<tpu::dpas> params;
constexpr int msize = break_dimension(params, M);
constexpr int msize_remainder = break_dimension_remainder(params, M);
constexpr int nsize = params.combinations[0].nsize;
constexpr int ksize = params.combinations[0].ksize;
// device code:
joint_matrix<int8_t, msize, ksize> sub_a(sg);
joint_matrix<int8_t, ksize, nsize, matrix_layout::packed_b> sub_b(sg);
joint_matrix<int, msize, nsize> sub_c(sg);
// Remainder handling
```

//No don't need to provide more details in this section because the query interface can serve this.

//## Implementation Status

//### oneAPI 2022.0 release
//For oneAPI 2022.0 release, a JIT implementation has been made available on both Intel AMX and DPAS hardware of the specific features discussed above. In this case, there is no need to specify any architectural options to the command line. The static query interface can be used to guide the usage of this API.
// The DPAS and Intel AMX implementations support the logical capability support of the HW




## Future-looking API

### Memory scope
The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long-term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. 
+


```c++
multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);
```
We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.

### WI data to joint matrix mapping coordinates information for piece-wise operations
The indexing provided inside the `wi_data` class accesses only the portion of the current WI. It is not possible to know the location or coordinates of this portion in the original matrix. This coordinate mapping is implementation defined and changes from one backend to another. For general piece-wise operations like the sum of the rows of a matrix, the WI data to joint matrix mapping coordinates information is needed to reason about the matrix view.
With joint matrix, we want to write, as much as possible, one code that runs on different backends. So if, for instance, backend X states that a WI owns exactly one row of the matrix, then the following code will work only on that backend and for that version of the hardware. Hardware and implementations change: for instance, the same WI may later own only half of the row because the SG size or the number of hardware units increased.

```c++
auto data = C.get_wi_data();
for (int i = 0; i < data.length(); ++i) {
  sum_of_local_rows[row] += data[i]; // "row" hard-codes an implementation-specific WI-to-matrix mapping
}
```



We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we write code that is general and backend- and target-agnostic, especially in the JIT compilation mode of SYCL. This is possible by querying this mapping, so code does not have to change from one version to another. 
So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like:
```c++
auto data = C.get_wi_data();
for (int i = 0; i < data.length(); ++i) {
  auto [row, col] = data[i].get_coord();
  sum_of_local_rows[row] += data[i];
}
```


## Open Questions
- Besides row major, column major, and packed (VNNI) layouts, what are the additional layouts that should absolutely be added?
- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI layout in matrix A and the VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX?
-- Yes, this will be addressed in the next revision where the `use` argument will be introduced to distinguish between the right (B), left (A), and accumulator matrix.
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"

- In the future-looking APIs, `get_wi_data` (which is currently under design) returns an owned object. Should this instead return a view object, to make sure the original matrix C is changed after its slices are modified?

## TODO List
- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or as new methods of the `wi_data` class.
- Add a `matrix_use` parameter to the matrix to distinguish between matrix A, matrix B, and the matrix accumulator. This is necessary for supporting the VNNI and transpose transforms
- Change the names of the default sizes in the query from `defaultM`, `defaultN`, `defaultK` to `M`, `N`, `K`
- Change the type of `scope` in the query interface to be able to return more than one value. 
This will be useful in the event we support other scopes like workgroup besides subgroups +- Add a more realistic and complete example that shows the value of the general query + + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. +|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS +|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support +|====================== From 5eb53edf9bcc097eaeef0166951eb6cf4358b5bd Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 9 Sep 2022 11:18:55 -0700 Subject: [PATCH 04/27] address Jack's comments and corrections --- .../sycl_ext_oneapi_matrix.asciidoc | 59 +++++++++---------- 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 903e614721760..ba62e22f475f1 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -56,7 +56,7 @@ value to determine which of the extension's APIs the implementation supports. |====================== ## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the matrix use, the memory layout, and the memory scope of the matrix. This results into the following description: +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the matrix use, the memory layout, and the memory scope of the matrix. 
This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -69,7 +69,7 @@ struct joint_matrix { ``` #### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. +The same class, `joint_matrix`, should handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. IMPORTANT: In the current implementation, only the static extent is supported @@ -90,7 +90,7 @@ IMPORTANT: In both AMX and DPAS support, `use` template parameter is required #### Layout -Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. +Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -106,7 +106,7 @@ enum class matrix_layout { IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified #### Memory Scope -In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. IMPORTANT: In the current implementation, only the subgroup scope is supported @@ -118,13 +118,13 @@ joint_matrix tA(sg); ## Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. +We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. -The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. 
+The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed`). `stride` describes the number of elements between consecutive rows for row major and packed layouts, or between columns for the column major layout. -Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. -IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. +IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. 
Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. @@ -136,9 +136,8 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, + matrix_use U, access::address_space Space> + void joint_matrix_load(Group sg, joint_matrix &res, multi_ptr src, size_t stride, matrix_layout memL); } ``` @@ -149,9 +148,8 @@ This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group sg, joint_matrix &res, + matrix_use U, access::address_space Space> + void joint_matrix_store(Group sg, joint_matrix &res, multi_ptr src, size_t stride, matrix_layout memL); } ``` @@ -162,18 +160,18 @@ This function stores the data from the 2d tiles back to memory. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, joint_matrix A, - joint_matrix B, joint_matrix C); + std::size_t M, std::size_t K, std::size_t N> + joint_matrix joint_matrix_mad(Group sg, + joint_matrix A, + joint_matrix B, + joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. #### Matrix Initialization: `joint_matrix_fill` -The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. 
On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic: +The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -195,11 +193,11 @@ Class 2- Piece-wise operations where the operation depends on the element index // We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. ##### Explicit conversion with mapping from SIMD to SPMD -The data elements in a joint_matrix are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix. 
In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. +The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. In order to be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. -We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. So modifying `wi_data` means also modifying the joint matrix corresponding elements. The indexing provided inside the `wi_data` class acesses only the portion of the current WI and returns `wi_element`. This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. 
+We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns `wi_element`. The latter holds a reference to the original joint_matrix that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. -Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and change from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping coordinates information must be known to reason about the matrix view and extract the relevant piece. But for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. +Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and changes from one backend to the other. For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation.
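To illustrate this guarantee concretely, the loop below models it in standalone C++ outside of SYCL. The helper names `wi_slice` and `get_wi_data_model` are hypothetical, and the contiguous-chunk partition is only one possible implementation-defined distribution; the point is that each work item touches only its own `length()` elements, yet the slices together cover the whole matrix.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual model only: a contiguous block partition is ONE possible
// implementation-defined distribution of joint_matrix elements among
// the work items of a sub-group.
struct wi_slice {
  float* data;      // start of this work item's portion
  std::size_t len;  // number of elements owned by this work item
  std::size_t length() const { return len; }
  float& operator[](std::size_t i) { return data[i]; }
};

// Hypothetical stand-in for get_wi_data(): give work item `wi` its slice
// of a matrix distributed across `sg_size` work items.
inline wi_slice get_wi_data_model(std::vector<float>& mat, std::size_t wi,
                                  std::size_t sg_size) {
  std::size_t chunk = mat.size() / sg_size;  // assumes sg_size divides the size
  return {mat.data() + wi * chunk, chunk};
}

// Every work item applies the same operation to its own slice; the whole
// matrix ends up scaled without any knowledge of the element mapping.
inline void scale_all(std::vector<float>& mat, std::size_t sg_size,
                      float alpha) {
  for (std::size_t wi = 0; wi < sg_size; ++wi) {  // SPMD WIs, serialized here
    wi_slice wi_data = get_wi_data_model(mat, wi, sg_size);
    for (std::size_t i = 0; i < wi_data.length(); ++i)
      wi_data[i] *= alpha;  // mirrors: wi_data_c[i] *= alpha;
  }
}
```

The per-WI loop bound comes only from `length()`, which is exactly why element-wise operations do not require knowing the WI-to-element mapping.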
Therefore, this extension currently only supports class 1 of operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from `joint_matrix` domain to WI Domain (See Section "WI data to joint matrix mapping coordinates information for piece-wise operations for more information"). @@ -207,7 +205,7 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th 1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. -2- SG size is not fixed (like in the CUDA backend where warp size is always 32). +2- SG size is not generally fixed. 3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`). @@ -249,16 +247,16 @@ for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C ``` -IMPORTANT: In the current implementation, only the subgroup scope is supported. +IMPORTANT: In the current implementation, only the subgroup scope is supported. -IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. +IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend. ## VNNI/Packed Layout Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. 
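As a sketch of this re-layout, the following standalone C++ function (a hypothetical helper, not part of the extension) performs the VNNI packing described here for a row-major `K x N` matrix with blocking factor `vf` (2 for 16-bit types, 4 for 8-bit types):

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceptual VNNI re-pack of a row-major K x N matrix. vf is the blocking
// factor, chosen so that vf elements along the K dimension form 32
// contiguous bits. Element (k, n) of the row-major input lands at packed
// row k / vf, packed column n * vf + (k % vf); the packed matrix is
// (K / vf) x (N * vf). For the 4 x 4, 16-bit example in this section,
// row 0 of the packed result is a1, a2, b1, b2, c1, c2, d1, d2.
std::vector<uint16_t> vnni_pack(const std::vector<uint16_t>& src,
                                std::size_t K, std::size_t N,
                                std::size_t vf) {
  std::vector<uint16_t> dst(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      dst[(k / vf) * (N * vf) + n * vf + (k % vf)] = src[k * N + n];
  return dst;
}
```

Note that the packed matrix has the same number of elements; only the element order in memory changes so that the `vf` K-neighbors become contiguous.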
While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type. +The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. #### Example 1: 16-bit elements // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. @@ -269,7 +267,7 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca // a3, b3, c3, d3 // a4, b4, c4, d4 // --------------------------------- - // The same matrix reformatted in packed_b layout. + // The same matrix reformatted in packed layout. // Here, packing of 2 elements is needed to form 32 bits. // Element a1 is contiguous in memory with element a2, etc. // --------------------------------- @@ -286,7 +284,7 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca // a3, b3, c3, d3 // a4, b4, c4, d4 // --------------------------------- - // The same matrix reformatted in packed_b layout. + // The same matrix reformatted in packed layout. // Here, packing of 4 elements is needed to form 32 bits. // Elements a1, a2, a3, a4 are contiguous in memory, etc. // --------------------------------- @@ -625,9 +623,6 @@ for (int i = 0; i < length; ++i) { ## Open Questions -- Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? 
-- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX? --- Yes, this will be addressed in the next revision where `use` argument will be introduced to distinguish between right (B) , left (A), and accumulator matrix. - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" - In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified. From 90a04f51e90d646d9b47e0afba1124b6aaaa822f Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 9 Sep 2022 11:33:44 -0700 Subject: [PATCH 05/27] replace matrix_use with use and matrix_layout with layout --- .../sycl_ext_oneapi_matrix.asciidoc | 76 +++++++++---------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index ba62e22f475f1..b49c0ed003326 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -61,7 +61,7 @@ We introduce a new class called `joint_matrix`. 
The user needs to specify the ty ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use use, layout Layout = layout::unused, typename Group = sub_group> struct joint_matrix { joint_matrix(Group g) {} }; @@ -78,7 +78,7 @@ Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumua ```c++ namespace sycl::ext::oneapi::experimental::matrix { -enum class matrix_use { +enum class use { a, b, accumulator, @@ -90,11 +90,11 @@ IMPORTANT: In both AMX and DPAS support, `use` template parameter is required #### Layout -Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. +Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. ```c++ namespace sycl::ext::oneapi::experimental::matrix { -enum class matrix_layout { +enum class layout { row_major, col_major, packed, @@ -113,7 +113,7 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported When the group is a `sycl::sub_group`, a matrix is declared as follows: ```c++ -joint_matrix tA(sg); +joint_matrix tA(sg); ``` @@ -136,9 +136,9 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout memL); + use U, access::address_space Space> + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, layout memL); } ``` This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. 
@@ -148,9 +148,9 @@ This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout memL); + use U, access::address_space Space> + void joint_matrix_store(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, layout memL); } ``` This function stores the data from the 2d tiles back to memory. @@ -161,10 +161,10 @@ This function stores the data from the 2d tiles back to memory. namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, - joint_matrix C); + joint_matrix joint_matrix_mad(Group sg, + joint_matrix A, + joint_matrix B, + joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. @@ -176,7 +176,7 @@ The current interface presented above assumes that all the matrices are directly ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use U, layout L, typename Tv> void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` @@ -216,18 +216,18 @@ The code listing below shows a synopsis of these new APIs. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template struct joint_matrix { wi_data get_wi_data(); }; -template +template class wi_data { size_t length(); wi_element operator[](size_t i); }; template class wi_element { operator T(); @@ -314,14 +314,14 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) joint_matrix tC(sg); joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tk) { - joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major); - joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed); // VNNI + joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, layout::row_major); + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, layout::packed); // VNNI tC = joint_matrix_mad(sg, tA, tB, tC); } auto wi_data_c = matC.get_wi_data(); for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C - joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major); + joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); }).wait(); ``` @@ -403,12 +403,12 @@ struct tpu_params< static constexpr std::size_t defaultK = (K != 0) ? K : ((sizeof(Ta) == 1) ? 
64 : 32); - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_c = joint_matrix; + template + using joint_matrix_a = joint_matrix; + template + using joint_matrix_b = joint_matrix; + template + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations // because Intel AMX hardware supports dynamic sizes @@ -453,12 +453,12 @@ struct tpu_params - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_c = joint_matrix; + template + using joint_matrix_a = joint_matrix; + template + using joint_matrix_b = joint_matrix; + template + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes @@ -580,9 +580,9 @@ constexpr int msize_remainder = break_dimension_remainder(params, M); constexpr int nsize = params.combinations[0].nsize; constexpr int ksize = params.combinations[0].ksize; // device code: -joint_matrix sub_a(sg); -joint_matrix sub_b(sg); -joint_matrix sub_c(sg); +joint_matrix sub_a(sg); +joint_matrix sub_b(sg); +joint_matrix sub_c(sg); //Remainder handling ``` @@ -593,7 +593,7 @@ The current experimental API uses `joint_` semantics to define the memory scope ```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); +multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); ``` We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. 
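To make the default-values pattern of the `tpu_params` specializations shown earlier concrete, here is a simplified standalone C++ sketch. The struct name and the M/N defaults of 16 are illustrative assumptions; only the `defaultK` derivation is taken from the listing above.

```c++
#include <cstddef>
#include <cstdint>

// Simplified stand-in for the tpu_params "default values" form:
// if the user passes 0 for a dimension, derive a default from the
// element type, mirroring
//   defaultK = (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);
template <typename Ta, std::size_t M = 0, std::size_t N = 0,
          std::size_t K = 0>
struct params_sketch {
  static constexpr std::size_t defaultM = (M != 0) ? M : 16;  // assumed
  static constexpr std::size_t defaultN = (N != 0) ? N : 16;  // assumed
  static constexpr std::size_t defaultK =
      (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);
};

// Host-side range construction from the defaults, as in the spec's
// "use this to construct the ranges on the host side" example.
constexpr std::size_t nd_range_m(std::size_t total_m,
                                 std::size_t default_m) {
  return total_m / default_m;  // assumes total_m is a multiple of default_m
}
```

Because everything is `constexpr`, an unsupported combination can be rejected at compile time, which is the "no runtime errors" property the query interface aims for.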
From 40f888613221865cebb73958f88a81d8238c8653 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 9 Sep 2022 11:53:10 -0700 Subject: [PATCH 06/27] Move the order of use in joint matrix type to before the first optional argument --- .../sycl_ext_oneapi_matrix.asciidoc | 99 ++++++++++--------- 1 file changed, 50 insertions(+), 49 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index b49c0ed003326..8c0de08344633 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -60,8 +60,9 @@ We introduce a new class called `joint_matrix`. The user needs to specify the ty ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template +template struct joint_matrix { joint_matrix(Group g) {} }; @@ -91,7 +92,7 @@ IMPORTANT: In both AMX and DPAS support, `use` template parameter is required #### Layout Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. - + ```c++ namespace sycl::ext::oneapi::experimental::matrix { enum class layout { @@ -113,7 +114,7 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported When the group is a `sycl::sub_group`, a matrix is declared as follows: ```c++ -joint_matrix tA(sg); +joint_matrix tA(sg); ``` @@ -123,7 +124,7 @@ We define three new functions needed to perform the main and common operations o The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed`). 
`stride` describes the number of elements between consecutive rows for row major and packed layouts, or between columns for the column major layout. Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. - + IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. @@ -132,25 +133,27 @@ To be aligned with the SYCL 2020 group algorithms, an additional group argument IMPORTANT: In the current implementation, only the subgroup scope is supported. -#### Load +#### Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, layout memL); + void joint_matrix_load(Group sg, + joint_matrix &res, + multi_ptr src, size_t stride, layout memL); } ``` This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. 
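For intuition, the base-pointer and `stride` addressing that `joint_matrix_load` uses for a `row_major` matrix can be modeled in standalone C++ (a conceptual sketch with a hypothetical helper, not the actual implementation):

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceptual model of a row-major tile load: copy a tile_m x tile_k tile
// out of a larger row-major buffer. `src` points at the tile's first
// element and `stride` is the element distance between consecutive rows
// of the tile in memory (the full row length K for a row_major matrix).
std::vector<int8_t> load_tile_model(const int8_t* src, std::size_t stride,
                                    std::size_t tile_m, std::size_t tile_k) {
  std::vector<int8_t> tile(tile_m * tile_k);
  for (std::size_t r = 0; r < tile_m; ++r)
    for (std::size_t c = 0; c < tile_k; ++c)
      tile[r * tile_k + c] = src[r * stride + c];
  return tile;
}
```

With a row-major `M x K` buffer, the tile at block row `sg_startx` and column offset `k` starts at `base + sg_startx * tM * K + k` with `stride = K`, matching the indexing used in the multiplication example later in this document.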
-#### Store +#### Store ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, layout memL); + use U, access::address_space Space> + void joint_matrix_store(Group sg, + joint_matrix &res, + multi_ptr src, size_t stride, layout memL); } ``` This function stores the data from the 2d tiles back to memory. @@ -161,10 +164,10 @@ This function stores the data from the 2d tiles back to memory. namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, - joint_matrix C); + joint_matrix joint_matrix_mad(Group sg, + joint_matrix A, + joint_matrix B, + joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. @@ -177,12 +180,12 @@ The current interface presented above assumes that all the matrices are directly namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` IMPORTANT: In the current implementation, only the subgroup scope is supported. -#### Element Indexing and Piece-Wise Operations +#### Element Indexing and Piece-Wise Operations ##### Background Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. 
The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: @@ -219,7 +222,7 @@ template struct joint_matrix { - wi_data get_wi_data(); + wi_data get_wi_data(); }; template class wi_data { @@ -242,8 +245,8 @@ Vectorization along the subgroup dimension will get enabled automatically to vec ```c++ -auto wi_data_c = matC.get_wi_data(); -for (int i = 0; i < wi_data_c.length(); i++) +auto wi_data_c = matC.get_wi_data(); +for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C ``` @@ -309,27 +312,27 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) const auto sg_startx = global_idx - item.get_local_id(0); const auto sg_starty = global_idy - item.get_local_id(1); sub_group sg = item.get_sub_group(); - joint_matrix tA(sg); - joint_matrix tB(sg); - joint_matrix tC(sg); + joint_matrix tA(sg); + joint_matrix tB(sg); + joint_matrix tC(sg); joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tk) { joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, layout::row_major); joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, layout::packed); // VNNI tC = joint_matrix_mad(sg, tA, tB, tC); } - auto wi_data_c = matC.get_wi_data(); - for (int i = 0; i < wi_data_c.length(); i++) + auto wi_data_c = matC.get_wi_data(); + for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); }).wait(); ``` == Query Interface -Intel AMX, DPAS and Nvidia TPUs support different sizes and types. +Intel AMX, DPAS and Nvidia TPUs support different sizes and types. The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. 
This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, -so there will be no runtime errors. +so there will be no runtime errors. The query interface proposed here consists of three functionalities: - Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. @@ -370,8 +373,6 @@ The table below provides a description for each of the member variables and type ```c++ namespace sycl::ext::oneapi::experimental::matrix { - - template struct tpu_params; @@ -404,11 +405,11 @@ struct tpu_params< (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32); template - using joint_matrix_a = joint_matrix; + using joint_matrix_a = joint_matrix; template - using joint_matrix_b = joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_c = joint_matrix; + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations // because Intel AMX hardware supports dynamic sizes @@ -443,22 +444,22 @@ struct tpu_params - using joint_matrix_a = joint_matrix; + using joint_matrix_a = joint_matrix; template - using joint_matrix_b = joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_c = joint_matrix; + using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes @@ -558,21 +559,21 @@ size_t NDRangeN = N / myparams::defaultN; === Default Values Example: ```c++ -using myparams = tpu_params_both; -// use this to construct the ranges on the host side -size_t NDRangeM = M / myparams::defaultM; +using myparams = tpu_params_both; +// use this to construct the ranges on the host side +size_t NDRangeM = M / myparams::defaultM; size_t NDRangeN = N / 
myparams::defaultN; -//if M,N,K do not multiply the default sizes, padding has to be done -// device code: the matrices are constructed using the default dimensions -myparams::joint_matrix_a sub_a(sg); -myparams::joint_matrix_b sub_b(sg); +//if M,N,K do not multiply the default sizes, padding has to be done +// device code: the matrices are constructed using the default dimensions +myparams::joint_matrix_a sub_a(sg); +myparams::joint_matrix_b sub_b(sg); myparams::joint_matrix_c sub_c(sg); ``` === General Query Example: ```c++ -constexpr int M = 1500; // with msize = 8 and msize = 4, +constexpr int M = 1500; // with msize = 8 and msize = 4, // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops tpu_params params; constexpr int msize = break_dimension(params, M); @@ -580,9 +581,9 @@ constexpr int msize_remainder = break_dimension_remainder(params, M); constexpr int nsize = params.combinations[0].nsize; constexpr int ksize = params.combinations[0].ksize; // device code: -joint_matrix sub_a(sg); -joint_matrix sub_b(sg); -joint_matrix sub_c(sg); +joint_matrix sub_a(sg); +joint_matrix sub_b(sg); +joint_matrix sub_c(sg); //Remainder handling ``` From 7843cff15851d893dd9d925017a75c8a5b34a85d Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 14 Sep 2022 14:08:07 -0500 Subject: [PATCH 07/27] Incorporate Jack's corrections --- .../sycl_ext_oneapi_matrix.asciidoc | 31 +++++++++---------- 1 file changed, 14 insertions(+), 17 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 8c0de08344633..9a8bb2c1a4489 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -75,23 +75,23 @@ The same class, `joint_matrix`, should handle both 
cases where sizes are constan IMPORTANT: In the current implementation, only the static extent is supported #### Use -Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumuator (C) is required by backend implementations to reason about the layout of the matrix in regiters. +Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. ```c++ namespace sycl::ext::oneapi::experimental::matrix { enum class use { a, b, - accumulator, + accumulator }; } ``` -IMPORTANT: In both AMX and DPAS support, `use` template parameter is required +IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required #### Layout -Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. +Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as packed layout. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -99,12 +99,12 @@ enum class layout { row_major, col_major, packed, - unused, + unused }; } ``` -IMPORTANT: In both AMX and DPAS support, layout template parameter is unused and will be ignored if specified +IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified #### Memory Scope In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. @@ -254,12 +254,11 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. 
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend. - +IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend. ## VNNI/Packed Layout -Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. +Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. +The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. #### Example 1: 16-bit elements // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
@@ -303,7 +302,7 @@ range<2> G = {M/tM, N}; range<2> L = {1, SG_SIZE}; int8_t *memA = malloc_shared(M*K, q); int8_t *memB = malloc_shared(K*N, q); -Int32_t *memC = malloc_shared(M*N, q); +int32_t *memC = malloc_shared(M*N, q); // Assuming memB has already been VNNIed q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) [[sycl::reqd_sub_group_size(SG_SIZE)]] { @@ -339,7 +338,7 @@ The query interface proposed here consists of three functionalities: - Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. -- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. +- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. 
This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. @@ -599,8 +598,8 @@ multi_ptr, address_space::local_space> tA_ptr = group_local_memory Date: Thu, 22 Sep 2022 14:56:20 -0500 Subject: [PATCH 08/27] Incorporate some of Greg's comments. The other ones are still under discussion --- .../sycl_ext_oneapi_matrix.asciidoc | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 9a8bb2c1a4489..ec96f4f6009fa 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -137,9 +137,9 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use Use, access::address_space Space> void joint_matrix_load(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); } ``` @@ -150,9 +150,9 @@ This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use Use, access::address_space Space> void joint_matrix_store(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); } ``` @@ -179,8 +179,8 @@ The current interface presented above assumes that all the matrices are directly ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + use Use, layout Layout, typename Tv> + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` IMPORTANT: In the current implementation, only the subgroup scope is supported. @@ -219,18 +219,18 @@ The code listing below shows a synopsis of these new APIs. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template struct joint_matrix { wi_data get_wi_data(); }; -template +template class wi_data { size_t length(); wi_element operator[](size_t i); }; template class wi_element { operator T(); From 1e53a9c9ce02d2ce7032d5f5d8c46b68d4c50317 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 22 Sep 2022 15:07:00 -0500 Subject: [PATCH 09/27] Update the references in sycl/ReleaseNotes.md to point to the new matrix spec file --- sycl/ReleaseNotes.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/sycl/ReleaseNotes.md b/sycl/ReleaseNotes.md index 861a1f3ce163e..2e68b535cd32d 100644 --- a/sycl/ReleaseNotes.md +++ b/sycl/ReleaseNotes.md @@ -417,7 +417,7 @@ Release notes for commit range 23ca0c2..27f59d8 Level Zero, ESIMD emulator, HIP [2b0ebab376dc] - Added support for `sycl::ext::intel::experimental::esimd_ballot` function [0bbb091c1baa] - - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) + - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [711ba58c30a8] ### Documentation @@ -809,7 +809,7 @@ Release notes for commit range 4fc5ebe..bd68232 - Added [sRGBA support](doc/extensions/supported/sycl_ext_oneapi_srgb.asciidoc) [e488327][191efdd] - Added a preview feature implementation for the DPC++ experimental - [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) + [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [7f218531] [a95f46d] - Added support for SYCL 2020 exceptions [5c0f748][eef07606][5af8c43d] - Added support for [sycl_ext_intel_bf16_conversion extension](doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc) @@ -1073,7 +1073,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 for querying of free device memory in LevelZero 
backend extension [fa428bf] - Added [InvokeSIMD](doc/extensions/proposed/sycl_ext_oneapi_invoke_simd.asciidoc) and [Uniform](doc/extensions/proposed/sycl_ext_oneapi_uniform.asciidoc) extensions [72e1611] - - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) [ace4c733] + - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [ace4c733] - Implemented SYCL 2020 `sycl::span` [9356d53] - Added [device-if](doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) extension [4fb95fc] @@ -1219,7 +1219,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 - Fixed build issue when CUDA 11 is used [f7224f1] - Fixed caching of sub-devices in Level Zero backend[4c34f93] - Fixed requesting of USM memory allocation info on CUDA [691f842] - - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) + - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) behaviour to return `A*B+C` instead of assigning the result to `C` [ea59c2b] - Workaround an issue in Level Zero backend when event isn't waited upon its completion but is queried for its status in an infinite loop [bfef316] From b2b5750260baeb7064c9bed942338c65c0aaab34 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Fri, 23 Sep 2022 11:36:25 -0500 Subject: [PATCH 10/27] Adding layout as template argument to load,store, mad functions --- .../sycl_ext_oneapi_matrix.asciidoc | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index ec96f4f6009fa..b0904369c26ec 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ 
b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -137,9 +137,9 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use Use, layout Layout, access::address_space Space> void joint_matrix_load(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); } ``` @@ -150,9 +150,9 @@ This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + use Use, layout Layout, access::address_space Space> void joint_matrix_store(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); } ``` @@ -162,11 +162,11 @@ This function stores the data from the 2d tiles back to memory. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template + template joint_matrix joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, + joint_matrix A, + joint_matrix B, joint_matrix C); } ``` @@ -222,12 +222,12 @@ template struct joint_matrix { - wi_data get_wi_data(); + wi_data get_wi_data(); }; template class wi_data { size_t length(); - wi_element operator[](size_t i); + wi_element operator[](size_t i); }; template Date: Fri, 23 Sep 2022 12:40:36 -0500 Subject: [PATCH 11/27] Adding clarification about layout based on Jack's review --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index b0904369c26ec..3beb0f966ceec 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -125,7 +125,7 @@ The 
base pointer determines the starting address of the matrix to be loaded/stor Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. -IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed` or `row_major`. Automatic VNNI transform is supported. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. +IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. 
From 867f280f21797d44b11d80771f2a7d2226b29ca3 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 6 Oct 2022 10:44:06 -0500 Subject: [PATCH 12/27] Moving towards a more convergent API between AMX, XMX and CUDA backend by: - correct the way we use the feature test macro - Add reference to the implementation macro - Change the example to be portable on NVIDIA tensor cores as well - Replace unused by dynamic and add invalid value to avoid default values for A and B - Add overload of load function to match the CUDA backend implementation --- .../sycl_ext_oneapi_matrix.asciidoc | 70 ++++++++++++------- 1 file changed, 44 insertions(+), 26 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 3beb0f966ceec..22664e442c4b1 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -51,8 +51,19 @@ value to determine which of the extension's APIs the implementation supports. [frame="none",options="header"] |====================== |Value |Description -|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported -|2 |joint matrix type has a new `use` parameter. `layout` on matrix is unused. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported +|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. +|====================== + +## Matrix API Versions + +While this document presents the core API that unifies Intel AMX, DPAS, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. 
For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro. + +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The exact API used for this implementation is detailed in [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc) +|2 |joint matrix type has a new `use` parameter. `layout` on matrix is optional for `C` matrix. JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported +|3 |Implementation on Nvidia Tensor Cores |====================== ## New `joint_matrix` class @@ -62,7 +73,8 @@ We introduce a new class called `joint_matrix`. The user needs to specify the ty namespace sycl::ext::oneapi::experimental::matrix { template + layout Layout = (Use == use::accumulator) ? layout::dynamic : layout::invalid, + typename Group = sub_group> struct joint_matrix { joint_matrix(Group g) {} }; @@ -87,9 +99,6 @@ enum class use { } ``` -IMPORTANT: In both AMX and DPAS support, the `use` template parameter is required - - #### Layout Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as packed layout. @@ -99,12 +108,13 @@ enum class layout { row_major, col_major, packed, - unused + dynamic, + invalid }; } ``` -IMPORTANT: In both AMX and DPAS support, the `layout` template parameter is unused and will be ignored if specified +IMPORTANT: In both AMX and DPAS support, the `layout` template parameter can be `dynamic` for Matrix A and B as well. 
#### Memory Scope In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. @@ -136,15 +146,24 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. #### Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { + template + void joint_matrix_load(Group sg, + joint_matrix &res, + multi_ptr src, size_t stride, layout memL); + template void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, layout memL); + multi_ptr src, size_t stride); } ``` -This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. +`joint_matrix_load` loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. +We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. +The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`. +The second overload without a memory layout uses static_assert must not used with a `joint_matrix` type that was declared with `layout::dynamic`. #### Store ```c++ @@ -163,11 +182,11 @@ This function stores the data from the 2d tiles back to memory. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, + layout LayoutA, layout LayoutB> + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, joint_matrix B, - joint_matrix C); + joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. @@ -219,7 +238,7 @@ The code listing below shows a synopsis of these new APIs. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template struct joint_matrix { wi_data get_wi_data(); @@ -230,7 +249,7 @@ class wi_data { wi_element operator[](size_t i); }; template class wi_element { operator T(); @@ -303,7 +322,6 @@ range<2> L = {1, SG_SIZE}; int8_t *memA = malloc_shared(M*K, q); int8_t *memB = malloc_shared(K*N, q); int32_t *memC = malloc_shared(M*N, q); -// Assuming memB has already been VNNIed q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) [[sycl::reqd_sub_group_size(SG_SIZE)]] { const auto global_idx = item.get_global_id(0); @@ -311,13 +329,13 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) const auto sg_startx = global_idx - item.get_local_id(0); const auto sg_starty = global_idy - item.get_local_id(1); sub_group sg = item.get_sub_group(); - joint_matrix tA(sg); - joint_matrix tB(sg); + joint_matrix tA(sg); + joint_matrix tB(sg); joint_matrix tC(sg); joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tk) { - joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, layout::row_major); - joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, layout::packed); // VNNI + joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); tC = joint_matrix_mad(sg, tA, tB, tC); } auto wi_data_c = matC.get_wi_data(); @@ -403,11 +421,11 @@ struct tpu_params< static constexpr std::size_t defaultK = (K != 0) ? K : ((sizeof(Ta) == 1) ? 
64 : 32); - template + template using joint_matrix_a = joint_matrix; - template + template using joint_matrix_b = joint_matrix; - template + template using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations @@ -453,11 +471,11 @@ struct tpu_params + template using joint_matrix_a = joint_matrix; - template + template using joint_matrix_b = joint_matrix; - template + template using joint_matrix_c = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because From db9cfda3b04f0f7c3b8e4bbab37b9c39958ca281 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 6 Oct 2022 10:49:32 -0500 Subject: [PATCH 13/27] typos correction --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 22664e442c4b1..a6f95525f1ef5 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -124,7 +124,7 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported When the group is a `sycl::sub_group`, a matrix is declared as follows: ```c++ -joint_matrix tA(sg); +joint_matrix tA(sg); ``` @@ -163,7 +163,7 @@ namespace sycl::ext::oneapi::experimental::matrix { `joint_matrix_load` loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`. 
-The second overload without a memory layout uses static_assert must not used with a `joint_matrix` type that was declared with `layout::dynamic`. +The second overload without a memory layout must not be used with a `joint_matrix` type that was declared with `layout::dynamic`. #### Store ```c++ From f5b3d83ad92fd4ac1ed9a1bdd4bb23cafd205bcf Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 6 Oct 2022 10:57:15 -0500 Subject: [PATCH 14/27] For consistency, replace ctype with accumulatortype and joint_matrix_c with joint_matrix_accumulator in the query interface --- .../sycl_ext_oneapi_matrix.asciidoc | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index a6f95525f1ef5..ec0ae98973395 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -354,9 +354,9 @@ The query interface proposed here consists of three functionalities: - Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. -- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. +- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. 
This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. -- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. +- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. 
The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. @@ -365,20 +365,20 @@ The table below provides a description for each of the member variables and type | Member/type alias in `tpu_params` | Forms they are defined in |Description |`type_a`| validation, default values|type alias for the type of matrix A |`type_b`| validation, default values|type alias for the type of matrix B -|`type_c`| validation, default values|type alias for the type of matrix C +|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator |`defaultM`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation |`defaultN`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation |`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation |`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A |`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B -|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C +|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator |`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false) |numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to DPAS) |scope| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation -|`combination` | validation, default values, general query|composes the types and sizes of A, B, C matrices allowed in one combination +|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination |`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero |`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. 
By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `ctype`| validation, default values, general query| indicates the types supported in the combination +|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination |`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. 
|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array |====================== @@ -412,7 +412,7 @@ struct tpu_params< using type_a = Ta; // this type alias is not available in the current implementation using type_b = Tb; // this type alias is not available in the current implementation - using type_c = Tc; // this type alias is not available in the current implementation + using type_accumulator = Tc; // this type alias is not available in the current implementation // if combination is valid, construct the matrices @@ -426,7 +426,7 @@ struct tpu_params< template using joint_matrix_b = joint_matrix; template - using joint_matrix_c = joint_matrix; + using joint_matrix_accumulator = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations // because Intel AMX hardware supports dynamic sizes @@ -441,7 +441,7 @@ struct tpu_params< uint32_t ksize; matrix_type atype; matrix_type btype; - matrix_type ctype; + matrix_type accumulatortype; }; // In this case, the combinations array contains only the combination that the user provided static constexpr combination combinations[] = { @@ -476,7 +476,7 @@ struct tpu_params using joint_matrix_b = joint_matrix; template - using joint_matrix_c = joint_matrix; + using joint_matrix_accumulator = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes @@ -491,7 +491,7 @@ struct tpu_params { uint32_t ksize; matrix_type atype; matrix_type btype; - matrix_type ctype; + matrix_type accumulatortype; }; static constexpr combination combinations[] = { @@ -584,7 +584,7 @@ size_t NDRangeN = N / myparams::defaultN; // device code: the matrices are constructed using the default dimensions myparams::joint_matrix_a sub_a(sg); myparams::joint_matrix_b sub_b(sg); 
-myparams::joint_matrix_c sub_c(sg); +myparams::joint_matrix_accumulator sub_c(sg); ``` From e8e652c9e76c0755195379008603bd45e5f9ee10 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 6 Oct 2022 11:06:18 -0500 Subject: [PATCH 15/27] more typos --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index ec0ae98973395..9d702e4d297bb 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -464,7 +464,7 @@ struct tpu_params Date: Thu, 6 Oct 2022 13:50:43 -0500 Subject: [PATCH 16/27] Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K --- .../sycl_ext_oneapi_matrix.asciidoc | 55 +++++++++---------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 9d702e4d297bb..ec896c6ee5651 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -366,9 +366,9 @@ The table below provides a description for each of the member variables and type |`type_a`| validation, default values|type alias for the type of matrix A |`type_b`| validation, default values|type alias for the type of matrix B |`type_accumulator`| validation, default values|type alias for the type of matrix accumulator -|`defaultM`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this 
corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation -|`defaultN`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation -|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation +|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation |`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A |`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B |`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator @@ -390,21 +390,21 @@ The table below provides a description for each of the member variables and type ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template +template struct tpu_params; // Validation form: Valid or not // Specialization when both types and sizes are given -template +template struct tpu_params< - tpu::amx, Ta, Tb, Tc, M, N, K, + tpu::amx, Ta, Tb, Tc, sM, sN, sK, typename std::enable_if<( !std::is_same_v && !std::is_same_v && - !std::is_same_v && M != 0 && N != 0 && K != 0)>::type> { + !std::is_same_v && sM != 0 && sN != 0 && sK != 0)>::type> { // Validate that parameters are supported static_assert( - (M == 0 && N == 0 && K == 0) || - (is_combination_valid_amx(M, N, K)), + (sM == 0 && sN == 0 && sK == 0) || + (is_combination_valid_amx(sM, sN, sK)), "Invalid parameters for Intel AMX, query valid types and maximum sizes " "using: " "tpu_params myparams; and then check out myparams.combinations array"); @@ -416,10 +416,10 @@ struct tpu_params< // if combination is valid, construct the matrices - static constexpr std::size_t defaultM = (M != 0) ? M : 16; - static constexpr std::size_t defaultN = (N != 0) ? N : 16; - static constexpr std::size_t defaultK = - (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32); + static constexpr std::size_t M = (sM != 0) ? sM : 16; + static constexpr std::size_t N = (sN != 0) ? sN : 16; + static constexpr std::size_t K = + (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 
64 : 32); template using joint_matrix_a = joint_matrix; @@ -445,7 +445,7 @@ struct tpu_params< }; // In this case, the combinations array contains only the combination that the user provided static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 64 : 32, M, N, K}}; + {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}}; static constexpr int num_combinations = sizeof(combinations) / sizeof(combination); }; @@ -467,16 +467,16 @@ struct tpu_params - using joint_matrix_a = joint_matrix; + using joint_matrix_a = joint_matrix; template - using joint_matrix_b = joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_accumulator = joint_matrix; static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes @@ -503,8 +503,8 @@ struct tpu_params -struct tpu_params { +template +struct tpu_params { static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes static constexpr uint32_t numtiles = 8; @@ -570,17 +570,17 @@ enum class scope_t { // User can provide sizes besides the types and tpu_params can assert if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M using myparams = tpu_params; -size_t NDRangeM = M / myparams::defaultM; //Assertion would happen at this line -size_t NDRangeN = N / myparams::defaultN; +size_t NDRangeM = M / myparams::M; //Assertion would happen at this line +size_t NDRangeN = N / myparams::N; ``` === Default Values Example: ```c++ using myparams = tpu_params_both; // use this to construct the ranges on the host side -size_t NDRangeM = M / myparams::defaultM; -size_t NDRangeN = N / myparams::defaultN; -//if M,N,K do not multiply the default sizes, padding has to be done +size_t NDRangeM = M / myparams::M; +size_t NDRangeN = N / myparams::N; +//if M, N, 
K do not multiply the default sizes, padding has to be done // device code: the matrices are constructed using the default dimensions myparams::joint_matrix_a sub_a(sg); myparams::joint_matrix_b sub_b(sg); @@ -645,7 +645,6 @@ for (int i = 0; i < length; ++i) { ## TODO List - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. -- Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K - Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups - Add a more realistic and complete example that shows the value of the general query From 921df9e142c6615c24c2f25b7411856c08d2ba72 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Tue, 11 Oct 2022 10:15:19 -0500 Subject: [PATCH 17/27] remove layout::invalid as this was suggested for the implementation not the API --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index ec896c6ee5651..87dc436bb7739 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -108,8 +108,7 @@ enum class layout { row_major, col_major, packed, - dynamic, - invalid + dynamic }; } ``` From 95f06ba521ebcef6fff8f8a844e4c0832daea230 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Tue, 11 Oct 2022 10:18:24 -0500 Subject: [PATCH 18/27] Update sycl_ext_oneapi_matrix.asciidoc --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) 
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 87dc436bb7739..a2ca7c1a908fe 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -73,8 +73,7 @@ We introduce a new class called `joint_matrix`. The user needs to specify the ty namespace sycl::ext::oneapi::experimental::matrix { template + layout Layout, typename Group = sub_group> struct joint_matrix { joint_matrix(Group g) {} }; From 0cb74626d688f6d26817304f09439000ef265f76 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Tue, 11 Oct 2022 10:22:42 -0500 Subject: [PATCH 19/27] put back layout::dynamic as the default, removed it by accident --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index a2ca7c1a908fe..d72968f803574 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -73,7 +73,7 @@ We introduce a new class called `joint_matrix`. 
The user needs to specify the ty namespace sycl::ext::oneapi::experimental::matrix { template + layout Layout = layout::dynamic, typename Group = sub_group> struct joint_matrix { joint_matrix(Group g) {} }; @@ -147,7 +147,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template void joint_matrix_load(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); template Date: Tue, 11 Oct 2022 11:22:48 -0500 Subject: [PATCH 20/27] Incorporate Jack's review comments --- .../sycl_ext_oneapi_matrix.asciidoc | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index d72968f803574..1a8be449edde9 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -20,20 +20,20 @@ == Notice -Copyright (c) 2021-2021 Intel Corporation. All rights reserved. +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos. -This extension is written against the SYCL 2020 revision 3 specification. All +This extension is written against the SYCL 2020 revision 5 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. 
The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS_ +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS and Nvidia(R) Tensor Cores._ ## Introduction This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. @@ -271,7 +271,7 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. -IMPORTANT: In the Tensore Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend. +IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend. ## VNNI/Packed Layout Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. @@ -641,6 +641,8 @@ for (int i = 0; i < length; ++i) { - In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. 
Should this return a view object to make sure the original matrix C is changed after its slices are modified. +- `dynamic_extent` on the shape of `joint_matrix` is only available on Intel AMX. Should this be part of the API? + ## TODO List - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. - Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups From c5808b4959a6f6e8396c71b8125dc903623e0284 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Fri, 14 Oct 2022 10:42:16 -0500 Subject: [PATCH 21/27] correct joint_matrix_mad signature --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 1a8be449edde9..65eddf5698f73 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -181,7 +181,7 @@ This function stores the data from the 2d tiles back to memory. 
namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, joint_matrix B, joint_matrix C); From e05ade586d6f45b8c1f61045162eaaea14bb2a3a Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 17 Oct 2022 12:55:22 -0500 Subject: [PATCH 22/27] Add IsDecorated to multi_ptr type --- .../sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc index 71affb2cbb255..8f7115b59f5c8 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc @@ -132,7 +132,7 @@ namespace sycl::ext::oneapi::experimental::matrix { matrix_layout Layout, access::address_space Space> void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout MemLayout); + multi_ptr src, size_t stride, matrix_layout MemLayout); } ``` This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. @@ -145,7 +145,7 @@ namespace sycl::ext::oneapi::experimental::matrix { matrix_layout L, access::address_space Space> void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout memL); + multi_ptr src, size_t stride, matrix_layout memL); } ``` This function stores the data from the 2d tiles back to memory. 
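Patch 21 above corrects the `joint_matrix_mad` signature so the result matrix carries `use::accumulator`. As a reviewer's aid, the reference semantics of the operation — D = A × B + C on one M×K by K×N tile — can be sketched in plain C++ with no SYCL runtime; the tile sizes, element types, and the helper name `tile_mad` are illustrative only and not part of the extension:

```c++
#include <array>
#include <cstddef>

// Reference semantics of joint_matrix_mad on one tile:
// D[i][j] = C[i][j] + sum_k A[i][k] * B[k][j].
// Sizes and element types below are illustrative, not hardware limits.
template <typename Ta, typename Tb, typename Tc,
          std::size_t M, std::size_t K, std::size_t N>
std::array<std::array<Tc, N>, M>
tile_mad(const std::array<std::array<Ta, K>, M> &A,
         const std::array<std::array<Tb, N>, K> &B,
         const std::array<std::array<Tc, N>, M> &C) {
  std::array<std::array<Tc, N>, M> D = C; // accumulate on top of C
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t k = 0; k < K; ++k)
      for (std::size_t j = 0; j < N; ++j)
        D[i][j] += static_cast<Tc>(A[i][k]) * static_cast<Tc>(B[k][j]);
  return D;
}
```

In the real API the same computation is distributed across the work items of a subgroup; this scalar version only pins down what result the hardware must produce.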
From b1568607aef2bec711d53441ba9c8192d4bcbc1b Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 17 Oct 2022 13:31:41 -0500 Subject: [PATCH 23/27] Add IsDecorated new argument to multi_ptr type --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 65eddf5698f73..38484431ccb1a 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -148,13 +148,13 @@ namespace sycl::ext::oneapi::experimental::matrix { use Use, access::address_space Space> void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, layout memL); + multi_ptr src, size_t stride, layout memL); template void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } ``` @@ -170,7 +170,7 @@ namespace sycl::ext::oneapi::experimental::matrix { use Use, layout Layout, access::address_space Space> void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, layout memL); + multi_ptr src, size_t stride, layout memL); } ``` This function stores the data from the 2d tiles back to memory. 
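The spec's VNNI/Packed Layout section states that Intel AMX and DPAS need 32 bits of K-data for B to be contiguous in memory. For 8-bit elements that means grouping 4 consecutive K rows of each column. A host-side sketch of that repacking may help reviewers visualize the layout; the function name `pack_vnni` and its signature are illustrative, not part of the extension API:

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack a row-major K x N matrix of 8-bit elements into VNNI layout:
// 4 consecutive K-rows of each column become contiguous, so one 32-bit
// load yields 4 K-values of a column. Illustrative helper, not spec API.
std::vector<std::int8_t> pack_vnni(const std::vector<std::int8_t> &b,
                                   std::size_t K, std::size_t N) {
  constexpr std::size_t VNNI_BLK = 4; // 4 x 8-bit = 32 bits of K-data
  std::vector<std::int8_t> packed(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      // destination row k/4 holds columns widened by a factor of 4
      packed[(k / VNNI_BLK) * (N * VNNI_BLK) + n * VNNI_BLK + (k % VNNI_BLK)] =
          b[k * N + n];
  return packed;
}
```

For 16-bit elements the same scheme applies with a block of 2 rows (2 × 16-bit = 32 bits), which matches the spec's `packed` layout description.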
From ebe5d8d1703d60c4935d42a26f9c0a6ad6446089 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 21 Oct 2022 12:12:11 -0700 Subject: [PATCH 24/27] move deprecated extension to the deprecated folder based on Pavel's suggestion --- .../sycl_ext_oneapi_matrix_no_use.asciidoc} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename sycl/doc/extensions/{experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc => deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc} (100%) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc b/sycl/doc/extensions/deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc similarity index 100% rename from sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc rename to sycl/doc/extensions/deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc From 9ce72731c587a25388b665d0ece9f52af88a377b Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 26 Oct 2022 15:00:07 -0500 Subject: [PATCH 25/27] Add an open question about how to deal with non portable scenarios in this document --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 38484431ccb1a..f9f34ddf89648 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -642,6 +642,7 @@ for (int i = 0; i < length; ++i) { - In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified. 
- `dynamic_extent` on the shape of `joint_matrix` is only available on Intel AMX. Should this be part of the API? +- This document still contains non-portable code between Intel AMX and DPAS, and Nvidia Tensor Cores such as: packed layout, dynamic layout on joint matrix type of use A and B, store of matrix A and B. Currently, these restrictions are explained in the spec text. But we might decide to move these to a separate Intel-specific additional matrix API document. ## TODO List - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. From bc1c2cb6895cd0c86f4a54fb6fcdc27cfc5d1a79 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 27 Oct 2022 11:31:09 -0500 Subject: [PATCH 26/27] restrict layout, load and store to the portable API --- .../sycl_ext_oneapi_matrix.asciidoc | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index f9f34ddf89648..6980c38dc9d5b 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -61,8 +61,8 @@ While this document presents the core API that unifies Intel AMX, DPAS, and Nvid [frame="none",options="header"] |====================== |Value |Description -|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The exact API used for this implementation is detailed in [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc) -|2 |joint matrix type has a new `use` parameter. `layout` on matrix is optional for `C` matrix. 
JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported +|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The old API used for this implementation is detailed in [matrix extension](doc/extensions/deprecated/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc) +|2 |JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported |3 |Implementation on Nvidia Tensor Cores |====================== @@ -80,6 +80,8 @@ struct joint_matrix { } ``` +IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` + #### Shape The same class, `joint_matrix`, should handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. @@ -112,7 +114,6 @@ enum class layout { } ``` -IMPORTANT: In both AMX and DPAS support, the `layout` template parameter can be `dynamic` for Matrix A and B as well. #### Memory Scope In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. @@ -145,9 +146,9 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { template + access::address_space Space> void joint_matrix_load(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); template + access::address_space Space> void joint_matrix_store(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride, layout memL); } ``` @@ -642,7 +643,7 @@ for (int i = 0; i < length; ++i) { - In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified. - `dynamic_extent` on the shape of `joint_matrix` is only available on Intel AMX. Should this be part of the API? -- This document still contains non-portable code between Intel AMX and DPAS, and Nvidia Tensor Cores such as: packed layout, dynamic layout on joint matrix type of use A and B, store of matrix A and B. Currently, these restrictions are explained in the spec text. But we might decide to move these to a separate Intel-specific additional matrix API document. +- This document still contains non-portable code between Intel AMX and DPAS, and Nvidia Tensor Cores such as: packed layout and dynamic_extent. Currently, these restrictions are explained in the spec text. But we might decide to move these to a separate Intel-specific additional matrix API document. ## TODO List - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. 
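Patches 14–16 above rename the query's accumulator aliases and default-size members, and the validation form of `tpu_params` rejects unsupported (M, N, K) shapes at compile time by consulting the `combinations` array. A simplified, self-contained sketch of that mechanism is below; the shape limits in the table are illustrative placeholders, not the actual AMX or DPAS limits, and `my_params` is a hypothetical stand-in for `tpu_params`:

```c++
#include <cstddef>

// Sketch of the query's validation mode: a constexpr table of supported
// maximum (M, N, K) combinations, scanned at compile time.
// The sizes below are illustrative, not real hardware limits.
struct combination {
  std::size_t max_msize, max_nsize, max_ksize;
};

inline constexpr combination combinations[] = {
    {16, 16, 64}, // e.g. 8-bit element shapes (illustrative)
    {16, 16, 32}, // e.g. 16-bit element shapes (illustrative)
};

constexpr bool is_combination_valid(std::size_t M, std::size_t N,
                                    std::size_t K) {
  for (const auto &c : combinations)
    if (M <= c.max_msize && N <= c.max_nsize && K <= c.max_ksize)
      return true;
  return false;
}

// Usage mirrors the tpu_params validation form: an unsupported shape
// fails to compile instead of failing at run time.
template <std::size_t sM, std::size_t sN, std::size_t sK>
struct my_params {
  static_assert(is_combination_valid(sM, sN, sK),
                "Invalid parameters; check the combinations array");
  static constexpr std::size_t M = sM, N = sN, K = sK;
};
```

This also shows why the rename in patch 16 forces the `sM`/`sN`/`sK` template-parameter spelling: the member names `M`, `N`, `K` would otherwise shadow the template parameters.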
From f8a4587e6cfe6a34d6a7a2b3d467e43fc20bc9e0 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 27 Oct 2022 13:49:03 -0500 Subject: [PATCH 27/27] clarification on the store in the text --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 6980c38dc9d5b..5d23485d79cfc 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -174,7 +174,7 @@ namespace sycl::ext::oneapi::experimental::matrix { multi_ptr src, size_t stride, layout memL); } ``` -This function stores the data from the 2d tiles back to memory. +This function stores the data in the accumulator matrix from the 2d tiles back to memory. #### Multiply and Add
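The Default Values Example earlier in the spec computes `NDRangeM = M / myparams::M` and notes that padding is required when M, N, K are not multiples of the tile sizes. A hedged sketch of that host-side range setup with an explicit round-up is below; the tile sizes and the `nd_range_2d` stand-in type are illustrative, not spec API:

```c++
#include <cstddef>

// Round a problem dimension up to a whole number of tiles, as the spec's
// default-values example requires when M, N, K are not exact multiples.
constexpr std::size_t round_up(std::size_t dim, std::size_t tile) {
  return (dim + tile - 1) / tile; // tile count, padding included
}

struct nd_range_2d { // illustrative stand-in for the host-side range
  std::size_t m_tiles, n_tiles;
};

// tileM/tileN play the role of myparams::M / myparams::N from the query.
constexpr nd_range_2d make_range(std::size_t M, std::size_t N,
                                 std::size_t tileM, std::size_t tileN) {
  return {round_up(M, tileM), round_up(N, tileN)};
}
```

The extra (padded) tiles then read or write a zero-filled border region, which plain integer division would silently drop.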