Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Matrix] Add get-coord API and general query example #7964

Merged
merged 53 commits into from
Aug 28, 2023
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
53 commits
Select commit Hold shift + click to select a range
9628bd0
- Remove the general query from TODO list as an example is added to t…
dkhaldi Jan 9, 2023
39875df
add an other distribution example
dkhaldi Jan 9, 2023
e42ef4a
add revision history
dkhaldi Jan 10, 2023
8bb98c1
Bader comments
dkhaldi Jan 10, 2023
48386d6
better wording
dkhaldi Jan 10, 2023
1e85155
Incorporate Greg comments and other improvements, specifically:
dkhaldi Jan 12, 2023
6f91525
Update the specification document to follow the formal template
dkhaldi Jan 30, 2023
cdcab5a
add tf32 type and conversion function
dkhaldi Jan 30, 2023
04e18fe
correct the matrix types in the appendix
dkhaldi Jan 30, 2023
9403a38
correct the matrix types in the appendix
dkhaldi Jan 30, 2023
ddb87f1
remove _t from the types
dkhaldi Jan 30, 2023
8a8e0a9
Specify in Status that joint matrix is an optional kernel feature
dkhaldi Feb 4, 2023
7e610aa
Move the iteration-style EWOps to the Intel extension and introduce j…
dkhaldi Feb 9, 2023
509056c
Address Jack's comments
dkhaldi Feb 10, 2023
805630c
Add get_info runtime query
dkhaldi Feb 13, 2023
20c09c9
reword the optional device feature checking
dkhaldi Feb 14, 2023
a7494c8
Address Greg's comments
dkhaldi Feb 28, 2023
7159591
Incorporate the last batch of Greg's comments
dkhaldi Feb 28, 2023
5b9fdfc
incorporate Greg's comments: query syntax
dkhaldi Mar 2, 2023
e0f683e
use sycl::ext::oneapi::experimental::architecture and remove scope query
dkhaldi Mar 2, 2023
008dbfc
fix the comments formatting
dkhaldi Mar 2, 2023
efb103a
- Add overloads and explanation for each of the API in the tf32 section
dkhaldi Mar 6, 2023
e69ff85
typo
dkhaldi Mar 6, 2023
6868a37
Address Greg's comments in the Intel extension
dkhaldi Mar 11, 2023
fb70d27
Add overload of joint matrix apply where row and col are provided
dkhaldi Mar 20, 2023
433e65a
Address Greg's comments: change packed name, add tf32 rounding mode, …
dkhaldi Mar 23, 2023
f5694eb
fix formatting
dkhaldi Mar 23, 2023
862880e
Address Greg's comments: remove loop-based indexing, add Td and defau…
dkhaldi Apr 24, 2023
885cf09
Incorporate Greg's suggestions
dkhaldi May 23, 2023
d0a81af
Incorporate Greg's small comments in intel-specific spec
dkhaldi May 23, 2023
cd41588
Rename folder name, add primary definition of matrix_params
dkhaldi May 25, 2023
0bf47c9
Add missing const to multi_ptr
dkhaldi May 25, 2023
15306d6
- Add copy function; - Add clarification about copy constructor and a…
dkhaldi May 30, 2023
bee344e
small typo correction
dkhaldi May 31, 2023
e5648e4
Remove default copy constructor and assign op
dkhaldi Jun 7, 2023
e22d057
fixed merge conflicts without merging and add Jack's Nvidia combinati…
dkhaldi Jun 8, 2023
0b4eecc
Remove the oneapi matrix folder that is replaced here by matrix folde…
dkhaldi Jun 8, 2023
8d80ad6
Add old folder to try to fix conflicts
dkhaldi Jun 9, 2023
1059870
Merge branch 'intel:sycl' into get-coord-doc
dkhaldi Jun 9, 2023
35c8744
remove the old folder that resulted from the merge with sycl branch
dkhaldi Jun 9, 2023
d63bdb8
address Greg's comments: change Nvidia table, minor formatting
dkhaldi Jun 29, 2023
7bfb8e5
corrected two types in the Nvidia table
dkhaldi Jun 29, 2023
08fd2db
address Greg, Jack, and Alexey comments
dkhaldi Jul 28, 2023
d7d0a70
Clarify use of must when referring to the query interface
dkhaldi Jul 31, 2023
bf8e00c
Address Greg's comments: fix 2 broken lines, const multi_ptr, line wrap
dkhaldi Aug 2, 2023
84af291
Add clarifications about joint_matrix_copy
dkhaldi Aug 2, 2023
2c2af7d
Add non const overload to tf32 load as implicit conversion for multi_…
dkhaldi Aug 7, 2023
e8bde89
minor clarification
dkhaldi Aug 9, 2023
a7f92ce
fix width of query table
dkhaldi Aug 23, 2023
789b593
fix the width for the right table
dkhaldi Aug 25, 2023
ee28250
Avoid line breaks in table by using source block
gmlueck Aug 25, 2023
2d80d16
add the conflicted file first in order to resolve the conflict
dkhaldi Aug 28, 2023
901252b
Merge branch 'intel:sycl' into get-coord-doc
dkhaldi Aug 28, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Add get_info runtime query
  • Loading branch information
dkhaldi committed Feb 13, 2023
commit 805630cdbb965baf6c6fa3157fda915f8e539139
Original file line number Diff line number Diff line change
Expand Up @@ -58,11 +58,11 @@ XMX.
== Overview
The Intel backend implementations on both Intel AMX and Intel XMX
gmlueck marked this conversation as resolved.
Show resolved Hide resolved
support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`,
`joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query
interface, as they are defined in the sycl_ext_oneapi_matrix
extension. There are additional specifics about the supported layouts
that enable extra performance and functionality listed in this
document.
`joint_matrix_mad`, `joint_matrix_fill`, `joint_matrix_apply`, and the
query interface, as they are defined in the sycl_ext_oneapi_matrix
extension. Besides element-wise operations with mapping information,
there are additional specifics about the supported layouts that enable
extra performance and functionality listed in this document.
This extension presents some supplementary Intel AMX and Intel XMX
features not contained within the sycl_ext_oneapi_matrix
extension. The additional features are built on top of the
Expand All @@ -75,11 +75,11 @@ AMX and Intel XMX backends.

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must
predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro's
value to determine which of the extension's APIs the implementation
supports.
predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values
defined in the table below.Applications can test for the existence of
gmlueck marked this conversation as resolved.
Show resolved Hide resolved
this macro to determine if the implementation supports this feature,
or applications can test the macro's value to determine which of the
extension's APIs the implementation supports.

[%header,cols="1,5"]
|===
Expand Down Expand Up @@ -213,7 +213,7 @@ order to reason about the matrix view and extract the relevant
piece. However, for element-wise operations where the same operation
is performed on all the elements of the matrix, having all the WIs in
the group apply the operation inside a loop iterating over the
`length` of `wi_data` guarantees the whole matrix element-wise operation.
`length` of `wi_data` guarantees the whole matrix element-wise operation.

Note that `get_wi_data` cannot return a fixed size array length
because the length of the WI portion is a runtime variable for the
Expand Down Expand Up @@ -248,7 +248,7 @@ class wi_element {
wi_element &operator*=(const T &rhs);
wi_element &operator/=(const T &rhs);

std::tuple<size_t, size_t> get_coord();
std::tuple<size_t, size_t> get_coord();
};
}
```
Expand All @@ -257,14 +257,14 @@ In the following example `wi_data_c` is a reference to the WI owned
portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs`
updates the corresponding matrix element in the joint_matrix `matC`.
Vectorization along the sub group dimension will get enabled
automatically to vectorize the contiguous portion of the matrix.
automatically to vectorize the contiguous portion of the matrix.


```c++
auto wi_data_c = get_wi_data(sg, matC);
for (int i = 0; i < wi_data_c.length(); i++)
wi_data_c[i] *= alpha; // Note that the indexing here "i"
is in the vector owned by a WI, not in the matrix C
is in the vector owned by a WI, not in the matrix C
```

IMPORTANT: In the current implementation, only the `sub_group` scope
Expand All @@ -287,7 +287,7 @@ auto data = get_wi_data(sg, tA);
for (int i = 0; i < data.length(); ++i) {
auto [row, col] = data[i].get_coord();
sum_of_local_rows[row] += data[i];
}
}
```

IMPORTANT: `get_coord` is not implemented yet.
Expand All @@ -314,7 +314,7 @@ for a 16-bit type.
// a3, b3, c3, d3
// a4, b4, c4, d4
// ---------------------------------
// The same matrix reformatted in packed layout.
// The same matrix reformatted in packed layout.
// Here, packing of 2 elements is needed to form 32 bits.
// Element a1 is contiguous in memory with element a2, etc.
// ---------------------------------
Expand All @@ -332,7 +332,7 @@ for a 16-bit type.
// a3, b3, c3, d3
// a4, b4, c4, d4
// ---------------------------------
// The same matrix reformatted in packed layout.
// The same matrix reformatted in packed layout.
// Here, packing of 4 elements is needed to form 32 bits.
// Elements a1, a2, a3, a4 are contiguous in memory, etc.
// ---------------------------------
Expand All @@ -348,7 +348,7 @@ range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
int32_t *memC = malloc_shared<int32_t>(M*N, q);
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
[[sycl::reqd_sub_group_size(SG_SIZE)]] {
const auto global_idx = item.get_global_id(0);
const auto global_idy = item.get_global_id(1);
Expand All @@ -366,12 +366,12 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
sg_startx * tM * K + k, K);
joint_matrix_load(sg, tB,
multi_ptr<int8_t, sycl::access::address_space::global_space>(memB) +
k * N*4 + sg_starty/SG_SIZE*tN*4, N*4);
k * N*4 + sg_starty/SG_SIZE*tN*4, N*4);
tC = joint_matrix_mad(sg, tA, tB, tC);
}
auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC);
for (int i = 0; i < wi_data_c.length(); i++)
wi_data_c[i] *= alpha;
wi_data_c[i] *= alpha;
joint_matrix_store(sg, tC,
multi_ptr<int32_t, sycl::access::address_space::global_space>(memC) +
sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
Expand Down
Loading