-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][Matrix] Add get-coord API and general query example #7964
Conversation
dkhaldi
commented
Jan 9, 2023
- Remove the general query from TODO list as an example is added to the llvm-test-suite ([SYCL][Matrix] Add a more general query example llvm-test-suite#1492)
- Add get coord API and remove it from TODO list
- Remove the local memory future API looking as it is no more relevant
…he llvm-test-suite - Add get coord API and remove it from TODO list - Remove the local memory future API looking as it is no more relevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few fixes: markdown linter issues and one typo.
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
Looks pretty good to me. For the Query interface, I think it would be good to try to get some community feedback if possible. I suppose the argument for the general query is that with several backends, it could be easier to ask the API for the set of valid combinations rather than search for the documentation. This is fair. Although I think we should still make an effort to make documentation of supported sizes/types for different backends as accessible and clear as possible; so that people are not forced to use the general query when they may prefer just looking at the docs. At the moment the documentation for supported types is in e.g. sycl_ext_intel_matrix doc. For the Nvidia case a current slight problem is that we don't actually have any Nvidia only features at the moment, so it is a bit of a misnomer to have a e.g. sycl_ext_cuda_matrix doc similar to what I have here (https://github.com/intel/llvm/pull/6968/files) which currently only lists the currently supported values of sycl_ext_oneapi_matrix APIs in the ext_oneapi_cuda backend. In the future even if we do add the cuda only matrix features there can be other backends that encounter the situation where they need to document supported values of sycl_ext_oneapi_matrix APIs in that backend but don't have a backend specific matrix features extension. I thought there could be two better options. a) I can rename sycl_ext_oneapi_matrix_cuda.asciidoc sycl_ext_oneapi_matrix_cuda_supported_vals.asciidoc or similar, remove all the dpc++ extension boilerplate docs from that doc, just leaving the supported value information. Then move the "Supported Combinations Per Hardware" section in sycl_ext_intel_matrix.asciidoc to a similar file like sycl_ext_oneapi_matrix_intel_supported_vals.asciidoc. OR b) we just move the "Supported Combinations Per Hardware" section for all backends to the main sycl_ext_oneapi_matrix.asciidoc doc and I just delete this file completely: https://github.com/intel/llvm/blob/e50a2f5f97acb12db1de78c9ad739b931c77b03f/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_cuda_matrix.asciidoc. What do you think? cc @gmlueck also. |
If we want to document all the matrix constraints for each device, I think it probably makes sense to document them all in a single table (i.e. in the same document). For now, this could be a non-normative appendix in the main "sycl_ext_oneapi_matrix.asciidoc" document. If the matrix API is eventually adopted into the core SYCL language (and the extension goes away), we will need to find some other place to list these constraints, but we can worry about this later. |
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
Sounds good to me. |
You should document what your implementation is supporting, not what Nvidia hardware supports. In the joint matrix code in the CUDA backend, there are very specific combinations that are allowed, this is what should be documented and returned by this query. It is worth mentioning that what we specify in the documentation and the query is not what the hardware supports (note that the XMX sizes are disclosed information). We document what the implementation can do in an optimal way. You can refer to them as logical sizes rather than hardware sizes. In all cases, performance kernels should care about the maximum load it can do at a time not about the matrix hardware mad instruction. Then, reuse that in an optimal way and feet it to mad instruction. A specific use case appears in one of our performance kernels: a SG should do more than one DPAS instruction to get optimal results. In most cases, especially when matrix sizes are large, the optimal size MxN is 32x64 on PVC, so instead of the user having to fully unroll 32x64 loop and then create multiple joint_matrix_mad operations, the implementation can provide such combination, document it in the document and in the query. In this case, the user will have one iteration in the SG to worry about.
Having all the combinations per backend (AMX, XMX8, XMX16, different SM versions for Nvidia) in the main document is fine, especially that the query interface is in the main document. So the combinations will complement the query API so the user knows what to expect when they use the query interface. |
- Put all combinations in appendix - move get_coord to the main document - Correct the example by converting USM pointers to multi_ptr
Yes the table here, https://github.com/intel/llvm/blob/e50a2f5f97acb12db1de78c9ad739b931c77b03f/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_cuda_matrix.asciidoc#valid-joint_matrix-types-and-shapes, is up to date with what the implementation is supporting. I can add it to the Appendix following what you did here in a subsequent PR. Or if you prefer to add it directly in this PR, feel free.
I see what you mean by logical vs hardware sizes. The initial sycl-blas commit on joint_matrix also has some relevance to your point I think: https://github.com/codeplaysoftware/sycl-blas . BTW the initial sycl-dnn joint_matrix accelerated commit will follow shortly (it is quite a bit larger so review takes a while). |
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
|
||
While this document presents the core API that unifies Intel AMX, | ||
Intel XMX, and Nvidia Tensor Cores, the implementations support | ||
slightly different versions of the API. For this reason, we introduce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The situation is different now from what this paragraph states; because this document is specifically for the unified matrix interfaces which are portable. So I think it is best to replace this paragraph completely with e.g. the standard template for feature macro versioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JackAKirk, the standard template for feature macro versioning if not for experimental feature.
Once this moves out of experimental and becomes supported, this whole section "Matrix API versions" and SYCL_EXT_ONEAPI_MATRIX_VERSION macro will be removed. We won't need to keep the legacy API and tests. Right now, we only keep them to ensure current users have something working while we guide them through all these changes until we have something final (hopefully in this PR).
Do you suggest I remove this now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just think that anyone reading this now as documentation on joint_matrix
will be thrown by this (wrong) statement
"the implementations support
slightly different versions of the API"
And since this does seem to be the main place that people will arrive at for joint_matrix
documentation currently, it makes sense to address this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove it especially that we made value 4 as the default already.
@@ -79,10 +132,13 @@ struct joint_matrix { | |||
} | |||
``` | |||
|
|||
IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` | |||
IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only | |||
to matrix with `use::accumulator` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to matrix with `use::accumulator` | |
to `joint_matrix` with `use::accumulator` |
#### Use | ||
Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. | ||
==== Use | ||
Specifying the usage of the matrix: matrix left (A), matrix right (B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matrix left
and matrix right
are not defined (or equivalently A / B aren't defined).
} // namespace sycl::ext::oneapi::experimental::matrix | ||
``` | ||
This function copies `Rows x Cols` elements of type `T` from joint | ||
matrix `src` to joint matrix `dest`. The two matrcies must have the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matrix `src` to joint matrix `dest`. The two matrcies must have the | |
matrix `src` to joint matrix `dest`. The two matrices must have the |
the user whether a specific combination is valid or not. This takes | ||
place when the user specifies all template parameters. | ||
|
||
- Default values: this provides a default shape if the user does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make the implementation choose default values with portability in mind? e.g return a value for XMX that matches the default value for AMX (If I remember correctly there is a unique case satisfying this?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently no such case that satisfies XMX of DG2 and XMX of PVC. But it can be added. Currently the default is the max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sycl/ReleaseNotes.md changes look good to me.
Hi @dkhaldi. I just wanted to let you know that I have 4 unresolved comments above. I'm not pushing to resolve them faster, but I wanted to make sure you weren't waiting for me to do something. Some of the comments are hidden and you need to click "Load more" to see them. They are:
|
Hi @gmlueck, I just fixed the 4 unresolved comments and added clarifications for |
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
Outdated
Show resolved
Hide resolved
@intel/llvm-gatekeepers can you please merge? |
…ge signature of joint_matrix_mad (#11007) @gmlueck, in #7964, we explicitly deleted the copy ctor and assign op because we added `joint_matrix_copy` that is used to actually copy matrix data. However, when deleting them in implementation, failures occur as `joint_matrix_mad` uses them. This PR proposes to change the signature of `joint_matrix_mad` so these ctors are not used.
experimental namespace As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of intel#7964. This results in the following changes in the syntax: 1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU, Now, we make D as an input argument to mad. 2- Change “packed” to ext_intel_packed: 3- Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace) 4- add const to joint_matrix in store and mad 5 - add joint_matrix_copy/assignment function 6- add apply with coordination (change existing tests) 7- change get_coord vector type from int32_t to size_t 8- delete explicitly both = and copy ctor.
…ix from experimental namespace (#11215) As part of the effort to move joint matrix from experimental namespace to supported. A review of the API is being done as part of #7964. This results in the following changes in the syntax: 1- Add Td to joint_matrix_mad as Tc can be different from Td on the GPU, Now, we make D as an input argument to mad. 2- Change “packed” to ext_intel_packed: 3- Move EWOps (get_wi_data, wi_element, get_coord) to detail namespace) 4- add const to joint_matrix in store and mad 5 - add joint_matrix_copy/assignment function 6- add apply with coordination (change existing tests) 7- change get_coord vector type from int32_t to size_t 8- delete explicitly both = and copy ctor.