Tuning for Apple M1 AMX2 coprocessor #3789
Comments
This is the same legal minefield as Dougall Johnson's reverse-engineering work, which was already suggested in JuliaLang/julia#2814 (and appears in the references/links document of corsix), and it does not appear that anybody else (e.g. gcc) is currently basing M1 code on it either. So I am not really keen to touch this.
I don't know about the legal aspects of using undocumented opcodes. From my perspective as a layman, I would guess you are allowed to run any kind of code on a PC you bought. And I think all open source licenses include a legal disclaimer, which should release you from all liability if something goes south. Does anybody have the possibility to ask an expert about this legal question?
You have the Accelerate framework from Apple, which uses the secret co-processor. Not even Xcode clang has any support for those instructions. I don't even see a disassembler anywhere.
The thing is that the Accelerate framework only provides LP64 BLAS/LAPACK, and some software relies on an ILP64 implementation. E.g., Julia (see JuliaLang/LinearAlgebra.jl#869) uses OpenBLAS by default on all platforms but also allows plugging in other BLAS/LAPACK implementations as long as they provide ILP64 routines. This basically prevents the use of vecLib in Julia. On the other hand, if OpenBLAS had the same performance/efficiency as vecLib, there would be no need to switch to the Accelerate framework in the first place. This was also my motivation for posting this idea here, since it would be the most elegant way to resolve this issue IMHO (from a Julia perspective)... However, you seem doubtful about using the AMX extension directly, and I understand your point of view from a legal perspective. Since you mentioned the Accelerate framework, here is an alternative idea: would it be possible (or reasonable, I don't know the structure and paradigms of OpenBLAS) to simply write a wrapper that calls those few BLAS level 3 routines which strongly benefit from AMX code directly from vecLib BLAS, preserving the ILP64 interface and all other OpenBLAS conventions?
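To make the wrapper idea concrete, here is a minimal sketch of what an ILP64-facing entry point forwarding to an LP64 backend could look like. It only illustrates the integer narrowing step. The names `to_lp64`, `lp64_dgemm` and `dgemm_ilp64` are hypothetical and not part of OpenBLAS or vecLib, and `lp64_dgemm` is a plain-Python stand-in for a vendor routine (column-major, no transposes), not an actual call into Accelerate.

```python
# Hypothetical sketch: ILP64-style GEMM wrapper in front of an LP64 backend.
# None of these names exist in OpenBLAS or vecLib; they are for illustration only.

INT32_MAX = 2**31 - 1  # largest value an LP64 BLAS integer argument can hold


def to_lp64(value, name):
    """Narrow a 64-bit BLAS integer argument to the 32-bit range, or fail loudly."""
    if not 0 <= value <= INT32_MAX:
        raise OverflowError(f"{name}={value} does not fit a 32-bit BLAS integer")
    return value


def lp64_dgemm(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Stand-in for an LP64 backend: C = alpha*A*B + beta*C, column-major storage."""
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]


def dgemm_ilp64(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """ILP64-style entry point: check every integer argument, then forward."""
    sizes = {"m": m, "n": n, "k": k, "lda": lda, "ldb": ldb, "ldc": ldc}
    s = {name: to_lp64(value, name) for name, value in sizes.items()}
    lp64_dgemm(s["m"], s["n"], s["k"], alpha, a, s["lda"],
               b, s["ldb"], beta, c, s["ldc"])
```

In this sketch, oversized arguments raise an error before the backend is reached. A real wrapper would presumably fall back to the native OpenBLAS kernel in that case instead of failing, but that is an assumption about how such a wrapper might be designed.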
Currently the AMX co-processor is undocumented and uninstrumented. It is quite legal to tinker, though nothing prevents the OEM from microcoding your toys away.
Sure, the vecLib BLAS wrapper solution would accelerate at least matrix multiplications. But, if I understood correctly, OpenBLAS internally uses optimized kernels to accelerate things such as some LAPACK routines and other operations. Obviously, BLAS wrappers alone would bring no benefit there. Additionally, the BLAS version/behavior might differ in the end, as the vecLib and OpenBLAS implementations might diverge at some point. Thus, I think the only 'clean' solution would be to have optimized kernels in OpenBLAS that benefit from Apple's AMX extension... Recently, I stumbled over Apple's SIMD API, which is also part of the Accelerate framework: https://developer.apple.com/documentation/accelerate/simd.
Actually, the dubiously named co-processor is not mentioned anywhere in that abstraction API, or by Apple. Though you can check for yourself whether a 4x4 * 4x1 product generates any non-disassemblable functions or calls deep in the Accelerate libraries.
I know that there is already a similar discussion about general tuning for Apple M1 chips (see #2814), but I wanted to revive this topic a little bit (focusing on Apple's AMX2 extension). Regarding M1 ARMv8 support, I guess OpenBLAS already works pretty well, and I think you did a very good job on that!
Nevertheless, we know by now that Apple's AMX2 extension (e.g. through vecLib) probably allows even higher performance while simultaneously achieving dramatically improved efficiency. If I understood correctly, the main problem for adopting AMX2 instructions so far has been the missing documentation from Apple. However, even without Apple's help, it seems that in the last two years those instructions have been reverse-engineered (and unofficially documented) nearly completely (at least to my understanding, e.g., see https://github.com/corsix/amx#readme).
Thus, my question: with this knowledge now available, would it be possible to implement appropriate (Apple M1) AMX2 kernels in OpenBLAS so that we can achieve the same or even better performance/efficiency than vecLib?