Inconsistent GEMM Tuner Execution Times #428
The first thing I notice from your tuning results is that there seem to be two groups: those around 5ms, and those around 0.6ms. My guess is those resulting in 0.6ms are producing incorrect results.
How exactly did you do this? Perhaps you can share a
Upon further investigation, the individual execution times I am seeing for the tuner's reported best configuration (on an Intel KNL, measured in my script) seem strange. I had previously been reporting the average of three executions, but decided to print out each individual execution. I performed 5 trials, each with an initial un-timed execution (to allow for the compilation time following a call to OverrideParameters(...)), followed by three measured iterations. The task dimensions were m=32, n=256, k=32.
Trial 1: Trial 2: Trial 3: Trial 4: Trial 5:

In all trials but trial 3, the configuration reports an execution time below 1ms (although still noticeably larger than the 0.45ms the tuner reported), yet the variance of the execution times within each trial is huge, skewing the average that I had been collecting. The fact that some iterations take 20+ ms doesn't make sense to me, especially as it can't be localized to a specific iteration (in trial 5 the 3rd iteration takes 22ms, whereas in trial 3 the 1st iteration takes 25ms). I don't think this can be attributed to noise; variation in execution times of this magnitude would render autotuning largely worthless.

This excessive runtime variation across iterations of the same task and configuration is not unique to the task (m=32,n=256,k=32). I used the tuner to tune ~500 tasks and evaluated 729 tasks for m,n,k={8,16,32,64,128,256,512,1024,2048}, and observed the same phenomenon for all of the smaller tasks. The larger tasks incurred little variation relative to their execution times, which makes sense. As an example, the smallest task (m=8,n=8,k=8) executed 7 times in succession yields the following 7 execution times. The last is ~20x larger than the 6th:
I cannot fathom what might be occurring that would cause this high variation. As far as I know, I am running with a single-threaded process, so one core of KNL. I am not setting any thread-related environment variables.

For clarity, single-threaded MKL achieves an average execution time (across 3 iterations, for the same task dimensions m=32,n=256,k=32) of 0.12ms, so about 3-4x smaller than that reported by CLBlast's tuner found from its 54-configuration search (see initial post). You mentioned above "My guess is those resulting in 0.6ms are producing incorrect results", yet CLBlast's reported 0.45ms execution time seems reasonable when compared against MKL. Further, none of the tested configurations hit an assert in my script or failed to match the reference in the tuner. Thus I don't believe it is an issue of incorrectness caused by a line I modified.

I have provided a subset of my script that yielded the above results, which admittedly was largely taken from the sample code for sgemm.cpp:
Other investigations:
Thanks for sharing so many details, but did you see my message? My hunch is still that you incorrectly modified the tuner and are actually running with configurations that are invalid for your matrix sizes. Invalid configurations can cause undefined behaviour, which could explain everything (including the variation over runs). If that doesn't help, I would suggest reverting your changes and trying the tuner with supported matrix sizes. Then do your measurements again and make sure everything is fine (expected speed, variance is gone, etc.). BTW, there is a reason these constraints are in place (even though you can work around them if you do it carefully): for smaller sizes the direct GEMM kernel might be faster to use anyway.
Ok thanks. I switched to using XgemmDirect as the kernel. I rebuilt CLBLAST from scratch with no modifications to the source code on my part, but I am still seeing the following behavior:
This is my workflow following the cmake build:
This is the relevant part of my script:
As an example of issue 2) above, the following is the tuner's output for tuning task m=1024,n=1024,k=1024:
Yet, 4 successive calls (following a warmup as shown in code snippet above) for the same task and the same best parameters give the following execution times:
There is little variation among execution times, which makes sense considering it is a large task, but the average execution time is over 100ms, which is >5x larger than what the tuner reported for the same configuration. As an example of issue 1) above, I performed the same test for task m=64,n=64,k=64.
Yet, my script achieves the following four execution times:
There is significant variation among these four, and further the minimum is 1ms, which is >2x larger than that reported by the tuner for the same configuration. A further question I had:
Thanks again for sharing so much information. I would suggest tackling the issues one at a time. Let's start with the large one, m=n=k=1024, since that is a common situation and also large enough to hide any small overheads. Indeed, the time measured by the tuner should be close to the time that you measure with your script. There are a few things I would suggest trying first:
Then once that is solved let us continue with the other smaller kernels and the variation there. BTW, I see in your very first post above that the tuner reports 0.1 GFLOPS, which is probably nothing compared to your peak performance. So that already shows the numbers are not to be trusted too much: any small overhead will contribute a lot in these situations.
Ok. I had been tuning with the I have posted both the client's output and the tuner's output for the m=1024,n=1024,k=1024 task below. First, the tuner reports an execution time of 18.31ms, which is >2x smaller than the mean execution time achieved by the client across 20 iterations. My script is still achieving a mean execution time of ~110ms. So clearly there is a discrepancy between the tuner's output and the client's output. Further, as the min reported by the client is ~30ms and the max is ~66ms with a mean of ~40ms, that still seems like significant variation, especially compared against CBLAS.
I know you advised to focus on one issue at a time, but I tried a similar experiment with GEMV and the results point to what seems to be the same issue: the client and the tuner's outputs do not seem to match, and high variation is observed. The client's output below references the json file also below.
So, the current problem doesn't seem to be unique to GEMM, but rather related to how CLBlast operates with the installed OpenCL library?
Hmm, OK. If the same behaviour also happens with GEMV, then that is even easier to test: no direct vs non-direct and shorter tuning. However, here we need to be careful as well: I believe that in case of 1024x1024 it will use the 'fast' GEMV kernel, and your tuning results that you share here are for the regular more generic GEMV kernel. See here in the code for the exact details. Perhaps the issue is related to POCL, so I also installed that on my system and used my CPU with POCL. Here are the results I get when running the client:
This 0.2 matches roughly what I get with the tuner as well. However there is indeed quite some variation. Keep in mind that this is on a laptop CPU which is already running a desktop environment. If I run the same with my GPU through the Intel OpenCL driver, I get a bit less variation but still some:
However, none of these seem as bad as your variations. I'll take some more time this weekend to have a more detailed look at the issue.
Sorry for the late reply, I'm a bit busy these days. And it also has been a while since I dug into CLBlast. I think what you should do is set the It will also tell you the exact tuning parameters it used. For example for the in-direct version of GEMM, depending on your transpose/layout options it needs to run pre/post processing kernels. And it shows the full list of parameters used at this moment, and all the kernels that are running with their execution times:
or another (GEMV in this case) example running the
I believe that if you use this feature you should be able to get to the bottom of this: are you measuring the right thing? |
I also noticed tuner results not being very consistent. My guess is that the tests are way too short, so things like other load on the GPU (i.e. video playback) or boost frequency handling can mess things up. Longer tests and multiple repetitions should be used instead. Some of the tests are in the microseconds range, which is hard to time accurately and easily susceptible to noise. Example:
Can you set the And of course it is better to test on a machine that doesn't run anything else, e.g. a headless machine or in Linux without a graphical environment. |
Hi, I am trying to find the best set of 16 tuning parameters for a particular GEMM task: m=32,n=256,k=32, on an Intel KNL machine.
I have tuned GEMM for this particular task using a single process via the built-in tuners, and the first stage outputs the configuration that achieved the best performance. I manually disabled the three subsequent stages and modified src/tuning/kernels/xgemm.hpp to allow tuning problems with matrix dimensions less than 64. Here is that output:
However, when I run my script, also on a single process, which manually overrides the tuning parameters via the OverrideParameters(...) API call, the tuner's best configuration yields an execution time that is more than 10x slower than what was reported during the offline tuning. Specifically, these "best parameters" give an average execution time of 16ms across 5 iterations, and I do not time the initial iteration, to allow for the compilation overhead as mentioned in the docs. I believe I call GEMM in the same way that the tuner calls it (RowMajor, NoTranspose for either input matrix, non-zero scalars alpha and beta).
I further investigated this by exhaustively searching over all 1000+ legal configurations within my script and found that the best configuration yielded an execution time similar to that found in the offline tuning stage (0.55 ms), but that was attained by a completely different configuration.
I have tried repeating this debugging process for other matrix dimensions beyond m=32,n=256,k=32 and have found similar discrepancies. I do not believe it is an error in the manual translation from json file to my script, as I have checked over it numerous times.
Can you advise what step, if any, I am missing? Perhaps I am making false assumptions about the offline built-in tuner that CLBLAST provides? I'd like to leverage the offline tuner for a number of tasks and then manually take the best-found configuration for each into my script via the OverrideParameters(...) routine.