Once SparseTIR was accepted to ASPLOS 2023, we began constructing the sparsetir-artifact repository for artifact evaluation. Although we already had many benchmarking scripts, putting them together and evaluating everything in a unified manner turned out to be non-trivial. While preparing the artifact, we also found problems with our profiler and bugs in existing implementations. We addressed these issues and standardized the settings for all baselines. This post documents the challenges we faced and the lessons we learned from creating the artifact, in the hope that they will be useful to researchers and engineers working in related fields.
If you previously read our manuscript on arXiv, you may have noticed discrepancies in the reported performance between SparseTIRv3 and the camera-ready version in the ASPLOS proceedings. These differences come from changes in profiling methodology and from different versions of dependent software, both of which we explain in detail below.
The following code snippet is a common method for profiling CUDA kernels, taken from dgSPARSE:

    struct GpuTimer {
      cudaEvent_t startEvent;
      cudaEvent_t stopEvent;

      GpuTimer() {
        cudaEventCreate(&startEvent);
        cudaEventCreate(&stopEvent);
      }

      ~GpuTimer() {
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
      }

      void start() { cudaEventRecord(startEvent, 0); }

      void stop() {
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
      }

      float elapsed_msecs() {
        float elapsed;
        cudaEventElapsedTime(&elapsed, startEvent, stopEvent);
        return elapsed;
      }
    };

    GpuTimer gpu_timer;
    int warmup_iter = 10;
    int repeat_iter = 100;
    for (int iter = 0; iter < warmup_iter + repeat_iter; iter++) {
      if (iter == warmup_iter) {
        gpu_timer.start();
      }
      // your kernel here
      f();
    }
    gpu_timer.stop();
    float kernel_dur_msecs = gpu_timer.elapsed_msecs() / repeat_iter;

This code runs the kernel a specified number of times (warmup_iter) as a warm-up, records a start CUDA event, runs the kernel for the specified number of repeats (repeat_iter), records a stop CUDA event, and divides the elapsed time by the number of repeat iterations to obtain the average kernel duration. Other tools, such as the PyTorch profiler and TVM's time_evaluator function, work in a similar way.
However, NVIDIA GPUs do not flush the L2 cache after each kernel call. If f() uses the same inputs and outputs on every iteration, the data accessed in previous runs may still reside in the L2 cache, which reduces memory latency in the next run and makes the measured time overly optimistic. Such L2 reuse is unlikely to occur in real-world neural network training or inference, where we do not execute the same kernel on the same inputs and outputs multiple times. Moreover, the data accessed by one kernel is soon evicted by the next kernel in the computational graph.
The nvbench framework and Triton are aware of this effect and eliminate it by flushing the L2 cache before each kernel launch in their profilers.
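A minimal sketch of the flush-before-launch pattern follows. To keep it runnable without a GPU, the flush, kernel, and timer are passed in as callables; the names profile_with_l2_flush and ProfileResult are hypothetical and are not taken from nvbench, Triton, or our artifact. On a real device, the flush would typically overwrite a scratch buffer at least as large as the L2 cache (for example with cudaMemsetAsync), and the timer callables would wrap cudaEventRecord, cudaEventSynchronize, and cudaEventElapsedTime as in GpuTimer above.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper (illustration only, not from nvbench, Triton, or
// sparsetir-artifact). Each launch is timed on its own, and flush_l2 runs
// before every launch but outside the timed region, so every measured run
// starts with a cold L2 cache.
struct ProfileResult {
  double total_msecs = 0.0;  // sum of the per-launch timings
  int launches = 0;          // number of timed (non-warm-up) launches
  double mean_msecs() const { return total_msecs / launches; }
};

ProfileResult profile_with_l2_flush(
    const std::function<void()>& flush_l2,      // e.g. memset an L2-sized buffer
    const std::function<void()>& kernel,        // the kernel launch under test
    const std::function<void()>& timer_start,   // e.g. record the start event
    const std::function<double()>& timer_stop,  // record + sync, returns msecs
    int warmup_iter, int repeat_iter) {
  assert(warmup_iter >= 0 && repeat_iter > 0);
  ProfileResult result;
  for (int iter = 0; iter < warmup_iter + repeat_iter; ++iter) {
    flush_l2();  // evict data left behind by the previous launch (not timed)
    timer_start();
    kernel();
    double msecs = timer_stop();
    if (iter >= warmup_iter) {  // discard the warm-up launches
      result.total_msecs += msecs;
      ++result.launches;
    }
  }
  return result;
}
```

Timing each launch separately, instead of placing one event pair around the whole loop, keeps the cost of the flush itself out of the reported kernel duration.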
We have corrected the profiling behavior of all libraries (e.g. Sputnik) and compilers in our C++ codebase and incorporated them into our artifact. Operators defined in Python are profiled through a unified interface. Note that TVM's native profiler incurs extra overhead when L2 flushing is enabled, so we use Triton's profiler for TVM kernels instead (see this PR, which is part of v1.3).
Based on our evaluations, the effect of the L2 cache cannot be neglected, particularly for "small" kernels whose accessed data is smaller than the L2 cache of commercial GPUs (e.g., 6MB for V100). For relatively large kernels, flushing L2 makes less of a difference. We were not previously aware of this issue; in the sparsetir-artifact, we have enabled L2 flush for all single-kernel benchmarks.
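As a rough illustration of what "small" means here, the sketch below estimates the bytes touched by one CSR SpMM call and compares the result with an L2 capacity. The function names and the assumption of 32-bit indices and values are ours, for illustration only; the estimate ignores cache-line granularity and any reuse inside the kernel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate (illustration only): bytes touched by one CSR SpMM,
//   C[rows x feat] = A[rows x cols, nnz non-zeros] * B[cols x feat],
// assuming 32-bit indices and 32-bit floating-point values.
std::int64_t spmm_footprint_bytes(std::int64_t rows, std::int64_t cols,
                                  std::int64_t nnz, std::int64_t feat) {
  assert(rows >= 0 && cols >= 0 && nnz >= 0 && feat >= 0);
  const std::int64_t elem = 4;                  // bytes per index or value
  std::int64_t indptr = (rows + 1) * elem;      // CSR row pointers
  std::int64_t indices = nnz * elem;            // CSR column indices
  std::int64_t values = nnz * elem;             // CSR non-zero values
  std::int64_t dense_in = cols * feat * elem;   // dense input B
  std::int64_t dense_out = rows * feat * elem;  // dense output C
  return indptr + indices + values + dense_in + dense_out;
}

// True when the whole footprint could stay resident in an L2 cache of
// l2_bytes, which is when repeated launches are most likely to hit in L2.
bool fits_in_l2(std::int64_t footprint_bytes, std::int64_t l2_bytes) {
  return footprint_bytes <= l2_bytes;
}
```

For example, a 1000 x 1000 matrix with 5000 non-zeros and a feature size of 32 touches about 0.3 MB under these assumptions, well below the 6MB L2 cache of a V100, so repeated launches without a flush would mostly hit in L2.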
There are performance gaps in SpMM, SDDMM, and GraphSAGE on the Reddit graph, which stem from updates to the DGL dependency. The sparse matrices used for evaluation are extracted from DGL's built-in datasets. In SparseTIRv3 and earlier, we used an older version of DGL that downloaded the original dataset from its source. In the sparsetir-artifact, we use DGL 0.9, which preprocesses the Reddit graph by applying the rcmk reordering. This preprocessing improves the locality of non-zero elements in the sparse matrices, so almost all SpMM implementations run faster. Consequently, the speedup of SparseTIR-hyb over cuSPARSE decreased from 2.3x to 1.5x on the V100. The speedups of the other libraries we compared decreased as well: Sputnik from 1.7x to 1.0x and dgSPARSE from 1.1x to 0.9x.
Other notable updates include:
- CUDA 11.6.1 -> CUDA 11.7.1
- PyTorch Geometric 2.0.4 -> PyTorch Geometric 2.2.0 (which integrates CUTLASS and improves RGCN performance; see Figure 20 for the changes).
We also updated Block-Sparse SpMM for Sparse Attention by conducting a parameter search, which makes it run faster on the V100.
During our communication with the artifact reviewers, we identified some common issues that they encountered while reviewing our artifact. We realized that some of these issues were classic mistakes that could have been avoided during the preparation of our artifact. Here are some key takeaways from our discussions with the reviewers:
- Always use https instead of ssh connections when working with submodules in git repositories.
  - Although using ssh connections to GitHub is common practice among programmers, it requires some setup (i.e., adding the local public key to the GitHub account). We cannot assume that all users have done so (e.g., users on a new AWS instance), and a missing key causes failures when cloning the repository. To avoid this, use https URLs for all submodules in .gitmodules and instruct users to clone the repository over https.
- Properly handle network connection issues.
  - Downloading datasets and benchmarks does not always succeed because of network connection issues. If we don't handle these failures properly, users may run into confusing errors in later steps.
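One way to handle download failures is to retry a bounded number of times and then stop with a clear error, rather than continuing with a missing or partial file. The sketch below is a generic retry wrapper; retry_with_backoff and its parameters are hypothetical names, and the download itself is passed in as a callable that returns true on success (for example, after verifying the file size or a checksum).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper (illustration only). Calls `attempt` until it returns
// true or `max_attempts` is exhausted, doubling the wait between attempts.
// Throws on final failure so that later steps never run on missing data.
int retry_with_backoff(const std::function<bool()>& attempt,
                       int max_attempts,
                       std::chrono::milliseconds initial_delay,
                       const std::string& what) {
  assert(max_attempts > 0);
  std::chrono::milliseconds delay = initial_delay;
  for (int tries = 1; tries <= max_attempts; ++tries) {
    if (attempt()) {
      return tries;  // number of attempts that were needed
    }
    if (tries < max_attempts) {
      std::this_thread::sleep_for(delay);
      delay *= 2;  // exponential backoff before the next attempt
    }
  }
  throw std::runtime_error("failed after " + std::to_string(max_attempts) +
                           " attempts: " + what);
}
```

Failing loudly at the download step, with the name of the missing resource in the message, points the user at the real cause instead of an unrelated error several steps later.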
Conducting artifact evaluation is a good practice for system papers as it enhances reproducibility and helps identify evaluation issues. During the process of preparing our artifact, we encountered several issues related to profiling and benchmark setup, which we have properly addressed. All results presented in our camera-ready paper were obtained by building and running the artifact from scratch.
We encourage you to try our artifact, particularly the latest version 1.3, where we have fixed all profiler behavior issues, and use it for your future research.