Verification of Producer-Consumer Synchronization in GPU Programs
Update
Important: as of 2022, some of the links below are dead, and these instructions may no longer work. For the most up-to-date instructions, please go to the GitHub repo. This page is no longer being maintained.
Weft
This webpage describes the artifact for the following PLDI 2015 paper:
Verification of Producer-Consumer Synchronization in GPU Programs
The artifact (a CDE package) can be downloaded here. The artifact link is dead, please refer to the GitHub repo above.
Improvements
The version of Weft documented here is significantly improved from the original version presented in our PLDI paper. This version of Weft has the following modifications:
- The functionality of PTXSim originally described in our paper
has been subsumed entirely by Weft. This version of Weft parses
an input PTX file and performs the emulation internally before
performing the validation. Consequently, the process of translating
from emulation to verification is significantly faster. The original
files that would have been generated by PTXSim can still be
generated using the
-p
flag. - The construction of the barrier dependence graph has been greatly optimized for higher performance.
- The representation of happens-before/after relationships have been optimized for reduced memory usage.
- Weft is now multi-threaded (see the
-t
flag) so that we can fully leverage the available memory bandwidth on modern processors which is crucial for achieving high performance.
As a result of our improvements, you may notice some differences between the version of PTXSim and Weft described in our original paper submission and the current source code. We assure you that this version of Weft has a super-set of the functionality of the original Weft and is anywhere from 2-30X faster than the original version depending on the input program and the machine architecture. We will document these improvements along with new performance results in the camera-ready version of our paper.
Instructions
Below we cover how to run Weft using our CDE package. The instructions for downloading, compiling, and running Weft from source are available at the github repository. Please consider reading the README to understand the advanced usage of WEFT.
This page provides instructions to reproduce the experiments described in the paper using the artifact. We assume a Linux based system, such as, Ubuntu.
First, unpack the archive
Next, go to the WEFT directory
This folder, referred to as the main directory, contains two sub-directories:
- The directory
src
contains the source code of WEFT with a precompiled binary weft. - The directory
examples
contains the examples described in the paper.
To run WEFT on the examples, run the following commands:
This shell file is just a wrapper around run_examples.sh which is the script we normally use to run all of our test examples. Alternatively, you can execute run_examples.sh directly from the terminal using the following command:
The ugly looking cde-exec prefix is important. It ensures
that the CDE package performs as expected. For more details,
please see the CDE user manual. The CDE user manual
link is dead.
This shell file first compiles WEFT. Since the package contains a precompiled binary of WEFT, you should see the following output.
Next, it generates the PTX file for the SAXPY kernels by compiling the CUDA files. Since, these files have been precompiled, you should again see
Next WEFT analyzes the SAXPY kernels starting with saxpy_single.
The output says that WEFT has proven that the kernel is well-synchronized (deadlock free and recycles barriers properly) and race-free followed by some statistics (see paper). There are 320 threads in the CTA we are analyzing and these threads share 512 memory locations amongst them. This CTA uses 6 named barriers out of the 16 available. Since named barriers are a limited physical resource, they need to be reused and this CTA has 8192 distinct uses of the barriers. The PTX file (with statically analyzable loops) has 169 statements. After unrolling all loops the kernel has about 9 million instructions (across 320 threads). After translating these to the modeling language, WEFT analyzed about three million statements, that is, about hundred thousand statements per thread on average. WEFT proved that the 1.8 billion memory access pairs that can potentially race are safe (do not race).
The shell script continues to analyze all benchmarks described in the paper. Next, consider a kernel with a different output:
This kernel has shuffle instructions that require a warp-synchronous execution
assumption. In this case, the shell script does not provide the -s
flag which enables
the warp-synchronous execution assumption. Therefore WEFT emits a warning
and enables the warp-synchronous assumption. This kernel is well-synchronized,
but it does have data races and WEFT shows the lines in the CUDA file that race.
More information can be obtained by running the kernel with -d
flag.
Compiling and Running Weft
If you want to compile WEFT for your machine (and not use the precompiled binary) then you can recompile the source using the following (from the main directory):
Similarly, if you want to regenerate the SAXPY PTX files from the CUDA files then you can do the following (again from the main directory):
The other examples can also be compiled similarly.
Weft can also be run on individual examples using the cde-exec command. We encourage you to use the â-iâ flag which instruments the execution of Weft and give detailed performance results including the timing and memory usage for each stage of execution. If you also have a multi-core machine we encourage you to explore the performance effects of variable number of threads using the â-tâ flag.
Authors
License
Copyright
Stanford University and NVIDIA