Verification of Producer-Consumer Synchronization in GPU Programs

Update

Important: as of 2022, some of the links below are dead, and these instructions may no longer work. For the most up-to-date instructions, please go to the GitHub repo. This page is no longer being maintained.

Weft

This webpage describes the artifact for the following PLDI 2015 paper:

Verification of Producer-Consumer Synchronization in GPU Programs

The artifact (a CDE package) could previously be downloaded here. The artifact link is now dead; please refer to the GitHub repo above.

Improvements

The version of Weft documented here is significantly improved from the original version presented in our PLDI paper. This version of Weft has the following modifications:

  • The functionality of PTXSim originally described in our paper has been subsumed entirely by Weft. This version of Weft parses an input PTX file and performs the emulation internally before performing the validation. Consequently, the process of translating from emulation to verification is significantly faster. The original files that would have been generated by PTXSim can still be generated using the -p flag.
  • The construction of the barrier dependence graph has been greatly optimized for higher performance.
  • The representation of happens-before/after relationships has been optimized for reduced memory usage.
  • Weft is now multi-threaded (see the -t flag), allowing us to fully leverage the available memory bandwidth on modern processors, which is crucial for achieving high performance. An example invocation using these flags is shown after this list.
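
As a rough illustration of how these flags fit together, the following invocation (the kernel name, thread count, and working directory are assumptions based on the examples later on this page) dumps the PTXSim-style files and runs with four worker threads:

# Hypothetical invocation: -p dumps the PTXSim-style files, -t sets the number
# of worker threads, and -n gives the number of threads in the CTA.
./weft -p -t 4 -n 320 saxpy_single.ptx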

As a result of our improvements, you may notice some differences between the version of PTXSim and Weft described in our original paper submission and the current source code. We assure you that this version of Weft has a superset of the functionality of the original Weft and is anywhere from 2-30X faster than the original version, depending on the input program and the machine architecture. We will document these improvements along with new performance results in the camera-ready version of our paper.

Instructions

Below we cover how to run Weft using our CDE package. The instructions for downloading, compiling, and running Weft from source are available at the GitHub repository. Please consider reading the README to understand the advanced usage of Weft.

This page provides instructions to reproduce the experiments described in the paper using the artifact. We assume a Linux-based system, such as Ubuntu.

First, unpack the archive:

tar -zxvf cde-weft.tar.gz 

Next, go to the Weft directory:

cd cde-package/cde-root/home/sharmar/Desktop/artifact/Weft-master/ 

This folder, referred to as the main directory, contains two sub-directories:

  • The directory src contains the source code of Weft with a precompiled binary weft.
  • The directory examples contains the examples described in the paper.

To run Weft on the examples, run the following commands:

cd examples/
./run_examples.sh.cde

This shell file is just a wrapper around run_examples.sh, which is the script we normally use to run all of our test examples. Alternatively, you can execute run_examples.sh directly from the terminal using the following command:

../../../../../../../cde-exec ./run_examples.sh

The ugly-looking cde-exec prefix is important: it ensures that the CDE package performs as expected. For more details, please see the CDE user manual (note: this link is dead).
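
For context, run_examples.sh is roughly structured as follows. This is a sketch of the assumed structure, not the verbatim script; consult the file itself for the authoritative version:

# Rough sketch of run_examples.sh (assumed structure, run from examples/).
(cd ../src && make)                           # build Weft (a no-op if already built)
for dir in */; do                             # each example subdirectory
    (cd "$dir" && make)                       # compile the CUDA files to PTX
done
../src/weft -n 320 saxpy/saxpy_single.ptx     # verify a kernel (repeated per benchmark)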

This shell file first compiles Weft. Since the package contains a precompiled binary of Weft, you should see the following output:

make: Nothing to be done for 'all'

Next, it generates the PTX files for the SAXPY kernels by compiling the CUDA files. Since these files have been precompiled, you should again see:

make: Nothing to be done for 'all'
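
For reference, the PTX generation performed by the Makefile is roughly equivalent to invoking nvcc directly. The source file name and architecture flag below are assumptions:

# Hypothetical direct invocation; the example Makefiles do something similar.
nvcc -ptx -arch=sm_35 saxpy_single.cu -o saxpy_single.ptx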

Next, Weft analyzes the SAXPY kernels, starting with saxpy_single:

Running saxpy single...
WEFT INFO: No deadlocks detected!
WEFT INFO: Barriers properly recycled!
WEFT INFO: No races detected!
WEFT STATISTICS for saxpy_single.ptx
  CTA Thread Count:                      320
  Shared Memory Locations:               512
  Physical Named Barriers:                 6
  Dynamic Barrier Instances:            8192
  Static Instructions:                   169
  Dynamic Instructions:              9575616
  Weft Statements:                   3670016
  Total Race Tests:               1878392832

The output says that Weft has proven that the kernel is well-synchronized (deadlock-free and recycles barriers properly) and race-free, followed by some statistics (see the paper). There are 320 threads in the CTA we are analyzing, and these threads share 512 memory locations amongst them. This CTA uses 6 named barriers out of the 16 available. Since named barriers are a limited physical resource, they need to be reused, and this CTA has 8192 distinct uses of the barriers. The PTX file (with statically analyzable loops) has 169 statements. After unrolling all loops, the kernel has about 9.6 million dynamic instructions (across the 320 threads). After translating these to the modeling language, Weft analyzed about 3.7 million statements, that is, over eleven thousand statements per thread on average. Weft proved that the roughly 1.9 billion memory access pairs that could potentially race are in fact safe (do not race).
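
The per-thread averages quoted above follow directly from the statistics; a quick check with shell arithmetic:

# Per-thread averages derived from the saxpy_single statistics above.
echo $(( 9575616 / 320 ))   # 29923, roughly 30 thousand dynamic instructions per thread
echo $(( 3670016 / 320 ))   # 11468, roughly 11.5 thousand Weft statements per thread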

The shell script continues to analyze all benchmarks described in the paper. Next, consider a kernel with a different output:

Running DME chemistry kepler...
WEFT WARNING: Program has shuffle instructions but warp-synchronous execution was not assumed!
Enabling warp-synchronous assumption...
WEFT INFO: No deadlocks detected!
WEFT INFO: Barriers properly recycled!
	Found races between 32 pairs of threads on line 3248 of chem_kepler.cu and line 3115 of chem_kepler.cu
...
WEFT INFO: Found 7168 total races!
           Run with '-d' flag to see detailed per-thread and per-address races
WEFT INFO: RACES DETECTED!
WEFT STATISTICS for chem_kepler.ptx
...

This kernel has shuffle instructions that require a warp-synchronous execution assumption. In this case, the shell script does not provide the -s flag, which enables the warp-synchronous execution assumption, so Weft emits a warning and enables the assumption itself. This kernel is well-synchronized, but it does have data races, and Weft shows the lines in the CUDA file that race. More information can be obtained by running the kernel with the -d flag.
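
To reproduce this directly with the warp-synchronous assumption enabled and detailed race reporting, one could run something like the following from the example's directory. The CTA thread count here is an assumption; check the example's Makefile for the real value:

# Hypothetical invocation (run via cde-exec as above): -s assumes warp-synchronous
# execution up front, -d prints detailed per-thread and per-address races.
../../src/weft -s -d -n 512 chem_kepler.ptx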

Compiling and Running Weft

If you want to compile Weft for your machine (rather than use the precompiled binary), you can recompile the source using the following commands (from the main directory):

cd src
../../../../../../../cde-exec make clean
../../../../../../../cde-exec make

Similarly, if you want to regenerate the SAXPY PTX files from the CUDA files, you can do the following (again from the main directory):

cd examples/saxpy
../../../../../../../../cde-exec make clean
../../../../../../../../cde-exec make

The other examples can also be compiled similarly.
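
The per-example rebuilds can also be scripted in one pass. The sketch below assumes you start in the main directory and that cde-exec sits six levels up, as in the paths above; adjust if your layout differs:

# Sketch: rebuild every example in one pass (run from the main directory).
CDE="$PWD/../../../../../../cde-exec"   # resolve cde-exec once (assumed location)
for dir in examples/*/; do
    (cd "$dir" && "$CDE" make clean && "$CDE" make)
done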

Weft can also be run on individual examples using the cde-exec command. We encourage you to use the -i flag, which instruments the execution of Weft and gives detailed performance results, including the timing and memory usage for each stage of execution. If you have a multi-core machine, we encourage you to explore the performance effects of varying the number of threads using the -t flag.

cd examples/saxpy
../../../../../../../../cde-exec ../../src/weft -i -t 4 -n 320 saxpy_single.ptx
WEFT INFO: No deadlocks detected!
WEFT INFO: Barriers properly recycled!
WEFT INFO: No races detected!
WEFT STATISTICS for saxpy_single.ptx
  CTA Thread Count:                      320
  Shared Memory Locations:               512
  Physical Named Barriers:                 6
  Dynamic Barrier Instances:            8192
  Static Instructions:                   177
  Dynamic Instructions:             11674560
  Weft Statements:                   3670016
  Total Race Tests:               1878392832
WEFT INSTRUMENTATION
                                   Parse PTX:      1.385 ms            0 MB
                             Emulate Threads:   3278.410 ms          267 MB
          Construct Barrier Dependence Graph:    838.145 ms           89 MB
  Compute Happens-Before/After Relationships:   1558.494 ms         3741 MB
                   Check for Race Conditions:   1978.978 ms            0 MB
                                       Total:   7655.412 ms         4099 MB

License

APACHE 2.0

Stanford University and NVIDIA