Publications
Table of Contents
- Legion Runtime:
- Programming Model:
- Realm:
- Regent:
- Bindings:
- Libraries and Techniques:
- Visualization (ISAV 2017) [PDF]
- Graph Processesing (VLDB 2018) [PDF, Software Release]
- Legate NumPy (SC 2019) [PDF]
- Tensor Algebra (PLDI 2022) [PDF]
- Distributed Task Fusion (PAW-ATM 2022) [PDF]
- Sparse Tensor Algebra (SC 2022) [PDF]
- Legate Sparse (SC 2023) [PDF]
- AutoMap (SC 2023) [PDF]
- Task and Kernel Fusion (ASPLOS 2025) [PDF]
- Applications:
- DSLs:
- Theses:
Papers
Legion: Expressing Locality and Independence with Logical Regions PDF
Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2012)
Abstract: Modern parallel architectures have both heterogeneous processors and deep, complex
memory hierarchies. We present Legion, a programming model and runtime system
for achieving high performance on these machines. Legion is
organized around logical regions, which express both locality and
independence of program data, and tasks, functions that perform
computations on regions. We describe a runtime system that
dynamically extracts parallelism from Legion programs, using
a distributed, parallel scheduling algorithm that identifies both
independent tasks and nested parallelism. Legion also enables
explicit, programmer controlled movement of data through the
memory hierarchy and placement of tasks based on locality
information via a novel mapping interface. We evaluate our
Legion implementation on three applications: fluid-flow on a
regular grid, a three-level AMR code solving a heat diffusion
equation, and a circuit simulation.
Language Support for Dynamic, Hierarchical Data Partitioning PDF
Sean Treichler, Michael Bauer, Alex Aiken
In Object Oriented Programming, Systems, Languages,
and Applications (OOPSLA 2013)
Abstract: Applications written for distributed-memory parallel architectures must
partition their data to enable parallel execution. As memory hierarchies become deeper,
it is increasingly necessary that the data partitioning also be hierarchical to match.
Current language proposals perform this hierarchical partitioning statically, which excludes
many important applications where the appropriate partitioning is itself data
dependent and so must be computed dynamically. We describe Legion, a region-based
programming system, where each region may be partitioned into subregions. Partitions
are computed dynamically and are fully programmable. The division of data need not be
disjoint and subregions of a region may overlap, or alias one another. Computations use
regions with certain privileges (e.g., expressing that a computation uses a region
read-only) and data coherence (e.g., expressing that the computation need only be atomic with
respect to other operations on the region), which can be controlled on a per-region
(or subregion) basis.
We present the novel aspects of the Legion design, in particular the combination of static and dynamic checks used to enforce soundness. We give an extended example illustrating how Legion can express computations with dynamically determined relationships between computations and data partitions. We prove the soundness of Legion’s type system, and show Legion type checking improves performance by up to 71% by eliding provably safe memory checks. In particular, we show that the dynamic checks to detect aliasing at runtime at the region granularity have negligible overhead. We report results for three real-world applications running on distributed memory machines, achieving up to 62.5X speedup on 96 GPUs on the Keeneland supercomputer.
Realm: An Event-Based Low-Level Runtime for Distributed Memory Architectures PDF
Sean Treichler, Michael Bauer, Alex Aiken
In Parallel Architectures and Compilation Techniques (PACT 2014)
Abstract: We present Realm, an event-based runtime system for heterogeneous,
distributed memory machines. Realm is fully asynchronous: all runtime actions
are non-blocking. Realm supports spawning computations, moving data, and
reservations, a novel synchronization primitive. Asynchrony is exposed
via a light-weight event system capable of operating without central management.
We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Micro-benchmark experiments show our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.
Structure Slicing: Extending Logical Regions with Fields PDF
Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014)
Abstract: Applications on modern supercomputers are increasingly limited by the
cost of data movement, but mainstream programming systems have few abstractions for
describing the structure of a program’s data. Consequently, the burden of managing
data movement, placement, and layout currently falls primarily upon the programmer
To address this problem, we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that structure slicing enables Legion to automatically infer task parallelism from field non-interference, decouple the specification of data usage from layout, and reduce the overall amount of data moved. We demonstrate that structure slicing enables both strong and weak scaling of three Legion applications, including S3D, a production combustion simulation that uses logical regions and thousands of fields, with speedups of up to 3.68X over a vectorized CPU-only Fortran implementation and 1.88X over an independently hand-tuned OpenACC code.
Note: The following paper is a result of our collaboration with the ExaCT
Combustion Co-Design Center and shows how a DSL
compiler can be used to generate fast tasks for Legion applications.
Singe: Leveraging Warp Specialization for High Performance on GPUs PDF
Michael Bauer, Sean Treichler, Alex Aiken
In Principles and Practices of Parallel Programming (PPoPP 2014)
Abstract: We present Singe, a Domain Specific Language (DSL) compiler for combustion
chemistry that leverages warp specialization to produce high performance code for GPUs.
Instead of relying on traditional GPU programming models that emphasize data-parallel
computations, warp specialization allows compilers like Singe to partition computations
into sub-computations, which are then assigned to different warps within a thread block.
Fine-grain synchronization between warps is performed efficiently in hardware using
producer-consumer named barriers. Partitioning computations using warp specialization
allows Singe to deal efficiently with the irregularity in both data access patterns
and computation. Furthermore, warp-specialized partitioning of computations allows
Singe to fit extremely large working sets into on-chip memories. Finally, we describe
the architecture and general computation techniques necessary for constructing a
warp-specializing compiler. We show that the warp-specialized code emitted by Singe
is up to 3.75X faster than previously optimized data-parallel GPU kernels.
Note: The following paper is an example of a DSL
compiler toolchain that targets Legion as a backend.
Exploring the Construction of a Domain-Aware Toolchain for High-Performance Computing PDF
Patrick McCormick, Christine Sweeney, Nick Moss, Dean Prichard,
Samuel K. Gutierrez, Kei Davis, Jamaludin Mohd-Yusof
In the International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC 2014)
Abstract: The push towards exascale computing has sparked
a new set of explorations for providing new productive programming
environments. While many efforts are focusing on
the design and development of domain-specific languages (DSLs),
few have addressed the need for providing a fully domain-aware
toolchain. Without such domain awareness critical features for
achieving acceptance and adoption, such as debugger support,
pose a long-term risk to the overall success of the DSL approach.
In this paper we explore the use of language extensions to
design and implement the Scout DSL and a supporting toolchain
infrastructure. We highlight how language features and the
software design methodologies used within the toolchain play
a significant role in providing a suitable environment for DSL
development.
Regent: A High-Productivity Programming Language for HPC with Logical Regions PDF
Elliott Slaughter, Wonchan Lee, Sean Treichler, Michael Bauer, and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015)
Abstract: We present Regent, a high-productivity programming language for high
performance computing with logical regions. Regent users compose
programs with tasks (functions eligible for parallel execution) and
logical regions (hierarchical collections of structured
objects). Regent programs appear to execute sequentially, require no
explicit synchronization, and are trivially deadlock-free. Regent’s
type system catches many common classes of mistakes and guarantees
that a program with correct serial execution produces
identical results on parallel and distributed machines.
We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.
Dependent Partitioning PDF
Sean Treichler, Michael Bauer, Rahul Sharma, Elliott Slaughter, and Alex Aiken
In Object Oriented Programming, Systems, Languages,
and Applications (OOPSLA 2016)
Abstract: A key problem in parallel programming is how data is
partitioned: divided into subsets that can be operated on in
parallel and, in distributed memory machines, spread across multiple
address spaces.
We present a dependent partitioning framework that allows an application to concisely describe relationships between partitions. Applications first establish independent partitions, which may contain arbitrary subsets of application data, permitting the expression of arbitrary application-specific data distributions. Dependent partitions are then derived from these using the dependent partitioning operations provided by the framework. By directly capturing inter-partition relationships, our framework can soundly and precisely reason about programs to perform important program analyses crucial to ensuring correctness and achieving good performance. As an example of the reasoning made possible, we present a static analysis that discharges most consistency checks on partitioned data during compilation.
We describe an implementation of our framework within Regent, a language designed for the Legion programming model. The use of dependent partitioning constructs results in a 86-96% decrease in the lines of code required to describe the partitioning, the elimination of many of the expensive dynamic checks required for soundness by the current Regent partitioning implementation, and speeds up the computation of partitions by 2.6-12.7X even on a single thread. Furthermore, we show that a distributed implementation incorporated into the the Legion runtime system allows partitioning of data sets that are too large to fit on a single node and yields an additional 29X speedup of partitioning operations on 64 nodes.
Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions PDF
Elliott Slaughter, Wonchan Lee, Sean Treichler, Wen Zhang, Michael Bauer, Galen Shipman, Patrick McCormick and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2017)
Abstract: We present control replication, a technique for generating
high-performance and scalable SPMD code from implicitly parallel
programs. In contrast to traditional parallel programming models that
require the programmer to explicitly manage threads and the
communication and synchronization between them, implicitly parallel
programs have sequential execution semantics and by their nature avoid
the pitfalls of explicitly parallel programming. However, without
optimizations to distribute control overhead, scalability is often
poor.
Performance on distributed-memory machines is especially sensitive to communication and synchronization in the program, and thus optimizations for these machines require an intimate understanding of a program’s memory accesses. Control replication achieves particularly effective and predictable results by leveraging language support for first-class data partitioning in the source programming model. We evaluate an implementation of control replication for Regent and show that it achieves up to 99% parallel efficiency at 1024 nodes with absolute performance comparable to hand-written MPI(+X) codes.
Integrating External Resources with a Task-Based Programming Model PDF
Zhihao Jia, Sean Treichler, Galen Shipman, Michael Bauer, Noah Watkins, Carlos Maltzahn, Patrick McCormick and Alex Aiken
In the International Conference on High Performance Computing, Data, and Analytics (HiPC 2017)
Abstract: Accessing external resources (e.g., loading input
data, checkpointing snapshots, and out-of-core processing) can
have a significant impact on the performance of applications. However, no existing programming systems for high-performance computing directly manage and optimize external
accesses. As a result, users must explicitly manage external
accesses alongside their computation at the application level,
which can result in both correctness and performance issues.
We address this limitation by introducing Iris, a task-based programming model with semantics for external resources. Iris allows applications to describe their access requirements to external resources and the relationship of those accesses to the computation. Iris incorporates external I/O into a deferred execution model, reschedules external I/O to overlap I/O with computation, and reduces external I/O when possible. We evaluate Iris on three microbenchmarks representative of important workloads in HPC and a full combustion simulation, S3D. We demonstrate that the Iris implementation of S3D reduces the external I/O overhead by up to 20x, compared to the Legion and the Fortran implementations.
In Situ Visualization with Task-based Parallelism PDF
Alan Heirich, Elliott Slaughter, Manolis Papadakis, Wonchan Lee, Tim Biedert and Alex Aiken
In the Workshop on In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization (ISAV 2017)
Abstract: This short paper describes an experimental prototype of in situ
visualization in a task-based parallel programming framework. A
set of reusable visualization tasks were composed with an existing
simulation. The visualization tasks include a local OpenGL renderer, a parallel image compositor, and a display task. These tasks
were added to an existing fluid-particle-radiation simulation and
weak scaling tests were run on up to 512 nodes of the Piz Daint
supercomputer. Benchmarks showed that the visualization components scaled and did not reduce the simulation throughput. The
compositor latency increased logarithmically with increasing node
count.
S3D-Legion: An Exascale Software for Direct Numerical Simulation of Turbulent Combustion with Complex Multicomponent Chemistry PDF
Sean Treichler, Michael Bauer, Ankit Bhagatwala, Giulio Borghesi, Ramanan Sankaran, Hemanth Kolla, Patrick S. McCormick, Elliott Slaughter, Wonchan Lee, Alex Aiken and Jacqueline Chen
In Exascale Scientific Applications: Scalability and Performance Portability (CRC Press, 2017)
A Distributed Multi-GPU System for Fast Graph Processing PDF, Software Release
Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez and Alex Aiken
In the International Conference on Very Large Data Bases (VLDB 2018)
Abstract: We present Lux, a distributed multi-GPU system that
achieves fast graph processing by exploiting the aggregate memory
bandwidth of multiple GPUs and taking advantage of locality in the
memory hierarchy of multi-GPU clusters. Lux provides two execution
models that optimize algorithmic efficiency and enable important GPU
optimizations, respectively. Lux also uses a novel dynamic load
balancing strategy that is cheap and achieves good load balance across
GPUs. In addition, we present a performance model that quantitatively
predicts the execution times and automatically selects the runtime
configurations for Lux applications. Experiments show that Lux
achieves up to 20× speedup over state-of-the-art shared memory systems
and up to two orders of magnitude speedup over distributed systems.
Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-Based Runtimes PDF
Wonchan Lee, Elliott Slaughter, Michael Bauer, Sean Treichler, Todd Warszawski, Michael Garland and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2018)
Abstract: Many recent programming systems for both supercomputing
and data center workloads generate task graphs to express computations
that run on parallel and distributed machines. Due to the overhead
associated with constructing these graphs the dependence analysis that
generates them is often statically computed and memoized, and the
resulting graph executed repeatedly at runtime. However, many
applications require a dynamic dependence analysis due to data
dependent behavior, but there are new challenges in capturing and
re-executing task graphs at runtime. In this work, we introduce
dynamic tracing, a technique to capture a dynamic dependence analysis
of a trace that generates a task graph, and replay it. We show that an
implementation of dynamic tracing improves strong scaling by an
average of 4.9X and up to 7.0X on a suite of already optimized
benchmarks.
Correctness of Dynamic Dependence Analysis for Implicitly Parallel Tasking Systems PDF
Wonchan Lee, George Stelle, Patrick McCormick and Alex Aiken
In the International Workshop on Software Correctness for HPC Applications (Correctness 2018)
Abstract: In this paper, we rigorously verify the correctness of
dynamic dependence analysis, a key algorithm for parallelizing
programs in implicitly parallel tasking systems. A dynamic dependence
analysis of a program results in a task graph, a DAG of tasks
constraining the order of task execution. Because a program is
automatically parallelized based on its task graph, the analysis
algorithm must generate a graph with all the dependencies that are
necessary to preserve the program’s original semantics for any
non-deterministic parallel execution of tasks. However, this
correctness is not straightforward to verify as implicitly parallel
tasking systems often use an optimized dependence analysis
algorithm. To study the correctness of dynamic dependence analysis in
a realistic setting, we design a model algorithm that captures the
essence of realistic analysis algorithms. We prove that this algorithm
constructs task graphs that soundly and completely express correct
parallel executions of programs. We also show that the generated task
graph is the most succinct one for a program when the program
satisfies certain conditions.
Soleil-X: Turbulence, Particles, and Radiation in the Regent Programming Language PDF
Hilario Torres, Manolis Papadakis, Lluís Jofre, Wonchan Lee, Alex Aiken and Gianluca Iaccarino
In Bulletin of the American Physical Society (2018)
Abstract: The Predictive Science Academic Alliance Program (PSAAP)
II at Stanford University is developing an Exascale-ready
multi-physics solver to investigate particle-laden turbulent flows in
a radiation environment for solar energy receiver applications. To
simulate the proposed concentrated particle-based receiver design
three distinct but coupled physical phenomena must be modeled: fluid
flows, Lagrangian particle dynamics, and the transport of thermal
radiation. Therefore, three different physics solvers (fluid,
particles, and radiation) must run concurrently with significant
cross-communication in an integrated multi-physics
simulation. However, each solver uses substantially different
algorithms and data access patterns. Coordinating the overall data
communication, computational load balancing, and scaling these
different physics solvers together on modern massively parallel,
heterogeneous high performance computing systems presents several
major challenges. We have adopted the Legion programming system, via
the Regent programming language, and its task parallel programming
model to address these challenges. Our multi-physics solver Soleil-X
is written entirely in the high level Regent programming language and
is one of the largest and most complex applications written in Regent
to date. At this workshop we will give an overview of the software
architecture of Soleil-X as well as discuss how our multi-physics
solver was designed to use the task parallel programming model
provided by Legion. We will also discuss the developmentexperience,
scaling, performance, portability, and multi-physics simulation
results.
Legate NumPy: Accelerated and Distributed Array Computing PDF
Michael Bauer and Michael Garland
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019)
Abstract: NumPy is a popular Python library used for performing
array-based numerical computations. The canonical implementation of
NumPy used by most programmers runs on a single CPU core and only a
few operations are parallelized across cores. This restriction to
single-node CPU-only execution limits both the size of data that can
be processed and the speed with which problems can be solved. In this
paper we introduce Legate, a programming system that transparently
accelerates and distributes NumPy programs to machines of any scale
and capability typically by changing a single module import
statement. Legate achieves this by translating the NumPy application
interface into the Legion programming model and leveraging the
performance and scalability of the Legion runtime. We demonstrate that
Legate can achieve state-of-the-art scalability when running NumPy
programs on machines with up to 1280 CPU cores and 256 GPUs, allowing
users to prototype on their desktop and immediately scale up to
significantly larger machines. Furthermore, we demonstrate that Legate
can achieve between one and two orders of magnitude better performance
than the popular Python library Dask when running comparable programs
at scale.
A Constraint-Based Approach to Automatic Data Partitioning for Distributed Memory Execution PDF
Wonchan Lee, Manolis Papadakis, Elliott Slaughter and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019)
Abstract: Although data partitioning is required to enable
parallelism on distributed memory systems, data partitions are not
first class objects in most distributed programming models. As a
result, automatic parallelizers and application writers encode a
particular partitioning strategy in the parallelized program, leading
to a program not easily configured or composed with other parallel
programs.
We present a constraint-based approach to automatic data partitioning. By introducing abstractions for first-class data partitions, we express a space of correct partitioning strategies. Candidate partitions are characterized by partitioning constraints, which can be automatically inferred from data accesses in parallelizable loops. Constraints can be satisfied by synthesized partitioning code or user-provided partitions. We demonstrate that programs auto-parallelized in our approach are easily composed with manually parallelized parts and have scalability comparable to hand-optimized counterparts.
Pygion: Flexible, Scalable Task-Based Parallelism with Python PDF
Elliott Slaughter and Alex Aiken
In the Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM 2019)
Abstract: Dynamic languages provide the flexibility needed to
implement expressive support for task-based parallel programming
constructs. We present Pygion, a Python interface for the Legion
task-based programming system, and show that it can provide features
comparable to Regent, a statically typed programming language with
dedicated support for the Legion programming model. Furthermore, we
show that the dynamic nature of Python permits the implementation of
several key optimizations (index launches, futures, mapping) currently
implemented in the Regent compiler. Together these features enable
Pygion code that is comparable in expressiveness but more flexible
than Regent, and substantially more concise, less error prone, and
easier to use than C++ Legion code. Pygion is designed to interoperate
with Regent and can use Regent to generate high-performance CPU and
GPU kernel implementations. We show that, in combination with
high-performance kernels written in Regent, Pygion is able to achieve
efficient, scalable execution on up to 512 nodes of the heterogeneous
supercomputer Piz Daint.
HTR Solver: An Open-Source Exascale-Oriented Task-Based Multi-GPU High-Order Code for Hypersonic Aerothermodynamics PDF
Mario Di Renzo, Lin Fu and Javier Urzay
In Computer Physics Communications (2020)
Abstract: In this study, the open-source Hypersonics Task-based
Research (HTR) solver for hypersonic aerothermodynamics is
described. The physical formulation of the code includes
thermochemical effects induced by high temperatures (vibrational
excitation and chemical dissociation). The HTR solver uses high-order
TENO-based spatial discretization on structured grids and efficient
time integrators for stiff systems, is highly scalable in GPU-based
supercomputers as a result of its implementation in the Regent/Legion
stack, and is designed for direct numerical simulations of canonical
hypersonic flows at high Reynolds numbers. The performance of the HTR
solver is tested with benchmark cases including inviscid vortex
advection, low- and high-speed laminar boundary layers, inviscid
one-dimensional compressible flows in shock tubes, supersonic
turbulent channel flows, and hypersonic transitional boundary layers
of both calorically perfect gases and dissociating air.
Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance PDF
Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, Nicolai Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Qinglei Cao, George Bosilca, Seema Mirchandaney, Wonchan Lee, Sean Treichler, Patrick McCormick, and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020)
Abstract: We present Task Bench, a parameterized benchmark designed to
explore the performance of distributed programming systems under a
variety of application scenarios. Task Bench dramatically lowers the
barrier to benchmarking and comparing multiple programming systems by
making the implementation for a given system orthogonal to the
benchmarks themselves: every benchmark constructed with Task Bench
runs on every Task Bench implementation. Furthermore, Task Bench’s
parameterization enables a wide variety of benchmark scenarios that
distill the key characteristics of larger applications.
To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100μs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system’s scalability, ability to hide communication and mitigate load imbalance.
An Implicitly Parallel Meshfree Solver in Regent PDF
Rupanshu Soi, Nischay Ram Mamidi, Elliott Slaughter, Kumar Prasun, Anil Nemili, and S.M. Deshpande
In the Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM 2020)
Abstract: This paper presents the development of a Regent
based implicitly parallel meshfree solver for inviscid compressible
fluid flows. The meshfree solver is based on the Least Squares
Kinetic Upwind Method (LSKUM). The performance of the
Regent parallel solver is assessed by comparing with the explicitly
parallel versions of the same solver written in Fortran 90 and
Julia. The Fortran code uses MPI with PETSc libraries, while the
Julia code uses an MPI + X alternative parallel library. Numerical
results are shown to assess the performance of these solvers on
single and multiple CPU nodes.
Scaling Implicit Parallelism via Dynamic Control Replication PDF
Michael Bauer, Wonchan Lee, Elliott Slaughter, Zhihao Jia, Mario Di Renzo, Manolis Papadakis, Galen Shipman, Patrick McCormick, Michael Garland, and Alex Aiken
In Principles and Practices of Parallel Programming (PPoPP 2021)
Abstract: We present dynamic control replication, a run-time
program analysis that enables scalable execution of implicitly
parallel programs on large machines through a distributed and
efficient dynamic dependence analysis. Dynamic control replication
distributes dependence analysis by executing multiple copies of an
implicitly parallel program while ensuring that they still
collectively behave as a single execution. By distributing and
parallelizing the dependence analysis, dynamic control replication
supports efficient, on-the-fly computation of dependences for programs
with arbitrary control flow at scale. We describe an asymptotically
scalable algorithm for implementing dynamic control replication that
maintains the sequential semantics of implicitly parallel programs.
An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as writing in other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X respectively in our experiments), and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.
Index Launches: Scalable, Flexible Representation of Parallel Task Groups PDF
Rupanshu Soi, Michael Bauer, Sean Treichler, Manolis Papadakis, Wonchan Lee, Patrick McCormick, Alex Aiken, and Elliott Slaughter
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2021)
Abstract: It’s common to see specialized language constructs in
modern task-based programming systems for reasoning about groups of
independent tasks intended for parallel execution. However, most
systems use an ad-hoc representation that limits expressiveness and
often overfits for a given application domain. We introduce index
launches, a scalable and flexible representation of a group of tasks.
Index launches use a flexible mechanism to indicate the data required
for a given task, allowing them to be used for a much broader set of
use cases while maintaining an efficient representation. We present a
hybrid design for index launches, involving static and dynamic program
analyses, along with a characterization of how they’re used in Legion
and Regent, and show how they generalize constructs found in other
task-based systems. Finally, we present results of scaling
experiments which demonstrate that index launches are crucial for the
efficient distributed execution of several scientific codes in Regent.
DISTAL: The Distributed Tensor Algebra Compiler PDF
Rohan Yadav, Alex Aiken and Fredrik Kjolstad
In the Conference on Programming Language Design and Implementation (PLDI 2022)
Abstract: We introduce DISTAL, a compiler for dense tensor algebra
that targets modern distributed and heterogeneous systems. DISTAL lets
users independently describe how tensors and computation map onto
target machines through separate format and scheduling languages. The
combination of choices for data and computation distribution creates a
large design space that includes many algorithms from both the past
(e.g., Cannon’s algorithm) and the present (e.g., COSMA). DISTAL
compiles a tensor algebra domain specific language to a distributed
task-based runtime system and supports nodes with multi-core CPUs and
multiple GPUs. Code generated by is competitive with optimized codes
for matrix multiply on 256 nodes of the Lassen supercomputer and
outperforms existing systems by between 1.8x to 3.7x (with a 45.7x
outlier) on higher order tensor operations.
Task Fusion in Distributed Runtimes PDF
Shiv Sundram, Wonchan Lee and Alex Aiken
In the Parallel Applications Workshop, Alternatives to MPI+X (PAW-ATM 2022)
Abstract: We present distributed task fusion, a runtime
optimization for task-based runtimes operating on parallel and
heterogeneous systems. Distributed task fusion dynamically performs an
efficient buffering, analysis, and fusion of asynchronously-evaluated
distributed operations, reducing the overheads inherent to scheduling
distributed tasks in implicitly parallel frameworks and runtimes. We
identify the constraints under which distributed task fusion is
permissible and describe an implementation in Legate, a
domain-agnostic library for constructing portable and scalable
task-based libraries. We present performance results using cuNumeric,
a Legate library that enables scalable execution of NumPy pipelines on
parallel and heterogeneous systems. We realize speedups up to 1.5x
with task fusion enabled on up to 32 P100 GPUs, thus demonstrating
efficient execution of pipelines involving many successive finegrained
tasks. Finally, we discuss potential future work, including
complementary optimizations that could result in additional
performance improvements.
SpDISTAL: Compiling Distributed Sparse Tensor Computations PDF
Rohan Yadav, Alex Aiken and Fredrik Kjolstad
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2022)
Abstract: We introduce SpDISTAL, a compiler for sparse tensor
algebra that targets distributed systems. SpDISTAL combines separate
descriptions of tensor algebra expressions, sparse data structures,
data distribution, and computation distribution. Thus, it enables
distributed execution of sparse tensor algebra expressions with a wide
variety of sparse data structures and data distributions. SpDISTAL is
implemented as a C++ library that targets a distributed task-based
runtime system and can generate code for nodes with both multi-core
CPUs and multiple GPUs. SpDISTAL generates distributed code that
achieves performance competitive with hand-written distributed
functions for specific sparse tensor algebra expressions and that
outperforms general interpretation-based systems by one to two orders
of magnitude.
Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence PDF
Michael Bauer, Elliott Slaughter, Sean Treichler, Wonchan Lee, Michael Garland and Alex Aiken
In Principles and Practices of Parallel Programming (PPoPP 2023)
Abstract: Implicitly parallel programming systems must solve
the joint problems of dependence analysis and coherence to ensure
apparently-sequential semantics for applications run on distributed
memory machines. Solving these problems in the presence of data-dependent
control flow and arbitrary aliasing is a challenge that most existing systems
eschew by compromising the expressivity of their programming models and/or
the performance of their implementations. We demonstrate a general class of
solutions to these problems via a reduction to the visibility problem from
computer graphics.
Legate Sparse: Distributed Sparse Computing in Python PDF
Rohan Yadav, Wonchan Lee, Melih Elibol, Taylor Lee Patti, Manolis Papadakis, Michael Garland, Alex Aiken, Fredrik Kjolstad and Michael Bauer
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2023)
Abstract: The sparse module of the popular SciPy Python library is
widely used across applications in scientific computing, data analysis
and machine learning. The standard implementation of SciPy is
restricted to a single CPU and cannot take advantage of modern
distributed and accelerated computing resources. We introduce Legate
Sparse, a system that transparently distributes and accelerates
unmodified sparse matrix-based SciPy programs across clusters of CPUs
and GPUs, and composes with cuNumeric, a distributed NumPy
library. Legate Sparse uses a combination of static and dynamic
techniques to efficiently compose independently written sparse and
dense array programming libraries, providing a unified Python
interface for distributed sparse and dense array computations. We show
that Legate Sparse is competitive with single-GPU libraries like CuPy
and achieves 65% of the performance of PETSc on up to 1280 CPU cores
and 192 GPUs of the Summit supercomputer, while offering the
productivity benefits of idiomatic SciPy and NumPy.
Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines PDF
Thiago S. F. X. Teixeira, Alexandra Henzinger, Rohan Yadav and Alex Aiken
In the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2023)
Abstract: In a parallel and distributed application, a mapping
is a selection of a processor for each computation or task and
memories for the data collections that each task accesses. Finding
high-performance mappings is challenging, particularly on
heterogeneous hardware with multiple choices for processors and
memories. We show that fast mappings are sensitive to the machine,
application, and input. Porting to a new machine, modifying the
application, or using a different input size may necessitate re-tuning
the mapping to maintain the best possible performance.
We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification. In contrast, hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41× faster than custom, hand-written mappers.
Speaking Pygion: Experiences Writing an Exascale Single Particle Imaging Code PDF
Seema Mirchandaney, Alex Aiken, and Elliott Slaughter
In the Workshop on Asynchronous Many-Task Systems and Applications (WAMTA 2024)
Abstract: The goal of the SpiniFEL project was to write, from
scratch, a single particle imaging code for exascale
supercomputers. The original vision was to have two versions of the
code, one in MPI and one in Pygion, a Python-based interface to the
Legion task-based runtime. We describe the motivation for the project,
some of the programming challenges we encountered along the way, what
worked and what didn’t, and why only the Pygion code eventually
succeeded in running at scale.
Automatic Tracing in Task-Based Runtime Systems PDF
Rohan Yadav, Michael Bauer, David Broman, Michael Garland, Alex Aiken, and Fredrik Kjolstad
In the Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2025)
Abstract: Implicitly parallel task-based runtime systems often perform
dynamic analysis to discover dependencies in and extract parallelism from
sequential programs. Dependence analysis becomes expensive as task granularity
drops below a threshold. Tracing techniques have been developed where programmers
annotate repeated program fragments (traces) issued by the application, and the
runtime system memoizes the dependence analysis for those fragments, greatly
reducing overhead when the fragments are executed again. However, manual trace
annotation can be brittle and not easily applicable to complex programs built
through the composition of independent components. We introduce Apophenia, a
system that automatically traces the dependence analysis of task-based runtime
systems, removing the burden of manual annotations from programmers and enabling
new and complex programs to be traced. Apophenia identifies traces dynamically
through a series of dynamic string analyses, which find repeated program
fragments in the stream of tasks issued to the runtime system. We show that
Apophenia is able to come between 0.92x–1.03x the performance of manually
traced programs, and is able to effectively trace previously untraced programs
to yield speedups of between 0.91x–2.82x on the Perlmutter and Eos supercomputers.
Composing Distributed Computations Through Task and Kernel Fusion PDF
Rohan Yadav, Shiv Sundram, Wonchan Lee, Michael Garland, Michael Bauer, Alex Aiken, and Fredrik Kjolstad
In the Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2025)
Abstract: We introduce Diffuse, a system that dynamically performs task
and kernel fusion in distributed, task-based runtime systems. The key component
of Diffuse is an intermediate representation of distributed computation that
enables the necessary analyses for the fusion of distributed tasks to be performed
in a scalable manner. We pair task fusion with a JIT compiler to fuse together the
kernels within fused tasks. We show empirically that Diffuse’s intermediate
representation is general enough to be a target for two real-world, task-based
libraries (cuPyNumeric and Legate Sparse), letting Diffuse find optimization
opportunities across function and library boundaries. Diffuse accelerates
unmodified applications developed by composing task-based libraries by 1.86x
on average (geo-mean), and by between 0.93x–10.7x on up to 128 GPUs. Diffuse
also finds optimization opportunities missed by the original application
developers, enabling high-level Python programs to match or exceed the
performance of an explicitly parallel MPI library.
Theses
Note: The following thesis is a thorough guide to the Legion programming model and covers many implementation details that are not documented elsewhere.
Legion: Programming Distributed Heterogeneous Architectures with Logical Regions PDF
Michael Edward Bauer
December 2014
Abstract: This thesis covers the design and implementation of
Legion, a new programming model and runtime system for targeting
distributed heterogeneous machine architectures. Legion introduces
logical regions as a new abstraction for describing the structure and
usage of program data. We describe how logical regions provide a
mechanism for applications to express important properties of program
data, such as locality and independence, that are often ignored by
current programming systems. We also show how logical regions allow
programmers to scope the usage of program data by different
computations. The explicit nature of logical regions makes these
properties of programs manifest, allowing many of the challenging
burdens of parallel programming, including dependence analysis and
data movement, to be off-loaded from the programmer to the programming
system.
Logical regions also improve the programmability and portability of applications by decoupling the specification of a program from how it is mapped onto a target architecture. Logical regions abstractly describe sets of program data without requiring any specification regarding the placement or layout of data. To control decisions about the placement of computations and data, we introduce a novel mapping interface that gives an application programmatic control over mapping decisions at runtime. Different implementations of the mapper interface can be used to port applications to new architectures and to explore alternative mapping choices. Legion guarantees that the decisions made through the mapping interface are independent of the correctness of the program, thus facilitating easy porting and tuning of applications to new architectures with different performance characteristics.
Using the information provided by logical regions, an implementation of Legion can automatically extract parallelism, manage data movement, and infer synchronization. We describe the algorithms and data structures necessary for efficiently performing these operations. We further show how the Legion runtime can be generalized to operate as a distributed system, making it possible for Legion applications to scale well. As both applications and machines continue to become more complex, the ability of Legion to relieve application developers of many of the tedious responsibilities they currently face will become increasingly important.
To demonstrate the performance of Legion, we port a production combustion simulation, called S3D, to Legion. We describe how S3D is implemented within the Legion programming model as well as the different mapping strategies that are employed to tune S3D for runs on different architectures. Our performance results show that a version of S3D running on Legion is nearly three times as fast as comparable state-of-the-art versions of S3D when run at 8192 nodes on the number two supercomputer in the world.
Realm: Performance Portability through Composable
Asynchrony PDF
Sean Jeffrey Treichler
December 2016
Abstract: Modern supercomputers are growing increasingly
complicated. The laws of physics have forced processor counts into the
thousands or even millions, resulted in the creation of deep
distributed memory hierarchies, and encouraged the use of multiple
processor and memory types in the same system. Developing an
application that is able to fully utilize such a system is very
difficult. The development of an application that is able to run well
on more than one such system with current programming models is so
daunting that it is generally not even attempted.
The Legion project attempts to address these challenges by combining a traditional hierarchical application structure (i.e. tasks/functions calling other tasks/functions) with a hierarchical data model (logical regions, which may be partitioned into subregions), and introducing the concept of mapping, a process in which the tasks and regions of a machine-agnostic description are assigned to the processors and memories of a particular machine.
This dissertation focuses on Realm, the “low-level” runtime that manages the execution of a mapped Legion application. Realm is a fully asynchronous event-based runtime. Realm operations are deferred by the runtime, returning an event that triggers upon completion of the operation. These events may be used as preconditions for other operations, allowing arbitrary composition of asynchronous operations. The resulting operation graph naturally exposes the available parallelism in the application as well as opportunities for hiding the latency of any required communication. While asynchronous task launches and non-blocking data movement are fairly common in existing programming models, Realm makes all runtime operations asynchronous — this includes resource management, performance feedback, and even, apparently paradoxically, synchronization primitives.
Important design and implementation issues of Realm will be discussed, including the novel generational event data structure that allows Realm to efficiently and scalably handle a very large number of events in a distributed environment and the machine model that provides the information required for the mapping of a Legion application onto a system. Realm anticipates dynamic behavior of both future applications and future systems and includes mechanisms for application-directed profiling, fault reporting, and dynamic code generation that further improve performance portability by allowing an application to adapt to and optimize for the exact system configuration used for each run.
Microbenchmarks demonstrate the efficiency and scalability of the Realm and justify some of the non-obvious design decisions (e.g. unfairness in locks). Experiments with several mini-apps are used to measure the benefit of a fully asynchronous runtime compared to existing “non-blocking” approaches. Finally, performance of Legion applications at full-scale show how Realm’s composable asynchrony and support for heterogeneity benefit the overall Legion system on a variety of modern supercomputers.
Regent: A High-Productivity Programming Language for Implicit Parallelism with Logical Regions PDF
Elliott Slaughter
August 2017
Abstract: Modern supercomputers are dominated by distributed-memory
machines. State of the art high-performance scientific applications
targeting these machines are typically written in low-level,
explicitly parallel programming models that enable maximal performance
but expose the user to programming hazards such as data races and
deadlocks. Conversely, implicitly parallel models isolate the user
from these hazards by providing easy-to-use sequential semantics and
place responsibility for parallelism and data movement on the
system. However, traditional implementations of implicit parallelism
suffer from substantial limitations: static, compiler-based
implementations restrict the programming model to exclude dynamic
features needed for unstructured applications, while dynamic,
runtime-based approaches suffer from a sequential bottleneck that
limits the scalability of the system.
We present Regent, a programming language designed to enable a hybrid static and dynamic analysis of implicit parallelism. Regent programs are composed of tasks (functions with annotated data usage). Program data is stored in regions (hierarchical collections); regions are dynamic, first-class values, but are named statically in the type system to ensure correct usage and analyzability of programs. Tasks may execute in parallel when they are mutually independent as determined by the annotated usage (read, write, etc.) of regions passed as task arguments. A Regent implementation is responsible for automatically discovering parallelism in a Regent program by analyzing the executed tasks in program order.
A naive implementation of Regent would suffer from a sequential bottleneck as tasks must be analyzed sequentially at runtime to discover parallelism, limiting scalability. We present an optimizing compiler for Regent which transforms implicitly parallel programs into efficient explicitly parallel code. By analyzing the region arguments to tasks, the compiler is able to determine the data movement implied by the sequence of task calls, even in the presence of unstructured and data-dependent application behavior. The compiler can then replace the implied data movement with explicit communication and synchronization for efficient execution on distributed-memory machines. We measure the performance and scalability of several Regent programs on large supercomputers and demonstrate that optimized Regent programs perform comparably to manually optimized explicitly parallel programs.
A Hybrid Approach to Automatic Program Parallelization via Efficient Tasking with Composable Data Partitioning PDF
Wonchan Lee
December 2019
Abstract: Despite the decades of research, distributed programming
is still a painful task and programming systems designed to improve
productivity fall short in practice. Auto-parallelizing compilers
simplify distributed programming by parallelizing sequential programs
automatically for distributed execution. However, their applicability
is severely limited due to the fundamental undecidability of their
static analysis problem. Runtime systems for implicit parallelism can
handle a broader class of programs via an expressive programming
model, but their runtime overhead often becomes a performance
bottleneck. To design a practical system for productive distributed
programming, one must combine the strengths of different
parallelization paradigms to overcome their weaknesses when used in
isolation.
This dissertation presents a hybrid approach to automatic program parallelization, which combines an auto-parallelizing compiler with an implicitly parallel tasking system. Our approach parallelizes programs in two steps. First, the auto-parallelizer materializes data parallelism in a program into task parallelism. Next, the tasking system dynamically analyzes dependencies between tasks and executes independent tasks in parallel. This two-stage process gives programmers a second chance when the auto-parallelizer “fails”: When a part of a program is not amenable to the compiler auto-parallelization, the programmer can gracefully fall back to the runtime parallelization by writing that part directly with task parallelism. Furthermore, hand-written tasks can be seamlessly integrated with the auto-parallelized part via composable data partitioning enabled by our auto-parallelizer, which allows them to share the partitioning strategy and thereby avoid excessive communication.
Key to the success of this hybrid approach is to minimize the overhead of the tasking system. To achieve this goal, we introduce dynamic tracing, a runtime mechanism for efficient tasking. The most expensive component in the tasking system is dynamic dependence analysis. Although this dynamic analysis is necessary when applications exhibit true dynamic behavior, the analysis is redundant for common cases where dependencies are (mostly) unchanging. Dynamic tracing eliminates this redundancy in dynamic dependence analysis by recording the dependence analysis of an execution trace and then replaying the recording for the subsequent occurrences of the same trace. To guarantee that a recording of a trace correctly replaces the trace’s original analysis, dynamic tracing also records memory locations that hold valid data when it records a trace and replays the recording only when those locations are still valid. Dynamic tracing significantly improves the efficiency of tasking, and thereby brings the strong scalability of explicit parallelism to implicit task parallelism.
Automated Discovery of Machine Learning Optimizations PDF
Zhihao Jia
August 2020
Abstract: The increasing complexity of machine learning (ML)
models and ML-specific hardware architectures makes it increasingly
challenging to build efficient and scalable ML systems. Today’s ML
systems heavily rely on human effort to optimize the deployment of ML
models on modern hardware platforms, which requires a tremendous
amount of engineering effort but only provides suboptimal runtime
performance. Moreover, the rapid evolution of ML models and
ML-specific hardware makes it infeasible to manually optimize
performance for all model and hardware combinations.
In this dissertation, we propose a search-based methodology to build performant ML systems by automatically discovering performance optimizations for ML computations. Instead of only considering the limited set of manually designed performance optimizations in current ML systems, our approach introduces a significantly more comprehensive search space of possible strategies to optimize the deployment of an ML model on a hardware platform. In addition, we design efficient search algorithms to explore the search space and discover highly-optimized strategies. The search is guided by a cost model for evaluating the performance of different strategies. We also propose a number of techniques to accelerate the search procedure by leveraging the topology of the search space.
This dissertation presents three ML systems that apply this methodology to optimize different tasks in ML deployment. Compared to current ML systems relying on manually designed optimizations, our ML systems enable better runtime performance by automatically discovering novel performance optimizations that are missing in current ML systems. Moreover, the performance improvement is achieved with less engineering effort, since the code needed for discovering these optimizations is much less than manual implementation of these optimizations.
First, we developed TASO, the first ML graph optimizer that automatically generates graph optimizations. TASO formally verifies the correctness of the generated graph optimizations using an automated theorem prover, and uses cost-based backtracking search to discover how to apply the verified optimizations. In addition to improving runtime performance and reducing engineering effort, TASO also provides correctness guarantees using formal methods.
Second, to generalize and go beyond today’s manually designed parallelization strategies for distributed ML computations, we introduce the SOAP search space, which contains a comprehensive set of possible strategies to parallelize ML computations by identifying parallelization opportunities across different Samples, Operators, Attributes, and Parameters. We developed FlexFlow, a deep learning engine that automatically searches over strategies in the SOAP search space. FlexFlow includes a novel execution simulator to evaluate the runtime performance of different strategies, and uses a Markov Chain Monte Carlo (MCMC) search algorithm to find performant strategies. FlexFlow discovers strategies that significantly outperform existing strategies, while requiring no manual effort during the search procedure.
Finally, we developed Roc, which automates data placement optimizations and minimizes data transfers in the memory hierarchy for large-scale graph neural network (GNN) computations. Roc formulates the task of optimizing data placement as a cost minimization problem and uses a dynamic programming algorithm to discover a globally optimal data management plan that minimizes data transfers between memories.
Scaling Implicit Parallelism with Index Launches PDF
Rupanshu Soi
December 2021
Abstract: Task-based programming systems are now widely used to
program modern supercomputers. Since these systems need to achieve
efficient scalable execution while being suitable for a wide variety
of application domains, their core abstraction, the task graph, needs
to have a scalable but expressive representation.
We present index launches, a technique, embedded in a general task-based framework, to collapse a task graph to obtain a more efficient representation. Leveraging a hybrid program analysis, we show how a compiler can support index launches by providing safety guarantees in virtually every case of practical importance.
In addition to the support for forall-style parallelism, we extend index launches in two primary directions. First, we introduce cross products, a novel abstraction for nested data partitioning, and describe how to incorporate them in index launches. Second, to enable index launches to represent limited forms of pipeline parallelism, we present ordered launches, and develop the program analysis required for an efficient implementation.
Our implementation of index launches is in Regent, a high-productivity programming language for distributed computing. Using the Piz Daint supercomputer, we evaluate the performance of several scientific applications written in Regent, and show that index launches lead to improved performance at the scales of contemporary and upcoming high-performance machines.