Explicit Ghost Regions

In our circuit simulation and our stencil example we demonstrated how Legion allows for applications to describe ghost regions using multiple partitions. There are both benefits and costs to describing ghost regions in this way. The primary benefit is the ease of programmability of describing ghost regions as another partition in an existing region tree. Consequently an application only needs to enumerate a stream of different sub-tasks using different regions, and the Legion runtime will automatically determine the communication patterns based on region usage.

However, there is also a cost to writing applications in this way. If the sub-tasks being launched are very fine-grained, the overhead of having Legion dynamically compute the communication pattern can become a sequential bottleneck. For cases such as these, as well as for cases where communication patterns are fairly simple (e.g. nearest neighbors ghost region exchange), it is often beneficial to structure applications with explicit ghost regions. Fortunately, the abstractions presented in Legion are sufficiently powerful to express these kinds of applications.

The rest of this page walks through how to restructure the stencil example in a way that uses explicit ghost regions. The code for this example can be found in the examples/full_ghost/ directory in the Legion repository. We begin by describing the high-level approach to writing our stencil computation. We then describe the individual features necessary for accomplishing this. Finally, we describe how all of the features compose to build our stencil application.

Algorithm Overview

In our original stencil example we relied on the top-level task to iterate through multiple steps and launch sub-tasks for computing the stencil on each of the different sub-logical regions. In our explicit ghost cell stencil example, we will instead launch different sub-tasks for computing each sub-region of our stencil and allow these sub-tasks to iterate through the loop independently. We will create explicit ghost regions that will enable these independent tasks to communicate while running in parallel. Another way to think about this is we are effectively emulating an SPMD programming model in Legion. However, unlike existing SPMD programming models like MPI and GASNet, in this one, our programming system will still be aware of how data is shared and communicated through logical regions.

Creating Explicit Ghost Regions

The first step in writing our stencil application with explicit logical regions is the same as for the original stencil application: we create index trees and region trees. Our approach this time will be slightly different. We begin by creating an index tree that describes our entire space of points called is in the code. We first partition this index space into disjoint sub-spaces: one for each sub-task that we plan to launch. In the code, the resulting IndexPartition is called disjoint_ip. We then iterate over each of these sub-spaces and recursively partition them into two more sub-spaces to describe the left and right ghost spaces. It is important to note at this point we have not created any logical regions yet, but simply described the decomposition of the index space of points. The following image illustrate the resulting index space tree.

After setting up our index space tree, we now create the explicit ghost regions from the leaf index spaces of the tree. Note that we do not actually create the logical regions for the sub-spaces in disjoint_ip until each of the sub-tasks are launched from the top-level task. Each sub-task will create its own logical region from its sub-space in disjoint_ip. The explicit ghost logical regions are stored in the ghost_left and ghost_right vectors. Having created our explicit ghost cell regions we can now launch off sub-tasks for computing stencils for each of our different sub-sets of points. We have suggestively named these sub-tasks spmd_tasks. Each spmd_task instance requests READ-WRITE privileges on both its left and right ghost regions as well READ-ONLY privileges on the left and right ghost regions from its adjacent neighbors for a total of four logical region requirements. This will allow each spmd_task to read and write its ghost regions as well as to read its neighbor ghost regions. In Legion applications that rely on EXCLUSIVE coherence, many of these sub-tasks would have explicit data dependences. However, we use a relaxed coherence mode called SIMULTANEOUS coherence to allow these tasks to run in parallel. We describe relaxed coherence modes next.

Relaxed Coherence Modes

When a task issues a stream of sub-task launches to Legion, Legion goes about analyzing these sub-tasks for data dependences based on their region requirements in program order (e.g. the order they were issued to the Legion runtime). Normally, region requirements are annotated with EXCLUSIVE coherence, which tells Legion that if there is a dependence between two tasks, it must be obeyed in keeping with program order execution.

However, there are often cases where this is too restrictive of a constraint. In some applications, tasks might have a data dependence, but only need serializability and not explicit program order execution. In others, the application might not want Legion to enforce any ordering, and instead will handle its own synchronization to data in a common logical region. To support these cases, Legion provides two relaxed coherence modes: ATOMIC and SIMULTANEOUS. ATOMIC coherence allows Legion to re-order tasks as long as access to a particular logical region is guaranteed to be serializable. SIMULTANEOUS instructs Legion to ignore any data dependences on logical regions with the guarantee that the application will manage access to the shared logical regions using its own synchronization primitives.

For our stencil application, we use SIMULTANEOUS coherence to instruct Legion to ignore data dependences between tasks accessing the same explicit ghost regions with the promise that the application will coordinate synchronization between them (which we describe momentarily). The use of SIMULTANEOUS coherence allows tasks to run in parallel despite data dependences. This is useful if there may need to be synchronization between our spmd_task instances, but in our case, we know for sure that there will need to be both communication and synchronization. We therefore want a stronger guarantee that our sub-tasks will run in parallel. We describe how we accomplish this in the next section.

Must Parallelism Launchers

When using relaxed coherence modes, in some cases applications may simply be executing under may-parallelism conditions where it is acceptable for tasks to run in parallel, but with no expectations that tasks be able to synchronize. In other cases, tasks might be in a must-parallelism scenario, where they must run in parallel and be capable of synchronizing in order to avoid hanging. Our stencil application is one example of a must-parallelism application since we know that we are going to need to explicitly exchange data between the different ghost regions of our spmd_task instances. Under these conditions it is imperative that the application be able to express the requirement that tasks execute in parallel and synchronize with each other. Legion provides a special kind of task launcher called a MustEpochLauncher that make this possible.

A MustEpochLauncher is actually a meta-launcher that simply contains other launcher objects. The idea is that instead of launching a bunch of tasks to the Legion runtime separately and hoping they run in parallel, applications can gather up a bunch of tasks (either a collection of individual tasks or one or more index tasks) inside of a MustEpochLauncher and then issue them as a single launch to the Legion runtime. The Legion runtime is then aware that all the tasks must be capable of executing in parallel and synchronizing with each other. Legion will first check that all of the region requirements for this set of tasks are non-interfering and therefore capable of running in parallel. Legion will also check any mapping decisions which might impact the ability of the tasks to run in parallel. For example, if two tasks in a must-parallelism epoch are mapped onto the same processor they will not be able to run in parallel and potentially synchronize with each other. To help avoid this case, Legion provides an explicit mapping call for mapping must-parallelism epochs map_must_epoch which we describe later. If there are any mapping decisions which would prevent the must-parallelism epoch, Legion issues a runtime error (as opposed to silently hanging).

Our stencil application creates a MustEpochLauncher and inserts a single TaskLauncher for each of the spmd_task instances we want to create. Since all of our region requirements ask for SIMULTANEOUS coherence, we know that there will be no dependences on logical regions. Furthermore, we rely on the implementation of map_must_epoch in the default mapper to ensure that the mapping decisions that are made will not prevent our spmd_task instances from running in parallel.

Explicit Copy Operations

After our spmd_task instances begin running in parallel, they start issuing sub-operations for computing the stencil computation (note we have them iterate over these operations several times in order to illustrate how applications might iterate through time steps in a simulation). Each spmd_task instance first launches a sub-task for initializing the data over which the stencil will be computed. After the data has been initialized, the spmd_task instances then need to exchange ghost cell data so that they can run the stencil sub-task. To exchange data through explicit ghost regions we use Legion’s explicit region-to-region copy operation.

CopyLauncher objects are used to perform an explicit copy of data between two different logical regions. The only requirement of the two logical regions is that they share a common index space tree. (Note this is why we created all of our explicit ghost logical regions from the same index space tree.) Just like all other operations in Legion, CopyLauncher objects use region requirements to name the logical regions that they will be copying between. Legion performs the necessary dependence analysis and will perform the copy when it is safe to do so. Copy operations have their own map_copy call which is invoked in order to determine where the source and destination region requirements are mapped.

In the case of the stencil computation, we use explicit copies to copy data from our local logical region local_lr which contains all of our data to our left and right neighbor explicit ghost regions. This will allow our neighbor spmd_task instances to see the necessary ghost cell data for performing their stencils. At the same time our neighbors will be doing the same operation so we will be able to observe their ghost cell data in our explicit ghost regions. The next obvious question is then how to know when it is safe to consume the data in our explicit ghost regions. We next describe how we perform synchronization between spmd_task instances in the next section.

Phase Barriers

When using SIMULTANEOUS coherence it is up to the application to properly synchronize access to a logical region. While applications are free to construct their own synchronization primitives, Legion also provides two useful synchronization primitives: reservations and phase barriers. Reservations provide an atomic synchronization primitive similar to locks, but capable of operating in a deferred execution environment. Phase barriers provide a producer-consumer synchronization mechanism that allow a set of producer operations to notify a set of consumer operations when data is ready. While both of these operations can be used directly, the common convention in Legion programs is to specify on launcher objects which reservations should be acquired/released and which phase barriers need to be waited on or triggered before and after an operation is executed. We now cover the use of phase barriers in more detail since they are used in our stencil application.

First, it is very important to realize that phase barriers are in no way related to traditional barriers in SPMD programming models such as MPI. Instead, phase barriers are a very light-weight producer-consumer synchronization mechanism. In some ways they are similar to phasers in X10 and named barriers in GPU computing. Phase barriers allow a dynamic number of consumers (possibly from different tasks) to be registered. Once all of these producers have finished running the generation of the barrier will be advanced. Consumers of the phase barrier wait on a particular generation. Only once the generation has been reached will the consumers be allowed to execute.

When a phase barrier is created, it must be told how many possible tasks will be registering producers and/or consumers with it, but the exact number of producers and producers can be dynamically determined. The number of tasks which may be registering producers or consumers is called the participants count. When it is executing, each participant task can launch as sub-operations which either arrive or wait on a specific phase barrier as it wants. Once it is done launching sub-operations that use a specific generation of the phase barrier, it then calls advance_phase_barrier to get the name of the phase barrier corresponding to the next generation. Phase barriers remain valid indefinitely (or until they exhaust the maximum number of generations, usually 2^32) unless they are explicitly deleted by the application.

For our stencil computation, we create two phase barriers for every explicit ghost region that we created. The reason for needing two is that we will need one for each spmd_task to be able to indicate when its copy has been completed, and we need one for allowing a consuming spmd_task to indicate it has consumed the data in the explicit ghost region and is ready for the next pass. We refer to these two barriers as the ready and empty barriers in the stencil code. We create all the necessary phase barriers in the top-level task, and then pass in the necessary phase barriers to each sub-task.

Acquire and Release Operations

The last feature that we need deals with how Legion manages regions which have been mapped with SIMULTANEOUS coherence. Since these physical instances are likely being accessed by multiple different tasks, it is unsound for Legion to allow sub-operations within these tasks to create other copies of the logical region in the memory hierarchy. (We note that Legion expresses this restriction by setting the restricted field in a RegionRequirement to true when invoking map_task or other related mapper calls.) Instead, if the application would like to make copies of the logical region locally within a task context, it must first issue an acquire operation on the logical region. This tells Legion that the application is promising that it has properly synchronized access to the logical region and it is safe to make copies. When the application is done launching sub-operations that may make copies it issues a release operation which then invalidates all the existing copies and makes the previous physical instance again the only valid copy of the data (flushing back any dirty data in the other instances as well). In some ways acquire and release operations are related to release consistency, but in many ways Legion is more flexible as Legion allows anyone to operate on any logical region consistent with its privileges at any time and acquire and release only apply to when the Legion runtime is permitted to make copy physical instances of a logical region.

Acquire and release operations are the same as other operations in Legion and are issued using AcquireLauncher and ReleaseLauncher objects. Instead of naming region requirements, acquire and release operations need only name the logical region and fields they are acquiring and releasing. These operations also need to provide the name of the original PhysicalRegion that the parent task mapped using SIMULTANEOUS coherence. Note that like other Legion operations, acquire and release are issued asynchronously and simply become part of the operation stream. This allows applications to issue other operations unrelated to the acquire and release operations and Legion will automatically figure out where dependences exist.

For our stencil computation, we surround our stencil task launch with acquire and release operations for both of our explicit ghost cell regions. This allows our stencil task to be mapped anywhere in the machine and for copies of our explicit ghost cell regions to be made for the stencil task to use. The release operation after then stencil task then invalidate these copies, restoring the original instance mapped by the spmd_task to be only valid copy of the data. This ensures that when the next copy copies from adjacent spmd_task instances that we will see the correct version of the data in our explicit ghost regions.

Putting Everything Together

Having described all of the features of our explicit ghost region version of the stencil computation, the following picture illustrates how the stencil computation works between a pair of spmd_task instances.

Each of spmd_task instances performs several iterations of the stencil computation. Each iteration issues a copy to exchange ghost cell data with its neighbor. Each copy will arrive at a phase barrier once the copy is complete indicating that the data is ready to be consumed. Each spmd_task also issues acquire operation which wait for the copy operation phase barrier to trigger before being ready. The acquire operations enable copies of the explicit ghost cell regions to be made anywhere in the machine, potentially allowing the stencil task to be executing on a GPU. Finally a release operation is issued that will run once the stencil task is completed. Each release operation then arrives at the other phase barrier for an explicit ghost region indicating that the ghost region is empty and ready to be filled again by the neighboring spmd_task. Note that by using explicit ghost regions with phase barriers and explicit acquire/release operations, each of the spmd_task instances can run in parallel and perform sub-operation launches in parallel, leveraging Legion’s ability to do independent analysis on sub-tasks.

Discussion

One obvious question that might be raised about this example is whether it is overly complex. When considering the alternative we believe that this example is actually relatively simple. If we attempted to do this in MPI, we would need to post asynchronous sends and receives and interleaving them with explicit cudaMemcpy operations to move data back and forth between the GPU. In our Legion version all communication is encoded in the same logical region framework, and no explicit mapping decisions are baked into the code making our version both portable and easily tunable. We believe this example further illustrates the ability of logical regions to abstract data usage and movement.

The remaining two sections describe briefly some of the details and potential extensions to this example.

Mapper Interactions

While not included in the code for this example, there are several important details regarding the mapping of this example that the DefaultMapper implements. First, the default mapper implements the map_must_epoch mapper call, which is similar to the map_task call except it is given a set of tasks that all need to be mapped. It is also given a collection of constraints which specify which logical regions in different tasks must be mapped to the same physical instance in order for tasks to be capable of running in parallel. (SIMULTANEOUS coherence only allows tasks to run in parallel if they use the same physical instance, otherwise the resulting behavior would be undefined. The same is true of ATOMIC coherence.) The default mapper then knows that each of the different tasks must be mapped onto a different target processor, and all of the logical regions with constraints should be assigned the same memory ranking. If for any reason these conditions are violated, the Legion runtime will detect that the tasks cannot run in parallel and therefore will issue a runtime error instead of allowing the application to silently hang.

The default mapper also checks the restricted flag on all region requirements in order to see if the current physical instance is the only one which the mapper will be permitted to map. In these cases there is always exactly one instance in the current_instances map which indicates which memory should be selected for the mapping. Failure to use this memory will result in a failed mapping.

Finally, it is normally expected that tasks will request accessors to all physical instances that they map and therefore these physical instances must be visible from the processor on which the task is going to run. In the case of our explicit ghost region code, this is not always the case. Each of spmd_task instances map the adjacent ghost cell regions with no intention of accessing them using accessors, but instead only with the intention of doing explicit region to region copies. To prevent the runtime from throwing an error because of this, we annotate the region requirements for these regions with the NO_ACCESS_FLAG in order to indicate that the tasks will not be making accessors for these logical regions.

Hiding More Latency: Double Buffering

Lastly, we note here the true potential of explicit logical regions. While this example only illustrates single buffering, applications are free to create multiple explicit ghost logical regions as well as additional phase barriers in order to implement multiple buffering. Using multiple buffering applications can hide even more communication latency by allowing Legion to execute even further into the future. We have yet to explore the full potential of this technique but we believe that it could make a considerable difference for many applications.