
Legion

A Data-Centric Parallel Programming System


Partitioning

Continuing with our implementation of DAXPY, we illustrate how Legion enables applications to further express parallelism by partitioning logical regions into sub-regions and then launching tasks that can operate on different sub-regions in parallel.

Partitioning Index Spaces

The act of partitioning in Legion breaks the set of points represented by an index space into subsets of points, each of which becomes an index sub-space. In our DAXPY example we want to partition our two logical regions into num_subregions different sub-regions. (Note that num_subregions can now be controlled by the -b command-line parameter, which specifies the number of blocks to make.) To do this we must partition the common index space is upon which both logical regions are based. The partition we wish to create will be called ip for 'index partition' (line 62). To illustrate two different ways of creating partitions, we call two different overloads of the create_index_partition method on the HighLevelRuntime: one for the case where the total number of elements is NOT evenly divisible by the number of blocks, and one for the case where it is.

The first step in creating a partition is to create a Domain which describes the color space of the partition (the Domain must either be an unstructured index space or a 1D Rect). The purpose of a color space is to associate a single color (a point within the color space domain) with each index sub-space we wish to make. In this DAXPY example, we create a color_domain with a point for each of the desired blocks (lines 59-60; recall that Rect bounds are inclusive).

We first consider the case where the number of blocks does not evenly divide the number of elements in the index space is (lines 63-79). After creating the color space, we color the points in the index space we are partitioning to assign them to sub-regions. We use a DomainColoring object to record our coloring. A DomainColoring is a typedef of an STL map from Colors (unsigned integers) to Domain objects (the typedef can be found in legion_types.h along with other type declarations). We compute an (almost) even division of the elements across the sub-regions. For each sub-region we create a Domain describing its elements and place it in the coloring (lines 71-75).

Once we have computed our DomainColoring we are ready to create the IndexPartition. Creating the partition is done simply by invoking the create_index_partition method with the index space to partition (is), the color space Domain, the DomainColoring object, and a boolean indicating whether the partition is disjoint (lines 77-78). The return value is an IndexPartition, a handle for naming the index partition, similar to an index space handle. A partition is disjoint whenever every element in the original index space is assigned to at most one color. When compiled in debug mode, the Legion runtime will check the disjointness of partitions. In the next example, we'll see a case where a non-disjoint partition is useful. Note that partitions do not need to be total: applications can create partitions which color only a subset of the points in the index space (our partition in this example is total).

In the case where we know that the number of blocks evenly divides the number of elements in the is index space, we can use a productivity construct from the Arrays namespace. The Blockify type is a special type which supports an invertible "preimage" operation on Rect types and can be used to create an implicit coloring. We specify the number of elements to assign to each color, and the Legion runtime uses the Blockify object in conjunction with the original index space to compute a total coloring of the index space. Another variant of the overloaded create_index_partition method takes the index space to be partitioned along with the Blockify object and creates the new IndexPartition (lines 80-81).

Obtaining Logical Sub-Regions

While partitioning is performed on index spaces, the resulting index partitions and index sub-spaces implicitly induce corresponding logical partitions and logical sub-regions on all of the logical regions that were created using the original index space. For example, in our DAXPY application, the is index space was used to create both the input_lr and output_lr logical regions. Therefore, when we created the ip index partition of is we also automatically created the corresponding partitions for both the region trees rooted at input_lr and output_lr. (A quick performance note: the Legion runtime lazily instantiates the data structures for these region trees to avoid costly overheads when dealing with large numbers of partitions and sub-regions.) The following figure shows the resulting index space tree and two region trees for our DAXPY example:



Since the logical partitions and sub-regions are implicitly created, the application initially has no means for obtaining handles to these objects. The Legion runtime supports several ways of acquiring these handles. One example can be seen on line 84 where the application invokes the get_logical_partition method on an instance of the HighLevelRuntime. This method takes a logical region as well as an index partition of the index space used to create the logical region and then returns the corresponding LogicalPartition handle. Additionally, the runtime supports the get_logical_partition_by_color and get_logical_partition_by_tree methods, which provide other ways of obtaining LogicalPartition handles. The runtime also supports the corresponding methods get_logical_subregion, get_logical_subregion_by_color, and get_logical_subregion_by_tree for discovering the handles for logical sub-regions.

Projection Region Requirements

As in the previous DAXPY example, we now want to launch sub-tasks for initializing fields, performing the DAXPY computation, and checking correctness. To take advantage of the partitioning that was performed and increase parallelism we need to launch separate sub-tasks for each of the logical sub-regions that were created. As in an earlier example, we use IndexLauncher objects for launching an index space of tasks. However, unlike launching single tasks, we need a way to specify different RegionRequirement objects for each of the points in the index space of tasks. To accomplish this we use projection region requirements.

Projection region requirements provide a two-step mechanism for assigning a region requirement to each point task in an index space task launch. First, a projection region requirement names an upper bound on the privileges to be requested by the index space task; this upper bound can be either a logical region or a logical partition. Each point task in the index space of tasks must eventually request a logical region that is a (potentially non-strict) sub-region of the given upper bound. Second, a projection function is chosen which computes the sub-region for each point task in the index space of tasks. We now illustrate how these two aspects of projection region requirements work in our DAXPY example.

Projection region requirements are created using a different constructor for the RegionRequirement type. These constructors always begin by specifying either a logical region or a logical partition to place an upper bound on the data accessed, followed by a projection function ID (lines 93-94). The remaining arguments are the same as for other RegionRequirement constructors. In our DAXPY example we use the input_lp and output_lp logical partitions as upper bounds for our index space task launches, as each point task will use a sub-region of these partitions. Our projection region requirements also use projection ID zero to specify our projection function. The zero projection ID is a reserved ID which we describe momentarily. Applications can also register custom projection functions statically before starting the Legion runtime using the register_region_function and register_partition_function static methods on the HighLevelRuntime, similar to how tasks are registered.

The second step of using projection region requirements comes as the index space task is executed. When the runtime enumerates the Domain of index space points, it invokes the specified projection function on each point to compute the logical region requirement for that task. In the case of our DAXPY example, the reserved zero projection function computes a color from each task's point in the launch Domain and then uses that color to find the corresponding logical sub-region in the logical partition upper bound.

One requirement of using projection region requirements is that all the points within an index space task launch must be non-interfering with each other, either because they use disjoint logical regions or because they use non-interfering privileges (read-only, or reduce with the same reduction operator). Since Legion lazily enumerates index space launch domains depending on mapping decisions, violations of this aspect of the programming model will result in runtime error messages which may occur well after the task has been launched.

Finding Index Space Domains

For task implementations, the Legion runtime API provides a mechanism for determining the original Domain for an index space using the get_index_space_domain method. We use this method in all three of our sub-task implementations (lines 148, 174, and 198). Our task implementations can therefore determine the size of the domain to iterate over as part of the implementation. This allows us to register our tasks as being capable of being run as both single and index space tasks (lines 217-224).

Region Non-Interference

In this version of DAXPY, we see an example of how the Legion runtime can extract parallelism from tasks using region non-interference. Since each of the tasks in our index space task launches uses disjoint logical sub-regions, the Legion runtime can infer that these tasks can be run in parallel. The following figure shows the task dependence graph (TDG) computed for this version of DAXPY. (Note that we could also have parallelized the checking task if we so desired.)



This version of DAXPY demonstrates the power of the Legion programming model. By understanding the structure of program data, the runtime can extract parallelism from both field-level and region non-interference at the same time. Using both forms of non-interference to discover both task- and data-level parallelism is something that no other programming model we are aware of is capable of achieving.

Next Example: Multiple Partitions
Previous Example: Privileges

  1  #include <cstdio>
  2  #include <cassert>
  3  #include <cstdlib>
  4  #include "legion.h"
  5  using namespace LegionRuntime::HighLevel;
  6  using namespace LegionRuntime::Accessor;
  7  
  8  enum TaskIDs {
  9    TOP_LEVEL_TASK_ID,
 10    INIT_FIELD_TASK_ID,
 11    DAXPY_TASK_ID,
 12    CHECK_TASK_ID,
 13  };
 14  
 15  enum FieldIDs {
 16    FID_X,
 17    FID_Y,
 18    FID_Z,
 19  };
 20  
 21  void top_level_task(const Task *task,
 22                      const std::vector<PhysicalRegion> &regions,
 23                      Context ctx, HighLevelRuntime *runtime) {
 24    int num_elements = 1024;
 25    int num_subregions = 4;
 26    {
 27      const InputArgs &command_args = HighLevelRuntime::get_input_args();
 28      for (int i = 1; i < command_args.argc; i++)
 29      {
 30        if (!strcmp(command_args.argv[i],"-n"))
 31          num_elements = atoi(command_args.argv[++i]);
 32        if (!strcmp(command_args.argv[i],"-b"))
 33          num_subregions = atoi(command_args.argv[++i]);
 34      }
 35    }
 36    printf("Running daxpy for %d elements...\n", num_elements);
 37    printf("Partitioning data into %d sub-regions...\n", num_subregions);
 38  
 39    // Create our logical regions using the same schemas as earlier examples
 40    Rect<1> elem_rect(Point<1>(0),Point<1>(num_elements-1));
 41    IndexSpace is = runtime->create_index_space(ctx,
 42                            Domain::from_rect<1>(elem_rect));
 43    FieldSpace input_fs = runtime->create_field_space(ctx);
 44    {
 45      FieldAllocator allocator =
 46        runtime->create_field_allocator(ctx, input_fs);
 47      allocator.allocate_field(sizeof(double),FID_X);
 48      allocator.allocate_field(sizeof(double),FID_Y);
 49    }
 50    FieldSpace output_fs = runtime->create_field_space(ctx);
 51    {
 52      FieldAllocator allocator =
 53        runtime->create_field_allocator(ctx, output_fs);
 54      allocator.allocate_field(sizeof(double),FID_Z);
 55    }
 56    LogicalRegion input_lr = runtime->create_logical_region(ctx, is, input_fs);
 57    LogicalRegion output_lr = runtime->create_logical_region(ctx, is, output_fs);
 58  
 59    Rect<1> color_bounds(Point<1>(0),Point<1>(num_subregions-1));
 60    Domain color_domain = Domain::from_rect<1>(color_bounds);
 61  
 62    IndexPartition ip;
 63    if ((num_elements % num_subregions) != 0) {
 64      // Not evenly divisible
 65      const int lower_bound = num_elements/num_subregions;
 66      const int upper_bound = lower_bound+1;
 67      const int number_small = num_subregions - (num_elements % num_subregions);
 68      DomainColoring coloring;
 69      int index = 0;
 70      for (int color = 0; color < num_subregions; color++) {
 71        int num_elmts = color < number_small ? lower_bound : upper_bound;
 72        assert((index+num_elmts) <= num_elements);
 73        Rect<1> subrect(Point<1>(index),Point<1>(index+num_elmts-1));
 74        coloring[color] = Domain::from_rect<1>(subrect);
 75        index += num_elmts;
 76      }
 77      ip = runtime->create_index_partition(ctx, is, color_domain,
 78                                        coloring, true/*disjoint*/);
 79    } else {
 80      Blockify<1> coloring(num_elements/num_subregions);
 81      ip = runtime->create_index_partition(ctx, is, coloring);
 82    }
 83  
 84    LogicalPartition input_lp = runtime->get_logical_partition(ctx, input_lr, ip);
 85    LogicalPartition output_lp = runtime->get_logical_partition(ctx, output_lr, ip);
 86  
 87    Domain launch_domain = color_domain;
 88    ArgumentMap arg_map;
 89  
 90    IndexLauncher init_launcher(INIT_FIELD_TASK_ID, launch_domain,
 91                                TaskArgument(NULL, 0), arg_map);
 92    init_launcher.add_region_requirement(
 93        RegionRequirement(input_lp, 0/*projection ID*/,
 94                          WRITE_DISCARD, EXCLUSIVE, input_lr));
 95    init_launcher.region_requirements[0].add_field(FID_X);
 96    runtime->execute_index_space(ctx, init_launcher);
 97  
 98    init_launcher.region_requirements[0].privilege_fields.clear();
 99    init_launcher.region_requirements[0].instance_fields.clear();
100    init_launcher.region_requirements[0].add_field(FID_Y);
101    runtime->execute_index_space(ctx, init_launcher);
102  
103    const double alpha = drand48();
104    IndexLauncher daxpy_launcher(DAXPY_TASK_ID, launch_domain,
105                  TaskArgument(&alpha, sizeof(alpha)), arg_map);
106    daxpy_launcher.add_region_requirement(
107        RegionRequirement(input_lp, 0/*projection ID*/,
108                          READ_ONLY, EXCLUSIVE, input_lr));
109    daxpy_launcher.region_requirements[0].add_field(FID_X);
110    daxpy_launcher.region_requirements[0].add_field(FID_Y);
111    daxpy_launcher.add_region_requirement(
112        RegionRequirement(output_lp, 0/*projection ID*/,
113                          WRITE_DISCARD, EXCLUSIVE, output_lr));
114    daxpy_launcher.region_requirements[1].add_field(FID_Z);
115    runtime->execute_index_space(ctx, daxpy_launcher);
116  
117    TaskLauncher check_launcher(CHECK_TASK_ID, TaskArgument(&alpha, sizeof(alpha)));
118    check_launcher.add_region_requirement(
119        RegionRequirement(input_lr, READ_ONLY, EXCLUSIVE, input_lr));
120    check_launcher.region_requirements[0].add_field(FID_X);
121    check_launcher.region_requirements[0].add_field(FID_Y);
122    check_launcher.add_region_requirement(
123        RegionRequirement(output_lr, READ_ONLY, EXCLUSIVE, output_lr));
124    check_launcher.region_requirements[1].add_field(FID_Z);
125    runtime->execute_task(ctx, check_launcher);
126  
127    runtime->destroy_logical_region(ctx, input_lr);
128    runtime->destroy_logical_region(ctx, output_lr);
129    runtime->destroy_field_space(ctx, input_fs);
130    runtime->destroy_field_space(ctx, output_fs);
131    runtime->destroy_index_space(ctx, is);
132  }
133  
134  void init_field_task(const Task *task,
135                       const std::vector<PhysicalRegion> &regions,
136                       Context ctx, HighLevelRuntime *runtime) {
137    assert(regions.size() == 1);
138    assert(task->regions.size() == 1);
139    assert(task->regions[0].privilege_fields.size() == 1);
140  
141    FieldID fid = *(task->regions[0].privilege_fields.begin());
142    const int point = task->index_point.point_data[0];
143    printf("Initializing field %d for block %d...\n", fid, point);
144  
145    RegionAccessor<AccessorType::Generic, double> acc =
146      regions[0].get_field_accessor(fid).typeify<double>();
147  
148    Domain dom = runtime->get_index_space_domain(ctx,
149        task->regions[0].region.get_index_space());
150    Rect<1> rect = dom.get_rect<1>();
151    for (GenericPointInRectIterator<1> pir(rect); pir; pir++) {
152      acc.write(DomainPoint::from_point<1>(pir.p), drand48());
153    }
154  }
155  
156  void daxpy_task(const Task *task,
157                  const std::vector<PhysicalRegion> &regions,
158                  Context ctx, HighLevelRuntime *runtime) {
159    assert(regions.size() == 2);
160    assert(task->regions.size() == 2);
161    assert(task->arglen == sizeof(double));
162    const double alpha = *((const double*)task->args);
163    const int point = task->index_point.point_data[0];
164  
165    RegionAccessor<AccessorType::Generic, double> acc_x =
166      regions[0].get_field_accessor(FID_X).typeify<double>();
167    RegionAccessor<AccessorType::Generic, double> acc_y =
168      regions[0].get_field_accessor(FID_Y).typeify<double>();
169    RegionAccessor<AccessorType::Generic, double> acc_z =
170      regions[1].get_field_accessor(FID_Z).typeify<double>();
171    printf("Running daxpy computation with alpha %.8g for point %d...\n",
172            alpha, point);
173  
174    Domain dom = runtime->get_index_space_domain(ctx,
175        task->regions[0].region.get_index_space());
176    Rect<1> rect = dom.get_rect<1>();
177    for (GenericPointInRectIterator<1> pir(rect); pir; pir++) {
178      double value = alpha * acc_x.read(DomainPoint::from_point<1>(pir.p)) +
179                             acc_y.read(DomainPoint::from_point<1>(pir.p));
180      acc_z.write(DomainPoint::from_point<1>(pir.p), value);
181    }
182  }
183  
184  void check_task(const Task *task,
185                  const std::vector<PhysicalRegion> &regions,
186                  Context ctx, HighLevelRuntime *runtime) {
187    assert(regions.size() == 2);
188    assert(task->regions.size() == 2);
189    assert(task->arglen == sizeof(double));
190    const double alpha = *((const double*)task->args);
191    RegionAccessor<AccessorType::Generic, double> acc_x =
192      regions[0].get_field_accessor(FID_X).typeify<double>();
193    RegionAccessor<AccessorType::Generic, double> acc_y =
194      regions[0].get_field_accessor(FID_Y).typeify<double>();
195    RegionAccessor<AccessorType::Generic, double> acc_z =
196      regions[1].get_field_accessor(FID_Z).typeify<double>();
197    printf("Checking results...");
198    Domain dom = runtime->get_index_space_domain(ctx,
199        task->regions[0].region.get_index_space());
200    Rect<1> rect = dom.get_rect<1>();
201    bool all_passed = true;
202    for (GenericPointInRectIterator<1> pir(rect); pir; pir++) {
203      double expected = alpha * acc_x.read(DomainPoint::from_point<1>(pir.p)) +
204                             acc_y.read(DomainPoint::from_point<1>(pir.p));
205      double received = acc_z.read(DomainPoint::from_point<1>(pir.p));
206      if (expected != received)
207        all_passed = false;
208    }
209    if (all_passed)
210      printf("SUCCESS!\n");
211    else
212      printf("FAILURE!\n");
213  }
214  
215  int main(int argc, char **argv) {
216    HighLevelRuntime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
217    HighLevelRuntime::register_legion_task<top_level_task>(TOP_LEVEL_TASK_ID,
218        Processor::LOC_PROC, true/*single*/, false/*index*/);
219    HighLevelRuntime::register_legion_task<init_field_task>(INIT_FIELD_TASK_ID,
220        Processor::LOC_PROC, true/*single*/, true/*index*/);
221    HighLevelRuntime::register_legion_task<daxpy_task>(DAXPY_TASK_ID,
222        Processor::LOC_PROC, true/*single*/, true/*index*/);
223    HighLevelRuntime::register_legion_task<check_task>(CHECK_TASK_ID,
224        Processor::LOC_PROC, true/*single*/, true/*index*/);
225  
226    return HighLevelRuntime::start(argc, argv);
227  }