DO CONCURRENT mapping to OpenMP

This document describes the effort to parallelize do concurrent loops by mapping them to OpenMP worksharing constructs. The goals of this document are:

  • Describing how to instruct flang to map do concurrent loops to OpenMP constructs.

  • Tracking the current status of such mapping.

  • Describing the limitations of the current implementation and the next steps to address them.

  • Tracking the upstreaming status of the involved changes.

Usage

To enable do concurrent to OpenMP mapping, flang provides a new compiler flag: -fdo-concurrent-to-openmp. This flag has 3 possible values:

  1. host: maps do concurrent loops to run in parallel on the host CPU, emitting the equivalent of omp parallel do.

  2. device: maps do concurrent loops to run in parallel on a target device, emitting the equivalent of omp target teams distribute parallel do.

  3. none: disables do concurrent mapping altogether; such loops are emitted as sequential loops.

The -fdo-concurrent-to-openmp compiler switch is currently available only when OpenMP is also enabled, so you need to provide the following options to flang:

flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...

When mapping to the device, the target device architecture must be specified as well. See the -fopenmp-targets and --offload-arch options for more info.
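For example, the following invocation targets an AMD GPU; the target triple and architecture shown here are only illustrative, so substitute the ones for your device:

flang ... -fopenmp -fdo-concurrent-to-openmp=device -fopenmp-targets=amdgcn-amd-amdhsa --offload-arch=gfx90a ...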

Current status

Under the hood, do concurrent mapping is implemented in the DoConcurrentConversionPass. This is still an experimental pass, which means that:

  • It has been tested in a very limited way so far.

  • It has been tested mostly on simple synthetic inputs.

Loop nest detection

On the FIR dialect level, the following loop:

  do concurrent(i=1:n, j=1:m, k=1:o)
    a(i,j,k) = i + j + k
  end do

is modelled as a nest of fir.do_loop ops such that an outer loop’s region contains only the following:

  1. The operations needed to assign/update the outer loop’s induction variable.

  2. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:

  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
    %i_idx_2 = fir.convert %i_idx : (index) -> i32
    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
      %j_idx_2 = fir.convert %j_idx : (index) -> i32
      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
        %k_idx_2 = fir.convert %k_idx : (index) -> i32
        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

        ... loop nest body goes here ...
      }
    }
  }

This applies to multi-range loops in general; they are represented in the IR as a nest of fir.do_loop ops with the above nesting structure.

Therefore, the pass detects such “perfectly” nested loop ops to identify multi-range loops and map them as “collapsed” loops in OpenMP.
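In effect, the three-range example above is treated as if its ranges had been collapsed by the user. A rough OpenMP equivalent for host mapping (a sketch for illustration, not actual compiler output) is:

  !$omp parallel do collapse(3)
  do i = 1, n
    do j = 1, m
      do k = 1, o
        a(i,j,k) = i + j + k
      end do
    end do
  end do
  !$omp end parallel do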

Further info regarding loop nest detection

Loop nest detection is currently limited to the scenario described in the previous section; it can be extended in the future to cover more cases. At the moment, for the following loop nest, only the outer loop is parallelized even though both loops are perfectly nested:

do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do

Similarly, for the following loop nest, even though the intervening statement x = 41 does not have any memory effects that would affect parallelization, the nest is not parallelized as a whole either (only the outer loop is).

do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do

The above also has the consequence that the j variable will not be privatized in the OpenMP parallel/target region. In other words, it will be treated as if it were a shared variable. For more details about privatization, see the “Data environment” section below.

See flang/test/Transforms/DoConcurrent/loop_nest_test.f90 for more examples of what is and is not detected as a perfect loop nest.

Single-range loops

Given the following loop:

  do concurrent(i=1:n)
    a(i) = i * i
  end do

Mapping to host

Mapping this loop to the host generates MLIR operations of the following structure:

%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  // TODO Use delayed privatization.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
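For reference, the structure above corresponds to what flang would generate for the following directive-annotated loop (a conceptual sketch; the pass works on the do concurrent form directly and the IV is privatized as shown above):

!$omp parallel do
do i = 1, n
  a(i) = i * i
end do
!$omp end parallel do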

Mapping to device

Mapping the same loop to the device wraps it in the equivalent of omp target teams distribute parallel do; variables used inside the loop are mapped into the target region using map clauses, as described in the “Data environment” section below.

Multi-range loops

The pass currently supports multi-range loops as well. Given the following example:

   do concurrent(i=1:n, j=1:m)
       a(i,j) = i * j
   end do

The generated omp.loop_nest operation looks like:

omp.loop_nest (%arg0, %arg1)
    : index = (%17, %19) to (%18, %20)
    inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}

It is worth noting that we have privatized versions of both iteration variables: i and j. These are allocated locally inside the parallel/target OpenMP region, similar to what the single-range example in the previous section shows.

Data environment

By default, variables that are used inside a do concurrent loop nest are either treated as shared in case of mapping to host, or mapped into the target region using a map clause in case of mapping to device. The only exceptions to this are:

  1. the loop’s iteration variable(s) (IV) of perfect loop nests. In that case, for each IV, we allocate a local copy as shown by the mapping examples above.

  2. any values that come from allocations outside the loop nest but are used exclusively inside it. In such cases, a local privatized copy is created in the OpenMP region to prevent multiple teams of threads from accessing and destroying the same memory block, which would cause runtime issues. For an example of such cases, see flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90.

Implicit mapping detection (for mapping to the target device) is still quite limited, and work to make it smarter is underway both for OpenMP in general and for do concurrent mapping in particular.

Non-perfectly-nested loops’ IVs

For non-perfectly-nested loops, the IVs are still treated as shared or map entries as pointed out above. This might not be consistent with what the Fortran specification tells us. In particular, taking the following snippets from the spec (version 2023) into account:

§ 3.35

construct entity: entity whose identifier has the scope of a construct

§ 19.4

A variable that appears as an index-name in a FORALL or DO CONCURRENT construct […] is a construct entity. A variable that has LOCAL or LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity. […] The name of a variable that appears as an index-name in a DO CONCURRENT construct, FORALL statement, or FORALL construct has a scope of the statement or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO CONCURRENT construct has the scope of that construct.

From the above quotes, it seems there is an equivalence between the IV of a do concurrent loop and a variable with a LOCAL locality specifier (equivalent to OpenMP’s private clause). This means we should probably localize/privatize a do concurrent loop’s IV even if it is not perfectly nested in the nest we are parallelizing. For now, however, we do not do that, as pointed out previously. In the near future, we propose a middle-ground solution (see the “Next steps” section for more details).
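As an informal illustration of that equivalence (a sketch, not generated code), privatizing the IV would treat the following two loops alike:

! `i` is a construct entity, conceptually as if it had LOCAL locality:
do concurrent (i = 1:n)
  a(i) = i
end do

! ... which corresponds to a per-thread private copy in OpenMP terms:
!$omp parallel do private(i)
do i = 1, n
  a(i) = i
end do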

Next steps

This section describes some of the open questions/issues that are not tackled yet even in the downstream implementation.

Separate MLIR op for do concurrent

At the moment, both increment and concurrent loops are represented by one MLIR op: fir.do_loop; we differentiate concurrent loops with the unordered attribute. This is not ideal since the fir.do_loop op supports only a single iteration range. Consequently, to model multi-range do concurrent loops, flang emits a nest of fir.do_loop ops which we then have to detect in the OpenMP conversion pass in order to handle multi-range loops. Instead, it would be better to model multi-range concurrent loops using a separate op, which would make the IR more representative of the input Fortran code and also easier to detect and transform.
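For illustration only, such an op could carry all ranges of a multi-range loop directly; the op name and syntax below are hypothetical and do not exist in FIR today:

// Hypothetical: a single op modelling `do concurrent(i=1:n, j=1:m, k=1:o)`
// directly, instead of a nest of three fir.do_loop ops.
fir.do_concurrent (%i, %j, %k) = (%i_lb, %j_lb, %k_lb) to (%i_ub, %j_ub, %k_ub)
    step (%c1, %c1, %c1) {
  // loop nest body goes here
}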

Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target region. This is enough for our purposes right now since we don’t localize/privatize any sophisticated types of variables yet. Once we need more advanced localization through do concurrent’s locality specifiers (see below), delayed privatization will enable us to emit much cleaner IR. Once the upstream implementation of delayed privatization supports the constructs required by the pass, we will move to it rather than using inlined/early privatization.

Locality specifiers for do concurrent

Locality specifiers will enable the user to control the data environment of the loop nest in a more fine-grained way. Implementing these specifiers on the FIR dialect level is needed in order to support this in the DoConcurrentConversionPass.
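For reference, a loop using such a specifier looks like the following (the variable names are illustrative); LOCAL gives each iteration its own uninitialized copy of tmp:

! Each iteration gets its own uninitialized copy of `tmp`.
do concurrent (i = 1:n) local(tmp)
  tmp = 2 * a(i)
  b(i) = tmp + 1
end do

This corresponds closely to OpenMP’s private clause, which is why modelling it similarly to omp.private (see below) is attractive.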

Such specifiers will also unlock a potential solution to the non-perfectly-nested loops’ IVs issue described above. In particular, for a non-perfectly nested loop, one middle-ground proposal/solution would be to:

  • Emit the loop’s IV as shared/mapped just like we do currently.

  • Emit a warning that the IV of the loop is emitted as shared/mapped.

  • Given support for LOCAL, recommend that the user explicitly localize/privatize the loop’s IV if they choose to.

Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers on the IR level. Instead, something similar to early/eager privatization in OpenMP is done for the locality specifiers in fir.do_loop ops. Having locality specifiers modelled in a way similar to delayed privatization (i.e. the omp.private op) and reductions (i.e. the omp.declare_reduction op) would make mapping do concurrent to OpenMP (and other parallel programming models) much easier.

Therefore, one way to approach this problem is to extract the TableGen records for the relevant OpenMP clauses into a shared dialect for “data environment management” and use these shared records for OpenMP, do concurrent, and possibly OpenACC as well.

Supporting reductions

Similar to locality specifiers, mapping reductions from do concurrent to OpenMP is also still an open TODO. We can potentially extend the MLIR infrastructure proposed in the previous section to share reduction records among the different relevant dialects as well.
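For reference, Fortran 2023 adds a REDUCE locality specifier to do concurrent; a loop using it (with illustrative names) looks like the following, and would naturally map to an OpenMP reduction clause:

s = 0
! Partial results are combined across iterations using the `+` operation.
do concurrent (i = 1:n) reduce(+:s)
  s = s + a(i)
end do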

More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of two nested do concurrent loops prevents us from detecting them as a single loop nest. In some cases this is overly conservative. Therefore, a more flexible loop nest detection logic needs to be implemented.

Data-dependence analysis

Right now, we map loop nests without analysing whether doing so is safe. We probably need to at least warn the user about unsafe loop nests, i.e. those with loop-carried dependencies.
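For example, the following (illustrative) loop carries a dependence between iterations, so mapping it to a parallel worksharing loop would be unsafe; such a loop already violates do concurrent semantics, which is why a warning is the natural response:

! Unsafe: iteration `i` reads a(i-1), which iteration `i-1` writes.
do concurrent (i = 2:n)
  a(i) = a(i-1) + 1
end do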

Non-rectangular loop nests

So far, we have not needed to use the pass for non-rectangular loop nests. For example:

do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do

We defer this to the (hopefully) near future, when we get the conversion pass into good shape for the samples/projects at hand.

Generalizing the pass to other parallel programming models

Once we have a stable and capable do concurrent to OpenMP mapping, we can take this in a more generalized direction and allow the pass to target other models; e.g. OpenACC. This goal should be kept in mind from the get-go even while only targeting OpenMP.

Upstreaming status

  • [x] Command line options for flang and bbc.

  • [x] Conversion pass skeleton (no transformations happen yet).

  • [x] Status description and tracking document (this document).

  • [x] Loop nest detection to identify multi-range loops.

  • [ ] Basic host/CPU mapping support.

  • [ ] Basic device/GPU mapping support.

  • [ ] More advanced host and device support (expanded to multiple items as needed).