# `DO CONCURRENT` mapping to OpenMP
This document describes the effort to parallelize `do concurrent` loops by
mapping them to OpenMP worksharing constructs. The goals of this document are:

* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
  constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).
## Usage
In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:

* `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
  This maps such loops to the equivalent of `omp parallel do`.
* `device`: this maps `do concurrent` loops to run in parallel on a target
  device. This maps such loops to the equivalent of
  `omp target teams distribute parallel do`.
* `none`: this disables `do concurrent` mapping altogether. In that case, such
  loops are emitted as sequential loops.

The `-fdo-concurrent-to-openmp` compiler switch is currently available only
when OpenMP is also enabled. So you need to provide the following options to
flang in order to enable it:

```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
For mapping to device, the target device architecture must be specified as
well. See `-fopenmp-targets` and `--offload-arch` for more info.
## Current status
Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass which means
that:

* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.
### Loop nest detection
On the `FIR` dialect level, the following loop:

```fortran
do concurrent(i=1:n, j=1:m, k=1:o)
  a(i,j,k) = i + j + k
end do
```

is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
contains only the following:

1. The operations needed to assign/update the outer loop's induction variable.
2. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:
```mlir
fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
  %i_idx_2 = fir.convert %i_idx : (index) -> i32
  fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

  fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
    %j_idx_2 = fir.convert %j_idx : (index) -> i32
    fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

    fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
      %k_idx_2 = fir.convert %k_idx : (index) -> i32
      fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

      ... loop nest body goes here ...
    }
  }
}
```
This applies to multi-range loops in general; they are represented in the IR as
a nest of `fir.do_loop` ops with the above nesting structure. Therefore, the
pass detects such "perfectly" nested loop ops to identify multi-range loops and
map them as "collapsed" loops in OpenMP.
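Conceptually, the collapsed mapping corresponds to putting an OpenMP `collapse`
clause over the whole nest. A rough host-side directive equivalent of the
three-range example above might look like the following (illustrative sketch
only; the pass operates on MLIR and does not emit source-level directives):

```fortran
! Illustrative only: the rough directive-level equivalent of mapping the
! three-range `do concurrent` above to the host as one collapsed loop.
!$omp parallel do collapse(3)
do i = 1, n
  do j = 1, m
    do k = 1, o
      a(i,j,k) = i + j + k
    end do
  end do
end do
!$omp end parallel do
```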
#### Further info regarding loop nest detection
Loop nest detection is currently limited to the scenario described in the
previous section, but it can be extended in the future to cover more cases. At
the moment, for the following loop nest, even though both loops are perfectly
nested, only the outer loop is parallelized:

```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```
Similarly, for the following loop nest, even though the intervening statement
`x = 41` does not have any memory effects that would affect parallelization,
this nest is not parallelized either (only the outer loop is):

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```
The above also has the consequence that the `j` variable will not be privatized
in the OpenMP parallel/target region. In other words, it will be treated as if
it was a `shared` variable. For more details about privatization, see the
"Data environment" section below.
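Until the detection logic is extended, one workaround available to users is to
rewrite such a nest as a single multi-range header, which the pass does detect
and map as a collapsed loop (this only applies when there is no intervening
code between the headers):

```fortran
! Equivalent multi-range form of the first nest above; both ranges are
! then detected by the pass and mapped as one collapsed loop.
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```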
See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.
### Single-range loops
Given the following loop:

```fortran
do concurrent(i=1:n)
  a(i) = i * i
end do
```
#### Mapping to `host`
Mapping this loop to the `host` generates MLIR operations of the following
structure:
```mlir
%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  // TODO Use delayed privatization.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...

      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```
#### Mapping to `device`
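Based on the flag description above, mapping to `device` targets the equivalent
of the following directive form (an illustrative sketch only; actual device
mapping support is still being upstreamed, and codegen goes through the
corresponding `omp.target` MLIR ops rather than directives):

```fortran
! Illustrative sketch of the intended device mapping for the same
! single-range loop; `a` would be mapped into the target region.
!$omp target teams distribute parallel do map(tofrom: a)
do i = 1, n
  a(i) = i * i
end do
!$omp end target teams distribute parallel do
```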
### Multi-range loops
The pass currently supports multi-range loops as well. Given the following
example:

```fortran
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```
The generated `omp.loop_nest` operation looks like:

```mlir
omp.loop_nest (%arg0, %arg1)
  : index = (%17, %19) to (%18, %20)
  inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}
```
It is worth noting that we have privatized versions of both iteration
variables: `i` and `j`. These are locally allocated inside the parallel/target
OpenMP region, similar to what the single-range example in the previous section
shows.
### Data environment
By default, variables that are used inside a `do concurrent` loop nest are
either treated as `shared` in case of mapping to `host`, or mapped into the
`target` region using a `map` clause in case of mapping to `device`. The only
exceptions to this are:

1. The loop's iteration variable(s) (IV) of perfect loop nests. In that case,
   for each IV, we allocate a local copy as shown by the mapping examples
   above.
2. Any values that come from allocations outside the loop nest and are used
   exclusively inside of it. In such cases, a local privatized copy is created
   in the OpenMP region to prevent multiple teams of threads from accessing and
   destroying the same memory block, which causes runtime issues. For an
   example of such cases, see
   `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
Implicit mapping detection (for mapping to the target device) is still quite
limited and work to make it smarter is underway for both OpenMP in general and
`do concurrent` mapping.
#### Non-perfectly-nested loops' IVs
For non-perfectly-nested loops, the IVs are still treated as `shared` or `map`
entries as pointed out above. This might not be consistent with what the
Fortran specification tells us. In particular, taking the following snippets
from the spec (version 2023) into account:

> § 3.35
> construct entity
> entity whose identifier has the scope of a construct

> § 19.4
> A variable that appears as an index-name in a FORALL or DO CONCURRENT
> construct […] is a construct entity. A variable that has LOCAL or LOCAL_INIT
> locality in a DO CONCURRENT construct is a construct entity. […]
> The name of a variable that appears as an index-name in a DO CONCURRENT
> construct, FORALL statement, or FORALL construct has a scope of the statement
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
> CONCURRENT construct has the scope of that construct.
From the above quotes, it seems there is an equivalence between the IV of a
`do concurrent` loop and a variable with a `LOCAL` locality specifier
(equivalent to OpenMP's `private` clause). This means that we should probably
localize/privatize a `do concurrent` loop's IV even if it is not perfectly
nested in the nest we are parallelizing. For now, however, we do not do that as
pointed out previously. In the near future, we propose a middle-ground solution
(see the "Next steps" section for more details).
## Next steps
This section describes some of the open questions/issues that are not tackled yet even in the downstream implementation.
### Separate MLIR op for `do concurrent`
At the moment, both normal `do` and `do concurrent` loops are represented by
one MLIR op: `fir.do_loop`, where concurrent loops are differentiated with the
`unordered` attribute. This is not ideal since the `fir.do_loop` op supports
only single iteration ranges. Consequently, to model multi-range
`do concurrent` loops, `flang` emits a nest of `fir.do_loop` ops which we have
to detect in the OpenMP conversion pass to handle multi-range loops. Instead,
it would be better to model multi-range concurrent loops using a separate op
which makes the IR more representative of the input Fortran code and also
easier to detect and transform.
### Delayed privatization
So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now since we don't
localize/privatize any sophisticated types of variables yet. Once we need more
advanced localization through `do concurrent`'s locality specifiers (see
below), delayed privatization will enable us to have a much cleaner IR. Once
the upstream delayed privatization implementation supports the constructs
required by the pass, we will move to it rather than inlined/early
privatization.
### Locality specifiers for `do concurrent`
Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.
Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:

1. Emit the loop's IV as shared/mapped just like we do currently.
2. Emit a warning that the IV of the loop is emitted as shared/mapped.
3. Given support for `LOCAL`, we can recommend the user to explicitly
   localize/privatize the loop's IV if they choose to.
### Supporting reductions
Similar to locality specifiers, mapping reductions from `do concurrent` to
OpenMP is also still an open TODO. We can potentially extend the MLIR
infrastructure proposed in the previous section to share reduction records
among the different relevant dialects as well.
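For reference, Fortran 2023 adds a `REDUCE` locality specifier, which is the
natural source form such a mapping would consume (a sketch of the input side
only; the pass does not map this to OpenMP's `reduction` clause yet):

```fortran
! Fortran 2023 REDUCE locality specifier; mapping this to OpenMP's
! `reduction` clause is still an open TODO in the pass.
s = 0
do concurrent(i=1:n) reduce(+:s)
  s = s + a(i)
end do
```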
### More advanced detection of loop nests
As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops prevents us from detecting this as a loop nest. In some
cases this is overly conservative. Therefore, a more flexible detection logic
of loop nests needs to be implemented.
### Data-dependence analysis
Right now, we map loop nests without analysing whether such mapping is safe to do or not. We probably need to at least warn the user of unsafe loop nests due to loop-carried dependencies.
### Non-rectangular loop nests
So far, we did not need to use the pass for non-rectangular loop nests. For
example:

```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future when we get the conversion in good
shape for the samples/projects at hand.
### Generalizing the pass to other parallel programming models
Once we have a stable and capable `do concurrent` to OpenMP mapping, we can
take this in a more generalized direction and allow the pass to target other
models, e.g. OpenACC. This goal should be kept in mind from the get-go even
while only targeting OpenMP.
## Upstreaming status
- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [x] Loop nest detection to identify multi-range loops.
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as
  needed).