Stack arrays pass¶
Problem Description¶
In gfortran, -fstack-arrays
will cause all local arrays, including those of
unknown size, to be allocated from stack memory. Gfortran enables this flag by
default at -Ofast
.
Flang already allocates all local arrays on the stack (the
memory-allocation-opt
pass can move large allocations to the heap, but by
default it will not). But there are some cases where temporary arrays are
created on the heap by Flang. Moving these temporary arrays is the goal here.
Proposed Solution¶
One approach would be to write a pass which attempts to move heap allocations to
the stack (like the memory-allocation-opt
pass in reverse). This approach has
the advantage of a self-contained implementation and ensuring that newly added
cases are covered.
Another approach would be to modify each place in which arrays are allocated on
the heap to instead generate a stack allocation. The advantage of the second
approach is that it would be difficult to match up all fir.allocmem
operations
with their associated fir.freemem
. In general, the lifetimes of heap allocated
memory are not constrained by the current function’s stack frame and so cannot
be always converted to stack allocations. It is much easier to swap heap
allocations for stack allocations when they are first generated because the
lifetime information is conveniently available.
For example, to rewrite the heap allocation in the array-value-copy
pass with
a stack allocation using the first approach would require analysis to ensure
that the heap allocation is always freed before the function returns. This is
much more complex than never generating a heap allocation (and free) in the
first place (the second approach).
The plan is to take the more complex first approach so that newly added changes to lowering code do not need to be made to support the stack arrays option. The general problem of determining heap allocation lifetimes can be simplified in this case because only those allocations which are always freed before exit from the same function are possible to be moved to heap allocations in that function’s stack frame. Furthermore, the aim of this pass would be to only move those allocations generated by Flang, and so some of the more difficult cases can be avoided. Where the transformation is not possible, the existing heap allocation will be kept.
Implementation details overview¶
In order to limit the complexity of the proposed pass, it is useful to understand the situations in which Flang will generate heap allocations.
Known Heap Array Allocations¶
Flang allocates most arrays on the stack by default, but there are a few cases where temporary arrays are allocated on the heap:
flang/lib/Optimizer/Transforms/ArrayValueCopy.cpp
flang/lib/Optimizer/Transforms/MemoryAllocation.cpp
flang/lib/Lower/ConvertExpr.cpp
flang/lib/Lower/IntrinsicCall.cpp
flang/lib/Lower/ConvertVariable.cpp
Lowering code is being updated and in the future, temporaries for expressions
will be created in the HLFIR bufferization pass in
flang/lib/Optimizer/HLFIR/Trnasforms/BufferizeHLFIR.cpp
.
ArrayValueCopy.cpp
¶
Memory is allocated for a temporary array in allocateArrayTemp()
. This
temporary array is used to ensure that assignments of one array to itself
produce the required value. E.g.
integer, dimension(5), intent(inout) :: x
x(3,4) = x(1,2)
MemoryAllocation.cpp
¶
The default options for the Memory Allocation transformation ensure that no array allocations, no matter how large, are moved from the stack to the heap.
ConvertExpr.cpp
¶
ConvertExpr.cpp
allocates many array temporaries on the heap:
Passing array arguments by value or when they need re-shaping
Lowering elemental array expressions
Lowering mask expressions
Array constructors
The last two of these cases are not covered by the current stack arrays pass design.
The FIR code generated for mask expressions (the WHERE construct) sets a
boolean variable to indicate whether a heap allocation was necessary. The
allocation is only freed if the variable indicates that the allocation was
performed to begin with. The proposed dataflow analysis is not intelligent
enough to statically determine that the boolean variable will always be true
when the allocation is performed. Beyond this, the control flow in the generated
FIR code passes the allocated memory through fir.result
, resulting in a
different SSA value to be allocated and freed, causing the analysis not to
realise that the allocated memory is freed. The most convenient solution here
would be to generate less complicated FIR code, as the existing codegen has
known bugs: https://github.com/llvm/llvm-project/issues/56921,
https://github.com/llvm/llvm-project/issues/59803.
Code generated for array constructors uses realloc()
to grow the allocated
buffer because the size of the resulting array cannot always be determined
ahead of running the constructor. This makes this temporary unsuitable
for allocation on the stack.
IntrinsicCall.cpp
¶
The existing design is for the runtime to do the allocation and the lowering
code to insert fir.freemem
to remove the allocation. It is not clear whether
this can be replaced by adapting lowering code to do stack allocations and to
pass these to the runtime. This would be a significant change and so is out of
scope of -fstack-arrays
.
ConvertVariable.cpp
¶
See Fortran::lower::MapSymbolAttributes
:
Dummy arguments that are not declared in the current entry point require a skeleton definition. Most such “unused” dummies will not survive into final generated code, but some will. It is illegal to reference one at run time if it does. There are three ways these dummies can be mapped to a value:
a
fir::UndefOp
value: preferable but can’t be used for dummies with dynamic bounds or used to define another dummy.A stack slot. This has intermediate-weight cost but may not be usable for objects with dynamic bounds.
A heap box/descriptor is the heaviest weight option, only used for dynamic objects.
These heap allocations will be out of scope for -fstack-arrays
because the
existing implementation already uses stack allocations (or better) where
possible, and most such dummy arguments will be removed from generated code.
BufferizeHLFIR.cpp
¶
TODO
Detecting Allocations to Move¶
Allocations which could be moved to the stack will be detected by performing a
forward dense data flow analysis using mlir::dataflow::DenseForwardDataFlowAnalysis
.
This analysis will search for SSA values created by a fir.allocmem
which are
always freed using fir.freemem
within the same function.
Tracking allocations by SSA values will miss some cases where address calculations are duplicated in different blocks: resulting in different SSA values as arguments for the allocation and the free. It is believed that simple tracking of SSA values is sufficient for detecting the allocations for array temporaries added by Flang, because these temporaries should be simple arrays: not nested inside of derived types or other arrays.
Arrays allocated by an allocate
statement in source code should not be moved
to the stack. These will be identified by adding an attribute to these
fir.allocmem
operations when they are lowered. Intrinsics such as allocated
and move_alloc
do not need to be supported because the array temporaries moved
to the stack will never be visible in user code.
Only allocations for arrays will be considered for moving to the stack.
When conducting the dense data flow analysis, the pass will assume that existing allocations are not freed inside called functions. This is true for the existing heap array temporary allocations generated by Flang. If an allocation were to be freed inside of a called function, the allocation would presumably not also be freed later in the caller function (as this would be a double-free). Therefore the stack arrays pass would not find where the allocation was freed and so not transform the allocation. It is necessary to make this assumption so that the stack arrays pass can transform array allocations created for pass-by-value function arguments.
Moving Allocations¶
The fir.allocmem
will be replaced by a fir.alloca
with the same arguments.
The fir.freemem
will be removed entirely.
Where possible, the fir.alloca
should be placed in the function entry block
(so we can be sure it only happens once). This may not be possible if the array
has non-constant extents (causing the fir.alloca
to have SSA values as
operands). In this case, the fir.alloca
will be placed immediately after the
last operand becomes available.
If this location is inside a loop (either an mlir::LoopLikeOpInterface
or a
cyclic CFG), the transformation should attempt to use the llvm.stacksave
/
llvm.stackrestore
intrinsics to ensure that the stack does not grow on every
loop iteration. Use of these intrinsics is considered valid when the allocation
and deallocation occur within the same block and there are no other stack
allocations between them. If this is not the case, the existing heap allocation
will be preserved.
FIR attribute¶
A FIR attribute will be added to distinguish fir.allocmem
arising from
allocate
statements in source from fir.allocmem
operations added by Flang.
The attribute will be called "fir.must_be_heap"
and will have a boolean value:
true
meaning that the stack arrays pass must not move the allocation, false
meaning that stack arrays may move the allocation. Not specifying the attribute
will be equivalent to setting it to false
.
Testing Plan¶
FileCheck tests will be written to check each of the above identified sources of heap allocated array temporaries are detected and converted by the new pass.
Another test will check that allocate
statements in source code will not be
moved to the stack.