This repository contains a summary of my Google Summer of Code 2021 experience with the LLVM Compiler Infrastructure (as well as some of the current and future work).
- How this repository is organised
- Motivation
- Aims
- Current Results
- Contributions
- Links
- Future Work
- Acknowledgement
Since the changes I made are fundamental to LLVM, I had to work in my own fork of LLVM's trunk. All of the commits that are ready can be found in my development branch. Note that these commits have not been reviewed by the LLVM community yet and thus may be altered in the future.
This repository contains some of the scripts I used to automatically fix test failures and collect IR/assembly diffs. Moreover, it has draft versions of blog posts that I (together with my mentors) plan to publish on LLVM's blog. I hope that these posts will help make the case that LLVM needs a byte type, and that they will be useful to people who are new to LLVM and cannot find up-to-date tutorials/resources elsewhere.
Note: The aim of this document is to provide a summary of my GSoC experience, and not to describe implementation or design details, nor the semantics of a byte type and a `bytecast` - that is the purpose of the blog posts.
It is common for compilers, and LLVM in particular, to transform calls to `memcpy`, `memmove`, `memcmp` and `memset` into a number of integer loads/stores of the corresponding bit width. Afterwards, the load/store instructions can be optimized further. However, there are certain problems with this type-punning approach.
Firstly, the semantics of these functions specify that the memory is copied/compared/set as-is, byte by byte, including any padding bits. On the other hand, in LLVM IR padding is always poison, and loading poison bits makes the whole loaded value poison as well. If a call to `memcpy` or a similar function is substituted with an integer load/store pair, it is therefore possible that the copied value turns into poison after the transformation (e.g. https://alive2.llvm.org/ce/z/xoCTpH). This problem also affects other C++ functions that may be lowered to calls to the aforementioned functions and subsequently to integer load/store pairs (e.g. `uninitialized_copy`).
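As a rough sketch of this first problem (the value names and the `i32` width are made up for illustration; the Alive2 link above gives a precise counterexample):

```llvm
; Before: a 4-byte object that contains a padding byte is copied with memcpy,
; which copies the raw bytes as-is and is well defined even for padding.
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 4, i1 false)

; After the type-punning transformation: a single i32 load/store pair.
; Since padding is poison in LLVM IR, the loaded i32 becomes poison,
; and the store then clobbers the non-padding bytes of the destination too.
%val = load i32, i32* %src.i32
store i32 %val, i32* %dst.i32
```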
The second problem comes from the fact that `unsigned char*`, which is used in the definitions of `memcpy` and the other functions, can alias with any pointer. Hence, substituting a call to `memcpy` that copies a pointer with an integer load/store pair may fool the compiler into not seeing the escape of that pointer, thus breaking alias analysis and the soundness of further optimizations.
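A similarly simplified sketch of the second problem, again with made-up names:

```llvm
; Before: the 8 bytes of a pointer are copied through i8* / unsigned char*,
; so alias analysis conservatively treats this as a potential escape of the
; pointer stored in the source memory.
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 8, i1 false)

; After: the same copy as an i64 load/store pair. The pointer value now
; travels through a plain integer, and the compiler may no longer realise
; that the pointee is also reachable through the destination.
%ptr.bits = load i64, i64* %src.i64
store i64 %ptr.bits, i64* %dst.i64
```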
The underlying problem is that LLVM IR lowers `unsigned char`, and similarly the recently introduced `std::byte`, to `i8`. While both C and C++ define these types as handles to the raw bytes of objects, LLVM IR does not have a similar type. This means that compiler-introduced type punning can break alias analysis and miss an escaped pointer, as was reported in one of the bug reports here.
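To make the difference concrete, here is a hypothetical before/after of the lowering (the value names are illustrative; the `b8`/`bytecast` syntax follows the examples later in this document):

```llvm
; Today: an access through unsigned char* is indistinguishable from
; ordinary i8 integer arithmetic.
%v = load i8, i8* %p
store i8 %v, i8* %q

; With a byte type: the load/store carry the "raw byte" semantics, and an
; explicit bytecast is needed before the value can be used as an integer.
%b = load b8, b8* %p
%i = bytecast b8 %b to i8
store b8 %b, b8* %q
```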
In my proposal, I originally outlined the following aims for the project:
- Introduce a byte type to the LLVM ecosystem.
- Introduce a `bytecast` instruction to the LLVM ecosystem.
- Make sure that the new LLVM IR can be converted to bitcode and back.
- Implement a lowering of the byte type and `bytecast` to SelectionDAG.
- Fix code generation in Clang to produce a byte type for `char` and `unsigned char`.
- Implement new optimizations for `memcpy`, `memmove`, `memcmp` and `memset` to use bytes instead of integers.
- Make sure that everything works correctly at `O0`-`O3`.
- Analyze performance regressions, fix broken optimizations, implement new optimizations.
By the end of my Google Summer of Code project, I have achieved the following results:
- Both the byte type and the `bytecast` instruction were introduced to LLVM IR. I implemented a basic lowering for them in SelectionDAG (a byte is mapped to an integer, and `bytecast` is a no-op), and adapted the code generation in Clang to produce bytes for `unsigned char`/`char` and casts where necessary.
- The incorrect type-punning optimizations of `memcpy`, `memmove`, `memcmp` and `memset` were fixed. Moreover, I fixed all necessary optimizations to make sure that C/C++ programs can be compiled correctly at any optimization level.
- Performance regressions on the SPECrate 2017 benchmark suite were analysed and fixed. The only program with a pending regression (out of 10 programs) is `xalancbmk`. The table below summarises the performance regressions, where the regression is calculated as:
Regression = [(OldValue - NewValue) / NewValue] x 100%
| Program | Compile-time regression, % | Execution-time regression, % | Object file size regression, % |
|---|---|---|---|
| 500.perlbench_r | 0.38% | -0.88% | -0.98% |
| 502.gcc_r | 0.37% | 0.02% | -2.23% |
| 505.mcf_r | -5.64% | -0.17% | -0.19% |
| 520.omnetpp_r | -0.08% | -0.46% | -1.01% |
| 523.xalancbmk_r | 0.10% | -4.83% | -0.17% |
| 525.x264_r | 0.22% | -0.40% | -0.01% |
| 531.deepsjeng_r | 0.56% | 0.26% | -0.01% |
| 541.leela_r | 0.02% | -0.01% | -0.01% |
| 557.xz_r | 0.19% | -0.91% | 1.84% |
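For illustration of the formula above (with made-up timings): if the old compiler produced a binary that runs in 10.0 s and the new one in 10.5 s, the regression is (10.0 - 10.5) / 10.5 x 100% ≈ -4.8%, so under this convention negative execution-time values correspond to slowdowns, as with 523.xalancbmk_r.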
Below is the list of commits that I made during the project. I decided to group them logically into 4 groups: LLVM IR changes, Clang changes, changes in optimizations, and the rest of the changes.
- Clang Commits:
- LLVM IR Commits:
- Optimizations Commits:
  - [IR][Mem* fixes][1] Initial lowering for mem* calls
  - [GVN][SROA] Fixes for O2 level
  - [IR] Implemented bytecast (bitcast x) -> x
  - [Transforms] Fixed store value bitcast in InstCombine
  - [Transforms] Made bytecast vectorizable
  - [Transforms][SLP] Added bytecast support
  - [Analysis][TTI] Fixed cost analysis for inlining
  - [Transforms][LoopIdiom] Added conversion for bytes to memset
- Other Commits:
  - [OpenMP][IR] Fixes for OpenMP codegen and tests (Note that the changes in this commit will be reverted in the future when both byte and integer strings are allowed!)
The original proposal can be found here. The initial RFC for a byte type is available here.
During my project, I also kept a document where I tracked what had been done, the challenges I encountered, and my further plans. It is available here.
To keep track of performance numbers, I made a spreadsheet where I evaluated performance differences at different stages of the project.
I am going to continue working on this project. I plan the following contributions next, listed by priority:
Currently, old versions of LLVM and the version with a byte type are not compatible. The primary reason for this is that a string in my version is always a `b8` array, whereas in the old LLVM version it is an array of `i8`. This makes it impossible to parse global variables defined as:
```llvm
; New LLVM
@var = global constant [6 x b8] c"Hello\00"
; Old LLVM
@var = global constant [6 x i8] c"Hello\00"
```
The solution to this problem is to allow two kinds of strings: byte strings and integer strings. At the same time, Clang would produce a byte-type string only for the C and C++ frontends. For example, the LLVM IR above could become:
```llvm
; New LLVM
@var = global constant [6 x b8] b"Hello\00"
; Old LLVM
@var = global constant [6 x i8] c"Hello\00"
```
Moreover, `const` strings are read-only, and hence a pointer cannot escape through them. Therefore, these could be modelled as an array of `i8` values. The drawback of this approach is the non-uniformity of what can represent a string.
Clang has numerous test failures due to the new code generation for `char` and `unsigned char`. These include simple failures, such as a test checking for an `i8` instead of a `b8`, but also much more complicated cases which are hard to fix automatically:
- Some target-specific intrinsics also use `char` in their type signature. We need to ensure type compatibility.
- Currently, all tests involving `i8` strings are wrong. Addressing the byte/integer string issue described above would help to avoid that, adding byte strings only where needed.
- Some tests require the insertion of `bytecast`/`bitcast` pairs: this makes rewriting FileCheck directives harder (a hypothetical example is sketched below).
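For instance, a test update of the last kind could look roughly like this (the exact directives and value names are made up for illustration):

```llvm
; Old FileCheck expectations: a plain i8 store.
; CHECK: store i8 %[[VAL:.*]], i8* %[[PTR:.*]]

; New expectations: the value is now a byte, so a bitcast of the integer
; result has to be matched before the store.
; CHECK: %[[BYTE:.*]] = bitcast i8 %[[VAL:.*]] to b8
; CHECK-NEXT: store b8 %[[BYTE]], b8* %[[PTR:.*]]
```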
`xalancbmk` is the last program from the SPEC CPU benchmark suite with an unacceptable regression in execution time, of approximately 4-5%. Interestingly, the regression first appeared when vectorization of `bytecast`s was enabled. My latest hypothesis was that the `-loop-idiom` pass did not recognise a sequence of stores as a `memset`. As a result, the `memset` call was not created and the code was vectorized instead, leading to the performance regression. However, the fix did not improve the execution time. I plan to investigate the causes of the regression further and bring it down to 0%, as I did for the other programs.
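To illustrate that hypothesis (a simplified sketch with made-up names, not code taken from `xalancbmk`): ideally, `-loop-idiom` would recognise a loop like the one below and replace it with a single call to the `memset` intrinsic, rather than leaving it for the vectorizer:

```llvm
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %p = getelementptr inbounds b8, b8* %buf, i64 %i
  ; Every iteration stores the same byte value - the classic memset idiom.
  store b8 bitcast (i8 0 to b8), b8* %p, align 1
  %i.next = add nuw nsw i64 %i, 1
  %cmp = icmp ult i64 %i.next, %n
  br i1 %cmp, label %loop, label %exit
```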
I am going to write a series of blog posts covering my work, why LLVM needs a byte type, and my experience. These posts are intended to demystify how we can add a byte type to LLVM IR while keeping the changes minimal and preserving correctness. They also aim to serve as an example that fundamental changes to LLVM IR are possible and should not be rejected by the community just because "it is too much work". In addition to describing the necessary changes, these blog posts should serve as a valuable resource for junior LLVM developers and contributors (like me!) - describing the API, how to debug LLVM and identify the sources of performance regressions, and how certain optimizations work.
While the results of benchmarking seem promising, there are still a number of optimizations we can implement. For example:
- Propagating `bytecast`s through `phi` nodes

  To help Scalar Evolution, we can propagate `bytecast`s through `phi` nodes. This would then allow us to apply a `bytecast`, find the recurrences for an integer value, and finally bitcast the result back. The most obvious use case is a `bytecast` in a loop, which we want to hoist to the header if possible. For example, consider the following C code:

  ```c
  void foo(char* c, int n) {
    *c = 0;
    for (int i = 0; i < n; ++i)
      *c += 3;
  }
  ```
  The old lowering of `char` to `i8` would produce the following optimized IR for this function:

  ```llvm
  ; Code optimized to `*c = 3 * n;`
  define void @foo(i8* nocapture %c, i32 %n) {
  entry:
    %cmp = icmp sgt i32 %n, 0
    %0 = trunc i32 %n to i8
    %1 = mul i8 %0, 3
    %storemerge = select i1 %cmp, i8 %1, i8 0
    store i8 %storemerge, i8* %c, align 1
    ret void
  }
  ```
  However, the byte version of this program produces a different IR, with optimizations blocked at the stage of calculating the SCEVs of the induction values.
  ```llvm
  define void @foo(b8* %c, i32 %n) {
  entry:
    store b8 bitcast (i8 0 to b8), b8* %c, align 1
    %cmp3 = icmp slt i32 0, %n
    br i1 %cmp3, label %for.body.lr.ph, label %for.end

  for.body.lr.ph:
    %c.promoted = load b8, b8* %c, align 1
    br label %for.body

  for.body:
    %i.04 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
    ; SCEV does not see this recurrence because of the bytecast!
    %0 = phi b8 [ %c.promoted, %for.body.lr.ph ], [ %1, %for.inc ]
    %conv = bytecast b8 %0 to i8
    %add = add i8 %conv, 3
    %1 = bitcast i8 %add to b8
    br label %for.inc

  for.inc:
    %inc = add nuw nsw i32 %i.04, 1
    %cmp = icmp slt i32 %inc, %n
    br i1 %cmp, label %for.body, label %for.cond.for.end_crit_edge, !llvm.loop !6

  for.cond.for.end_crit_edge:
    %.lcssa = phi b8 [ %1, %for.inc ]
    store b8 %.lcssa, b8* %c, align 1
    br label %for.end

  for.end:
    ret void
  }
  ```
  The solution to this is to hoist the `bytecast` to the header, so that the loop body operates on integer values:

  ```llvm
  ; ...
  for.body.lr.ph:
    %c.promoted = load b8, b8* %c, align 1
    %c.bytecast = bytecast b8 %c.promoted to i8
    br label %for.body

  for.body:
    ; Now the loop body operates on integers only!
    %0 = phi i8 [ %c.bytecast, %for.body.lr.ph ], [ %add, %for.inc ]
    %i.04 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
    %add = add i8 %0, 3
    br label %for.inc

  for.inc:
    ; ...

  for.cond.for.end_crit_edge:
    %.lcssa = phi i8 [ %add, %for.inc ]
    %byte = bitcast i8 %.lcssa to b8
    store b8 %byte, b8* %c, align 1
  ```
  I have already implemented this optimization, but had to revert the commit due to some test failures. I plan to investigate what caused the problem in the future.
- Combining `bytecast`s

  Multiple `bytecast`s of the same byte within the same basic block can be combined. This can make the IR more readable.

  ```llvm
  ; Before
  %b = load b8, b8* %ptr
  %i1 = bytecast b8 %b to i8
  %i2 = bytecast b8 %b to i8
  use(%i1)
  use(%i2)

  ; After
  %b = load b8, b8* %ptr
  %i = bytecast b8 %b to i8
  use(%i)
  use(%i)
  ```
Alive2 does not recognise bytes yet. Adding a byte type to its type system will help to verify these transformations and reinforce the usefulness of this work.
I would like to thank my mentors Nuno Lopes and Juneyoung Lee for their guidance and support throughout the project. I have learnt a lot about Clang's code generation, developed a deeper understanding of LLVM IR semantics, and studied different optimizations, including but not limited to inlining, GVN, SROA, various loop transformations (LoopIdiom, LoopRotate, LICM, etc.) and SLP vectorization.
I would like to thank the members of Seoul National University, and Chung-Kil Hur in particular, for giving me access to an ARM machine so that I could run tests and SPEC CPU benchmarks.
Special thanks to all LLVM community members for their advice and comments.