Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing optimizations in flat-array.sml benchmark #71

Open
MatthewFluet opened this issue Jul 22, 2014 · 3 comments
Open

Missing optimizations in flat-array.sml benchmark #71

MatthewFluet opened this issue Jul 22, 2014 · 3 comments

Comments

@MatthewFluet
Copy link
Member

The flat-array.sml benchmark was added to demonstrate the DeepFlatten SSA2 optimization.

However, the C and LLVM codegens, at sufficiently high optimization levels, are able to eliminate the arithmetic computation from the benchmark:

CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
RAM: 8GB
OS: Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-67-generic x86_64)
MLton: g0c2c0cd
GCC: 4.6.3
LLVM: 3.4.1

MLton0 -- mlton -codegen amd64 
MLton1 -- mlton -codegen amd64 -drop-pass deepFlatten
MLton2 -- mlton -codegen llvm -llvm-llc-opt -O1 -llvm-opt-opt -O1 
MLton3 -- mlton -codegen llvm -llvm-llc-opt -O1 -llvm-opt-opt -O1 -drop-pass deepFlatten
MLton4 -- mlton -codegen llvm -llvm-llc-opt -O1 -llvm-opt-opt -mem2reg -llvm-opt-opt -O1 
MLton5 -- mlton -codegen llvm -llvm-llc-opt -O1 -llvm-opt-opt -mem2reg -llvm-opt-opt -O1 -drop-pass deepFlatten
MLton6 -- mlton -codegen llvm -llvm-llc-opt -O2 -llvm-opt-opt -O2 
MLton7 -- mlton -codegen llvm -llvm-llc-opt -O2 -llvm-opt-opt -O2 -drop-pass deepFlatten
MLton8 -- mlton -codegen llvm -llvm-llc-opt -O2 -llvm-opt-opt -mem2reg -llvm-opt-opt -O2 
MLton9 -- mlton -codegen llvm -llvm-llc-opt -O2 -llvm-opt-opt -mem2reg -llvm-opt-opt -O2 -drop-pass deepFlatten
MLton10 -- mlton -codegen c -cc-opt -O1 
MLton11 -- mlton -codegen c -cc-opt -O1 -drop-pass deepFlatten
MLton12 -- mlton -codegen c -cc-opt -O2 
MLton13 -- mlton -codegen c -cc-opt -O2 -drop-pass deepFlatten
run time ratio
benchmark  MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13
flat-array   1.00   1.52   1.52   1.64   0.00   0.00   1.00   1.53   0.00   0.00    1.98    2.01    0.00    0.00
size
benchmark   MLton0  MLton1  MLton2  MLton3  MLton4  MLton5  MLton6  MLton7  MLton8  MLton9 MLton10 MLton11 MLton12 MLton13
flat-array 115,692 115,772 118,252 118,460 114,524 114,588 117,836 118,012 114,556 114,620 114,668 114,668 114,140 114,060
compile time
benchmark  MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13
flat-array   2.23   2.17   2.30   2.24   2.28   2.26   2.31   2.26   2.29   2.24    2.33    2.26    2.35    2.29
run time
benchmark  MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11 MLton12 MLton13
flat-array  34.93  53.12  53.10  57.36   0.00   0.02  34.84  53.48   0.00   0.02   69.33   70.36    0.00    0.02

Note that the LLVM and C codegens are able to optimize the computation whether or not the DeepFlatten SSA2 optimization is performed.

Presumably, the LLVM and C compilers are able to observe that the sum value is unused and the Vector.foldl loop must terminate.

While it would be simple to "fix" the benchmark by adding a dependency on the sum value (e.g., val _ = if 1105694191 <> sum then raise Fail "bug" else ()), it would be desirable to first capture this optimization via an SSA optimization.

@DanielRosenwasser
Copy link
Contributor

Would RemoveUnused be that optimization pass? Looks like the issue may partially be documented in the comments, but I don't know enough about the exact details.

@MatthewFluet
Copy link
Member Author

I'm not sure if either RemoveUnused or Useless could be improved to handle this case, though RemoveUnused would be more likely.

Useless actually removed useless components of data structures. The comment there isn't really relevant to the missed optimization in flat-array.sml, because the comment there is concerned with inter-procedural control and data flow, while in flat-array.sml, the elimination of the sum computation would arise after sufficient inlining to turn the Overflow raise/handle into an intra-procedural control flow.

@MatthewFluet
Copy link
Member Author

I believe that RemoveUnused could be improved to eliminate the computation of the sum value. Currently, the arguments of an Arith transfer and the test of a Case transfer are marked as used if the transfer is reachable (see remove-unused.fun#L474 and remove-unused.fun#L559). This is correct, but conservative.

A less conservative optimization would only mark the arguments of an Arith transfer and the test of a Case transfer as used if there is a used computation that has a control-dependence on the conditional branch (see Engineering a Compiler (Cooper & Torczon, Chapter 10.3.1) or Efficiently Computing Static Single Assignment Form and the Program Dependence Graph (Cytron et. al., Section 7.1)). However, marking the control-dependencies of a computation requires computing the reverse dominance frontier of the (block containing the) used computation. Also, it is not clear to me that the Cooper & Torczon and Cytron et. al. algorithms properly handle non-terminating, but conditional loops. I believe that one should mark the arguments of the conditional branches of reachable loop headers as useful. (Alternatively, this may be handled by adding dummy edges from loop headers to the exit node for the purpose of computing the reverse dominance frontier.)

MatthewFluet added a commit to MatthewFluet/mlton that referenced this issue Nov 5, 2019
The C and LLVM codegens, at sufficiently high optimization levels, are
able to eliminate the arithmetic computation from the benchmark.
Presumably, the LLVM and C compilers are able to observe that the
`sum` value is unused and the `Vector.foldl` loop must terminate.

See MLton#71.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants