[OpOptimization] Add BatchMatMul benchmark and [OpOptimization] Further optimize BatchMatMulBroadcast and add OpenMP tests #73

Status: Open — wants to merge 3 commits into base: main
29 changes: 29 additions & 0 deletions README.md
@@ -227,13 +227,42 @@
$ mkdir build && cd build
$ cmake -G Ninja .. \
-DCMAKE_BUILD_TYPE=RELEASE \
-DOP_OPTIMIZATION_BENCHMARKS=ON \
-DCMAKE_CXX_COMPILER=clang++ \
[Review comment — resolved] Currently we only need to use `clang++` in matmul-benchmark, so the modification here should be withdrawn.

-DBUDDY_MLIR_BUILD_DIR=/PATH/TO/BUDDY-MLIR/BUILD/
$ ninja <your target operation benchmark>

// Operation benchmarks supported include:
// - conv2d-nchw-fchw-benchmark
// - matmul-benchmark
```
### matmul-benchmark
`OpenMP` and `lld` LTO are required for matmul-benchmark. To ensure version compatibility with the project, it is recommended to use the LLVM toolchain built within the `buddy-benchmark` workflow. Follow the steps below:
- Build the LLVM toolchain with `lld` and `OpenMP`.
```
$ cd buddy-mlir/llvm/build
$ cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS="mlir;clang;lld;openmp" \
-DLLVM_TARGETS_TO_BUILD="host;RISCV" \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_RUNTIMES=all \
-DOPENMP_ENABLE_LIBOMPTARGET=OFF \
-DCMAKE_BUILD_TYPE=RELEASE
```
[Review comment — @xlinsist, Sep 20, 2023] If a specific construction process is required (in matmul-benchmark), please ensure that the construction process you provide is complete. For example, add the step `$ ninja`, since the default construction process only runs `$ ninja check-clang check-mlir`.

- Use the `clang++` in `buddy-mlir/llvm/build/bin`.
```
$ mkdir build && cd build
$ cmake -G Ninja .. \
-DCMAKE_BUILD_TYPE=RELEASE \
-DOP_OPTIMIZATION_BENCHMARKS=ON \
-DCMAKE_CXX_COMPILER=/PATH/TO/BUDDY-MLIR/BUILD/bin/clang++ \
[Review comment] `/PATH/TO/BUDDY-MLIR/BUILD/bin/clang++` -> `/PATH/TO/BUDDY-MLIR/llvm/build/bin/clang++`
-DBUDDY_MLIR_BUILD_DIR=/PATH/TO/BUDDY-MLIR/BUILD/
$ ninja matmul-benchmark
```
- `matmul-benchmark` needs to load `libomp.so` from `buddy-mlir/llvm/build/lib` to execute; here is a temporary workaround without root access.
[Review comment] Recommend rephrasing the description here like this: "To execute matmul-benchmark in buddy-benchmark/build/bin, the libomp.so file from buddy-mlir/llvm/build/lib needs to be loaded. Here is a temporary workaround without root access:"
[Review comment] (screenshot omitted) I have successfully constructed matmul-benchmark and verified the optimization, but there are some warnings when using clang++ with the arguments -no-pie and -fuse-ld=lld. Are you able to reproduce them on your machine, and can we eliminate them?
```
$ export LD_LIBRARY_PATH=/PATH/TO/BUDDY-MLIR/BUILD/lib/:$LD_LIBRARY_PATH
```

Run TVM operation optimization benchmark cases.
- Install TVM ([steps](./thirdparty/README.md#tvm)).
8 changes: 8 additions & 0 deletions benchmarks/OpOptimization/MatMul/BatchMatMul.mlir
@@ -0,0 +1,8 @@
module {
func.func @bm_batch_matmul(%a : memref<?x?x?xf32>, %b : memref<?x?x?xf32>, %c : memref<?x?x?xf32>) {
linalg.batch_matmul
ins(%a, %b: memref<?x?x?xf32>, memref<?x?x?xf32>)
outs(%c: memref<?x?x?xf32>)
return
}
}
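For reference, `linalg.batch_matmul` accumulates `C[b, i, j] += Σₚ A[b, i, p] · B[b, p, j]` over the leading batch dimension. A minimal C++ sketch of these semantics (the function name and flat row-major layout are illustrative, not part of the benchmark):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics of linalg.batch_matmul:
//   C[b][i][j] += sum over p of A[b][i][p] * B[b][p][j]
// A is batch x m x k, B is batch x k x n, C is batch x m x n, all row-major.
void batchMatmulScalar(const std::vector<float> &A, const std::vector<float> &B,
                       std::vector<float> &C, size_t batch, size_t m, size_t k,
                       size_t n) {
  for (size_t b = 0; b < batch; ++b)
    for (size_t i = 0; i < m; ++i)
      for (size_t p = 0; p < k; ++p) {
        float a = A[(b * m + i) * k + p];
        for (size_t j = 0; j < n; ++j)
          C[(b * m + i) * n + j] += a * B[(b * k + p) * n + j];
      }
}
```

Note that, like the MLIR op, C is accumulated into rather than overwritten, which is why the benchmark initializes the output memref to zero.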
46 changes: 46 additions & 0 deletions benchmarks/OpOptimization/MatMul/BatchMatMulBroadcast.mlir
@@ -0,0 +1,46 @@
// The MLIR prototype of batchmatmul-optimize in buddy-opt.

#map = affine_map<(d0) -> (d0 ceildiv STEP_PLACEHOLDER)>
#tail_len_map = affine_map<(d0) -> (d0 mod STEP_PLACEHOLDER)>
#if_set = affine_set<(d0)[s0] : (s0 - d0 * STEP_PLACEHOLDER >= STEP_PLACEHOLDER)>
#b_col_idx_tail_map = affine_map<(d0) -> (d0 * STEP_PLACEHOLDER)>

func.func @batch_matmul_broadcast_STEP_PLACEHOLDER(%a : memref<?x?x?xf32>, %b : memref<?x?x?xf32>, %c : memref<?x?x?xf32>) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%step = arith.constant STEP_PLACEHOLDER : index
%c0_f32 = arith.constant 0.0 : f32
%c0_f32_vec = vector.splat %c0_f32 : vector<STEP_PLACEHOLDERxf32>

%a_row = memref.dim %a, %c1 : memref<?x?x?xf32>
%a_col = memref.dim %a, %c2 : memref<?x?x?xf32>
%b_row = memref.dim %b, %c1 : memref<?x?x?xf32>
%b_col = memref.dim %b, %c2 : memref<?x?x?xf32>
%batch = memref.dim %a, %c0 : memref<?x?x?xf32>

%tail_len = affine.apply #tail_len_map(%b_col)
%mask_vec = vector.create_mask %tail_len : vector<STEP_PLACEHOLDERxi1>

  affine.parallel (%batch_idx) = (0) to (%batch) { // affine.parallel can be lowered to the omp dialect, which enables batch-level parallelization.
affine.prefetch %a[%batch_idx, %a_row, %a_col], read, locality<3>, data : memref<?x?x?xf32> // Explicitly prefetch, about 5% faster on X86.
affine.for %b_row_idx = 0 to %b_row {
affine.for %b_col_idx = 0 to #map(%b_col) {
%b_vec = affine.vector_load %b[%batch_idx, %b_row_idx, %b_col_idx * STEP_PLACEHOLDER] : memref<?x?x?xf32>, vector<STEP_PLACEHOLDERxf32>
%b_col_idx_tail = affine.apply #b_col_idx_tail_map(%b_col_idx)
affine.for %a_row_idx = 0 to %a_row {
%a_ele = affine.load %a[%batch_idx, %a_row_idx, %b_row_idx] : memref<?x?x?xf32>
%a_vec = vector.broadcast %a_ele : f32 to vector<STEP_PLACEHOLDERxf32>
%c_vec = affine.vector_load %c[%batch_idx, %a_row_idx, %b_col_idx * STEP_PLACEHOLDER] : memref<?x?x?xf32>, vector<STEP_PLACEHOLDERxf32>
%result_vec = vector.fma %a_vec, %b_vec, %c_vec : vector<STEP_PLACEHOLDERxf32>
affine.if #if_set(%b_col_idx)[%b_col] {
affine.vector_store %result_vec, %c[%batch_idx, %a_row_idx, %b_col_idx * STEP_PLACEHOLDER] : memref<?x?x?xf32>, vector<STEP_PLACEHOLDERxf32>
} else {
vector.maskedstore %c[%batch_idx, %a_row_idx, %b_col_idx_tail], %mask_vec, %result_vec : memref<?x?x?xf32>, vector<STEP_PLACEHOLDERxi1>, vector<STEP_PLACEHOLDERxf32>
}
}
}
}
}
return
}
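The kernel above broadcasts a single element of A across a STEP-wide vector, multiply-accumulates it against a vector slice of B's row, and falls back to a masked store when the N dimension is not a multiple of STEP. A C++ sketch of the same strategy (the names and the fixed STEP of 4 are illustrative; the MLIR tail mask is emulated with a scalar loop):

```cpp
#include <cstddef>
#include <vector>

constexpr size_t STEP = 4; // stand-in for the STEP_PLACEHOLDER vector width

// Broadcast-style batch matmul: for each (batch, p) pair, broadcast A[b][i][p],
// FMA it against STEP-wide slices of B's row p, and handle the N-dimension
// tail element-wise, mirroring vector.maskedstore in the MLIR kernel.
void batchMatmulBroadcast(const std::vector<float> &A,
                          const std::vector<float> &B, std::vector<float> &C,
                          size_t batch, size_t m, size_t k, size_t n) {
  size_t nMain = n - n % STEP; // widest prefix covered by full vectors
  for (size_t b = 0; b < batch; ++b)
    for (size_t p = 0; p < k; ++p)        // b_row_idx in the MLIR kernel
      for (size_t i = 0; i < m; ++i) {    // a_row_idx
        float a = A[(b * m + i) * k + p]; // element broadcast across the vector
        const float *bRow = &B[(b * k + p) * n];
        float *cRow = &C[(b * m + i) * n];
        for (size_t j = 0; j < nMain; j += STEP)
          for (size_t v = 0; v < STEP; ++v) // one vector.fma lane group
            cRow[j + v] += a * bRow[j + v];
        for (size_t j = nMain; j < n; ++j) // masked-store tail
          cRow[j] += a * bRow[j];
      }
}
```

The fixed-width inner loop is the scalar analogue of one `vector.fma`; with `-O3 -march=native` a compiler can typically vectorize it, which is the transformation the MLIR kernel makes explicit.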
81 changes: 80 additions & 1 deletion benchmarks/OpOptimization/MatMul/CMakeLists.txt
@@ -97,12 +97,88 @@
add_custom_command(OUTPUT matmul-scalar.o
add_library(MatMulScalar STATIC matmul-scalar.o)
set_target_properties(MatMulScalar PROPERTIES LINKER_LANGUAGE CXX)

add_custom_command(OUTPUT batch-matmul-scalar.o
COMMAND cat ${BUDDY_SOURCE_DIR}/benchmarks/OpOptimization/MatMul/BatchMatMul.mlir |
sed 's/bm_batch_matmul/batch_matmul_scalar/' |
${LLVM_MLIR_BINARY_DIR}/mlir-opt
-convert-linalg-to-loops
-lower-affine
-convert-scf-to-cf
-convert-vector-to-llvm
-finalize-memref-to-llvm
-convert-arith-to-llvm
-llvm-request-c-wrappers
-convert-func-to-llvm
-reconcile-unrealized-casts |
${LLVM_MLIR_BINARY_DIR}/mlir-translate --mlir-to-llvmir |
${LLVM_MLIR_BINARY_DIR}/llc -O3 -mtriple=${BUDDY_OPT_TRIPLE}
-mattr=${BUDDY_OPT_ATTR} --filetype=obj
-o ${BUDDY_BINARY_DIR}/../benchmarks/OpOptimization/MatMul/batch-matmul-scalar.o
)
add_library(BatchMatMulScalar STATIC batch-matmul-scalar.o)
set_target_properties(BatchMatMulScalar PROPERTIES LINKER_LANGUAGE CXX)

function(build_batch_matmul_broadcast step)
add_custom_command(OUTPUT batch-matmul-broadcast-${step}.o
COMMAND cat ${BUDDY_SOURCE_DIR}/benchmarks/OpOptimization/MatMul/BatchMatMul.mlir |
sed 's/bm_batch_matmul/batch_matmul_broadcast_${step}/g' |
${BUDDY_MLIR_BUILD_DIR}/bin/buddy-opt
-batchmatmul-optimize="vector-size=${step}"
-expand-strided-metadata
-affine-super-vectorize
-lower-affine
-convert-vector-to-llvm
-finalize-memref-to-llvm
-convert-scf-to-cf
-convert-linalg-to-llvm
-llvm-request-c-wrappers
-convert-func-to-llvm
-reconcile-unrealized-casts |
${LLVM_MLIR_BINARY_DIR}/mlir-translate --mlir-to-llvmir |
${LLVM_MLIR_BINARY_DIR}/llc -O3 -mtriple=${BUDDY_OPT_TRIPLE}
-mattr=${BUDDY_OPT_ATTR} --filetype=obj
-o ${BUDDY_BINARY_DIR}/../benchmarks/OpOptimization/MatMul/batch-matmul-broadcast-${step}.o
)
add_library(BatchMatMulBroadcast${step} STATIC batch-matmul-broadcast-${step}.o)
set_target_properties(BatchMatMulBroadcast${step} PROPERTIES LINKER_LANGUAGE CXX)
endfunction()

build_batch_matmul_broadcast(64)

function(build_batch_matmul_broadcast_omp step)
add_custom_command(OUTPUT batch-matmul-broadcast-${step}-omp.o
COMMAND cat ${BUDDY_SOURCE_DIR}/benchmarks/OpOptimization/MatMul/BatchMatMulBroadcast.mlir |
sed 's/batch_matmul_broadcast_STEP_PLACEHOLDER/batch_matmul_broadcast_STEP_PLACEHOLDER_omp/g' |
sed 's/STEP_PLACEHOLDER/${step}/g' |
${BUDDY_MLIR_BUILD_DIR}/bin/buddy-opt
-expand-strided-metadata
-affine-super-vectorize
-lower-affine
-convert-scf-to-openmp
-convert-vector-to-llvm
-finalize-memref-to-llvm
-convert-scf-to-cf
-convert-linalg-to-llvm
-llvm-request-c-wrappers
-convert-openmp-to-llvm
-convert-func-to-llvm
-reconcile-unrealized-casts |
${LLVM_MLIR_BINARY_DIR}/mlir-translate --mlir-to-llvmir |
${CMAKE_CXX_COMPILER} -c -x ir -O3 --target=${BUDDY_OPT_TRIPLE} -fopenmp -march=native -flto
-o ${BUDDY_BINARY_DIR}/../benchmarks/OpOptimization/MatMul/batch-matmul-broadcast-${step}-omp.o -
)
add_library(BatchMatMulBroadcast${step}OMP STATIC batch-matmul-broadcast-${step}-omp.o)
set_target_properties(BatchMatMulBroadcast${step}OMP PROPERTIES LINKER_LANGUAGE CXX)
endfunction()

build_batch_matmul_broadcast_omp(64)

add_executable(matmul-benchmark
Main.cpp
MatMulBenchmark.cpp
)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -fopenmp -flto -fuse-ld=lld")

target_link_libraries(matmul-benchmark
GoogleBenchmark
@@ -114,4 +190,7 @@ target_link_libraries(matmul-benchmark
MatMulBroadcast128
MatMulBroadcast256
MatMulScalar
BatchMatMulScalar
BatchMatMulBroadcast64
BatchMatMulBroadcast64OMP
)
6 changes: 4 additions & 2 deletions benchmarks/OpOptimization/MatMul/Main.cpp
@@ -20,13 +20,15 @@

#include <benchmark/benchmark.h>

void verification();
void matmul_verification();
void batch_matmul_verification();

int main(int argc, char **argv) {
// Run benchmark.
::benchmark::Initialize(&argc, argv);
::benchmark::RunSpecifiedBenchmarks();
// Run correctness verification.
verification();
matmul_verification();
batch_matmul_verification();
return 0;
}
117 changes: 109 additions & 8 deletions benchmarks/OpOptimization/MatMul/MatMulBenchmark.cpp
@@ -18,6 +18,7 @@
//
//===----------------------------------------------------------------------===//

#include <array>
#include <benchmark/benchmark.h>
#include <buddy/Core/Container.h>
#include <iostream>
@@ -27,6 +28,10 @@
#define M 64
#define N 3136
#define K 576
#define BATCH_M 128
#define BATCH_N 784
#define BATCH_K 72
#define BATCH 16

// Helper functions and variables.
namespace {
@@ -62,6 +67,14 @@ void _mlir_ciface_matmul_broadcast_256(MemRef<float, 2> *A, MemRef<float, 2> *B,
MemRef<float, 2> *C);
void _mlir_ciface_matmul_scalar(MemRef<float, 2> *A, MemRef<float, 2> *B,
MemRef<float, 2> *C);
void _mlir_ciface_batch_matmul_scalar(MemRef<float, 3> *A, MemRef<float, 3> *B,
MemRef<float, 3> *C);
void _mlir_ciface_batch_matmul_broadcast_64(MemRef<float, 3> *A,
MemRef<float, 3> *B,
MemRef<float, 3> *C);
void _mlir_ciface_batch_matmul_broadcast_64_omp(MemRef<float, 3> *A,
MemRef<float, 3> *B,
MemRef<float, 3> *C);
}

#define DEFINE_MATMUL_BENCHMARK(name, func) \
@@ -79,6 +92,21 @@ void _mlir_ciface_matmul_scalar(MemRef<float, 2> *A, MemRef<float, 2> *B,
} \
}

#define DEFINE_BATCH_MATMUL_BENCHMARK(name, func) \
void BM_BATCH_MATMUL_##name(benchmark::State &state) { \
intptr_t sizesA[3] = {BATCH, BATCH_M, BATCH_K}; \
intptr_t sizesB[3] = {BATCH, BATCH_K, BATCH_N}; \
intptr_t sizesC[3] = {BATCH, BATCH_M, BATCH_N}; \
\
MemRef<float, 3> A(sizesA, 1.0); \
MemRef<float, 3> B(sizesB, 1.0); \
MemRef<float, 3> C(sizesC, 0); \
\
for (auto _ : state) { \
func(&A, &B, &C); \
} \
}

DEFINE_MATMUL_BENCHMARK(OCV, _mlir_ciface_matmul_ocv)
DEFINE_MATMUL_BENCHMARK(TRANSFORM, _mlir_ciface_matmul_transform)
DEFINE_MATMUL_BENCHMARK(BROADCAST_16, _mlir_ciface_matmul_broadcast_16)
@@ -87,6 +115,11 @@ DEFINE_MATMUL_BENCHMARK(BROADCAST_64, _mlir_ciface_matmul_broadcast_64)
DEFINE_MATMUL_BENCHMARK(BROADCAST_128, _mlir_ciface_matmul_broadcast_128)
DEFINE_MATMUL_BENCHMARK(BROADCAST_256, _mlir_ciface_matmul_broadcast_256)
DEFINE_MATMUL_BENCHMARK(SCALAR, _mlir_ciface_matmul_scalar)
DEFINE_BATCH_MATMUL_BENCHMARK(SCALAR, _mlir_ciface_batch_matmul_scalar)
DEFINE_BATCH_MATMUL_BENCHMARK(BROADCAST_64,
_mlir_ciface_batch_matmul_broadcast_64)
DEFINE_BATCH_MATMUL_BENCHMARK(BROADCAST_64_OMP,
_mlir_ciface_batch_matmul_broadcast_64_omp)
} // namespace

// Register benchmark cases.
@@ -98,15 +131,18 @@ BENCHMARK(BM_MATMUL_BROADCAST_32)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_MATMUL_BROADCAST_64)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_MATMUL_BROADCAST_128)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_MATMUL_BROADCAST_256)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_BATCH_MATMUL_SCALAR)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_BATCH_MATMUL_BROADCAST_64)->Unit(benchmark::kMillisecond);
BENCHMARK(BM_BATCH_MATMUL_BROADCAST_64_OMP)->Unit(benchmark::kMillisecond);

/// Correctness Verification
/// The verification does not affect the performance.
/// - Set the scalar case as the criteria.
/// - Input elements are random numbers.
/// - Output elements are initialized to zero.
/// - Compare the output of various optimizations with the scalar version to
/// verify correctness.
void verification() {
// Correctness Verification
// The verification does not affect the performance.
// - Set the scalar case as the criteria.
// - Input elements are random numbers.
// - Output elements are initialized to zero.
// - Compare the output of various optimizations with the scalar version to
// verify correctness.
void matmul_verification() {
// Set the random number generator.
std::random_device rd;
std::mt19937 generator(rd());
@@ -209,3 +245,68 @@ std::cout << "-----------------------------------------------------------"
std::cout << "-----------------------------------------------------------"
<< std::endl;
}

void batch_matmul_verification() {
// Set the random number generator.
std::random_device rd;
std::mt19937 generator(rd());
std::uniform_int_distribution<int> distribution(1, 100);

// Set the layout sizes of input and output memref container.
intptr_t sizesA[3] = {BATCH, BATCH_M, BATCH_K};
intptr_t sizesB[3] = {BATCH, BATCH_K, BATCH_N};
intptr_t sizesC[3] = {BATCH, BATCH_M, BATCH_N};

// Generate input A and input B memref container with random numbers.
const int inputASize = BATCH * (BATCH_M) * (BATCH_K);
// Allocate input A on the heap; a stack array of this size could overflow the stack.
auto inputARand = new std::array<float, inputASize>();
for (int i = 0; i < inputASize; ++i) {
(*inputARand)[i] = distribution(generator);
}
MemRef<float, 3> inputAMemRef(inputARand->data(), sizesA);

const int inputBSize = BATCH * (BATCH_K) * (BATCH_N);
// Allocate input B on the heap for the same reason as input A.
auto inputBRand = new std::array<float, inputBSize>();
for (int i = 0; i < inputBSize; ++i) {
(*inputBRand)[i] = distribution(generator);
}
MemRef<float, 3> inputBMemRef(inputBRand->data(), sizesB);

// Generate output memref container with zero.
const int outputSize = BATCH * (BATCH_M) * (BATCH_N);
MemRef<float, 3> outputScalar(sizesC, 0);
MemRef<float, 3> outputBroadcast64(sizesC, 0);
MemRef<float, 3> outputBroadcast64OMP(sizesC, 0);

// Perform all the matmul implementation.
_mlir_ciface_batch_matmul_scalar(&inputAMemRef, &inputBMemRef, &outputScalar);
_mlir_ciface_batch_matmul_broadcast_64(&inputAMemRef, &inputBMemRef,
&outputBroadcast64);
_mlir_ciface_batch_matmul_broadcast_64_omp(&inputAMemRef, &inputBMemRef,
&outputBroadcast64OMP);

// Get the result array.
auto resultScalar = outputScalar.getData();
auto resultBroadcast64 = outputBroadcast64.getData();
auto resultBroadcast64OMP = outputBroadcast64OMP.getData();

// Print the verification result.
std::cout << "Batch Matmul Broadcast 64 case: "
<< (areArraysEqual(resultScalar, resultBroadcast64,
outputSize / BATCH)
? PASS
: FAIL)
<< std::endl;

std::cout << "Batch Matmul Broadcast 64 OpenMP case: "
<< (areArraysEqual(resultScalar, resultBroadcast64OMP,
outputSize / BATCH)
? PASS
: FAIL)
<< std::endl;

std::cout << "-----------------------------------------------------------"
<< std::endl;
}