From 174489db58a0952db7683dd6c1832fa48090dd5a Mon Sep 17 00:00:00 2001
From: Jan Eitzinger <jan@moebiusband.org>
Date: Wed, 13 Nov 2024 10:42:40 +0100
Subject: [PATCH] Update and extend README

---
 README.md | 251 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 191 insertions(+), 60 deletions(-)

diff --git a/README.md b/README.md
index f08ae407..f96e1ba0 100644
--- a/README.md
+++ b/README.md
@@ -4,88 +4,219 @@
 |-----------------------------------------|
 
 MD-Bench is a toolbox for the performance engineering of short-range force
-calculation kernels on molecular-dynamics applications. It aims at covering all
-available state-of-the-art algorithms from different community codes such as
+calculation kernels on molecular-dynamics applications. It aims at covering
+state-of-the-art algorithms from different community codes such as
 LAMMPS and GROMACS.
 
-## Build instructions
+## Getting started
 
-Properly configure your building by changing `config.mk` file. The following
-options are available:
+Clone the repository from GitHub:
 
-- **TOOLCHAIN:** Compiler toolchain (available options: GCC, CLANG, ICC, ONEAPI, NVCC).
-- **ISA:** Instruction set (available options: ARM, X86). Only relevant with
-SIMD other than NONE.
-- **SIMD:** Instruction set (available options: NONE, SSE, AVX, AVX\_FMA, AVX2, AVX512).
-- **MASK\_REGISTERS:** Use AVX512 mask registers (always true when ISA is set to AVX512).
-- **OPT\_SCHEME:** Optimization algorithm (available options: verletlist, clusterpair).
-- **ENABLE\_LIKWID:** Enable likwid to make use of HPM counters.
-- **DATA\_TYPE:** Floating-point precision (available options: SP, DP).
-- **DATA\_LAYOUT:** Data layout for atom vector properties (available options: AOS, SOA).
-- **ASM\_SYNTAX:** Assembly syntax to use when generating assembly files (available options: ATT, INTEL).
-- **DEBUG:** Toggle debug mode.
-- **ONE\_ATOM\_TYPE:** Simulate only one atom type and do not perform table lookup for parameters.
-- **MEM\_TRACER:** Trace memory addresses for cache simulator.
-- **INDEX\_TRACER:** Trace indexes and distances for gather-md.
-- **COMPUTE\_STATS:** Compute statistics.
+```shell=
+git clone https://github.com/RRZE-HPC/MD-Bench.git
+```
+
+Edit `config.mk` and configure the compiler toolchain:
 
-Configurations for LAMMPS Verlet Lists optimization scheme:
+```makefile=
+# Compiler tool chain (GCC/CLANG/ICC/ICX/ONEAPI/NVCC)
+TOOLCHAIN ?= CLANG
+```
 
-- **ENABLE\_OMP\_SIMD:** Use omp simd pragma on half neighbor-lists kernels.
-- **USE\_SIMD\_KERNEL:** Compile kernel with explicit SIMD intrinsics.
+The best-supported toolchains are ICC (deprecated legacy Intel compiler) and ICX
+(LLVM-based Intel compiler). Choose NVCC to enable CUDA GPU kernels.
 
-Configurations for GROMACS MxN optimization scheme:
+The toolchain settings are located in the `./make` directory. Review the
+settings for the configured toolchain. You can configure different settings in
+`config.mk`; to get started on an x86-based system, the defaults are fine.
 
-- **USE\_REFERENCE\_VERSION:** Use reference version (only for correction purposes).
-- **XTC\_OUTPUT:** Enable XTC output.
+To build the binary, call `make`. On the NHR@FAU clusters, load the compiler
+module first (e.g. `module load intel`):
 
-Configurations for CUDA:
+```shell=
+make
+```
 
-- **USE\_CUDA\_HOST\_MEMORY:** Use CUDA host memory to optimize host-device transfers.
+While the Makefile works with any version of GNU make, some features require GNU
+make > v4.
+
+You can run MD-Bench without any arguments:
+
+```shell=
+./MDBench-VL-ICX-X86-AVX2-DP
+```
 
-When done, just use `make` to compile the code.
-You can clean intermediate build results with `make clean`, and all build results with `make distclean`.
-You have to call `make clean` before `make` if you changed the build settings.
+## Build system
 
-## Usage
+MD-Bench uses a Makefile with pattern rules and automatic dependency generation.
+If you add source files, you do not need to change the Makefile as long as the
+sources are placed in one of the `./src/verletlist/`, `./src/clusterpair/`, or
+`./src/common/` directories. If you change a file, all object files that depend
+on it are rebuilt.
 
-Use the following command to run a simulation:
+All configuration variables can be overridden on the command line. For example,
+to build with ICC without changing `./config.mk`, run:
 
-```bash
-./MD-Bench-<TAG>-<OPT_SCHEME> [OPTION]...
+```shell=
+make TOOLCHAIN=ICC
 ```
 
-Where `TAG` and `OPT_SCHEME` correspond to the building options with the same
-name. Without any options, a Copper FCC lattice system with size 32x32x32
-(131072 atoms) over 200 time-steps using the Lennard-Jones potential (sigma=1.0,
-epsilon=1.0) is simulated.
-
-The default behavior and other options can be changed using the following parameters:
-
-```sh
--p <string>:          file to read parameters from (can be specified more than once)
--f <string>:          force field (lj or eam), default lj
--i <string>:          input file with atom positions (dump)
--e <string>:          input file for EAM
--n / --nsteps <int>:  set number of timesteps for simulation
--nx/-ny/-nz <int>:    set linear dimension of systembox in x/y/z direction
--r / --radius <real>: set cutoff radius
--s / --skin <real>:   set skin (verlet buffer)
---freq <real>:        processor frequency (GHz)
---vtk <string>:       VTK file for visualization
---xtc <string>:       XTC file for visualization
+Multiple configurations can be built at the same time. Every configuration has a
+unique binary name `./MDBench-<build tag>`. Intermediate build results are
+located in a `./build/build-<build tag>/` directory.
+
+All make targets act on the current configuration set in `./config.mk`, but this
+can be overridden on the command line.
+
+Supported make targets:
+
+- `make`: Build the binary for current configuration.
+- `make clean`: Remove intermediate build results.
+- `make distclean`: Remove intermediate build results and binary. Also removes
+generated tags and clangd files (more on that later).
+- `make cleanall`: Remove all generated files. **Note**: This target applies to
+all configurations.
+- `make info`: Output the compiler version, useful for logging in automated
+benchmark scripts.
+- `make asm`: Generate assembly output of all source files. The assembly files
+are placed in the intermediate build directory.
+- `make format`: Reformat all source files with `clang-format` using the format
+specification in `.clang-format`.
+
+### Build time options
+
+- `TOOLCHAIN`: Determines which toolchain makefile is included
+- `ISA`: Only used in the build tag string
+- `SIMD`: Controls the generation of intrinsic kernels for clusterpair
+- `OPT_SCHEME`: Algorithmic variant (verletlist or clusterpair); different
+source directories and main routines are used for each variant
+- `ENABLE_LIKWID`: Turn on LIKWID instrumentation; the LIKWID library has to be available
+- `DATA_TYPE`: Switch between single precision and double precision floating
+point. This is controlled by defines.
+- `DATA_LAYOUT`: Switch between array-of-structure (AOS) and structure-of-array
+(SOA) layout for atom positions and forces. Tradeoff between better cache
+utilisation and easier SIMD vectorization.
+- `DEBUG`: Enable additional debug output
+- `SORT_ATOMS`: Re-sort atoms so that atoms that are spatially close are also
+close to each other in the data structures
+- `EXPLICIT_TYPES`: By default, the atom properties are stored in scalar variables.
+This option enables support for multiple atom types with different properties.
+- `ENABLE_OMP_SIMD`: Enforce the use of `#pragma omp simd` for the
+verletlist half-neighbour list force kernel. Without it, the Intel compiler (at
+least ICC) refuses to do SIMD vectorization.
+- `USE_REFERENCE_VERSION`: Enforce use of the C implementation of the clusterpair
+algorithm for validation
+- `USE_CUDA_HOST_MEMORY`: Enable pinned host memory for faster host-device transfers
+- `ENABLE_MPI`: Turn on the MPI parallel version of the code
+
+### Build for GPU targets
+
+MD-Bench currently only supports Nvidia GPUs using CUDA kernels. To enable CUDA
+kernels, specify `NVCC` as the toolchain. The CUDA source files are located in
+the same source directories and use a `Cuda` suffix and the `.cu` file extension.
+If `NVCC` is set as the toolchain, all supported kernels are automatically set to
+their CUDA variants at build time. This means a binary supports either CPU
+kernels or GPU kernels, but not both.
+
+## Command line arguments
+
+MD-Bench can be executed without any arguments. In this case, the full neighbor
+list testcase with the LJ force is computed for 200 steps and a size of
+32x32x32 unit cells.
+
+- `-p / --params <string>`: file to read parameters from (can be specified more
+than once). Default initialization sets parameters for default LJ testcase.
+- `-f <string>`: force field (lj, eam), default lj. For anything other than
+lj you also need to provide a specific parameter file.
+- `-i <string>`: input file with atom positions (dump). MD-Bench supports
+Brookhaven protein data bank (.pdb), GROMACS GROMOS87 (.gro), and LAMMPS dump
+(.dmp) file formats.
+- `-e <string>`: input file for EAM parameters
+- `-n / --nsteps <int>`: set number of timesteps for simulation (default 200)
+- `-nx/-ny/-nz <int>`: set linear dimension of systembox in x/y/z direction
+(default 32 in every dimension)
+- `-half <int>`: use half (1) or full (0) neighbor lists (default 0, full
+neighbor list)
+- `-r / --radius <real>`: set cutoff radius (default 2.5)
+- `-s / --skin <real>`: set skin (verlet buffer, default 0.3)
+- `-w <file>`: write input atoms to file
+- `--freq <real>`: processor frequency (GHz), used to calculate cycle metrics
+(default 2.4)
+- `--vtk <string>`: VTK output file for visualization
+
+## Available testcases
+
+For all variants you can switch between single precision and double precision
+and between AOS versus SOA data layouts using build time options. You can use
+the half neighbour list algorithm instead of the default full neighbour list by
+setting `-half 1`. To enforce SIMD vectorization for the half neighbour list
+algorithm you can set the option `ENABLE_OMP_SIMD=true`.
+
+### Lennard-Jones potential for solid copper
+
+Just start without any command line arguments; this is the default testcase. You
+may change the number of timesteps using the `-n` option and change the problem
+size using the `-nx`, `-ny`, and `-nz` options.
+
+### EAM potential for solid copper
+
+Call MD-Bench as follows:
+
+```shell=
+./MDBench-<TAG> -n 400 -f eam -e ./data/Cu_u3.eam
 ```
 
-## Examples
+Two different EAM variants are available: `Cu_u3.eam` and `Cu_u6.eam`. The EAM
+potential is currently only available for verletlist.
+
+### Lennard-Jones potential for melted copper
+
+The melted copper testcase has only 32000 atoms in the default configuration.
+Call MD-Bench as follows:
+
+```shell=
+./MDBench-<TAG> -n 400 -i ./data/copper_melting/input_lj_cu_one_atomtype_20x20x20.dmp
+```
+
+### Lennard-Jones potential for melted copper with explicit types
+
+Compile MD-Bench with `EXPLICIT_TYPES=true` in `config.mk`.
 
-TBD
+Call MD-Bench as follows:
+
+```shell=
+./MDBench-<TAG> -n 400 -i ./data/copper_melting/input_lj_cu_one_atomtype_20x20x20.dmp
+```
+
+**This testcase currently segfaults!**
+
+### EAM potential for melted copper
+
+Call MD-Bench as follows:
+
+```shell=
+./MDBench-<TAG> -n 400 -f eam -e ./data/Cu_u3.eam -i ./data/copper_melting/input_eam_cu_one_atomtype_20x20x20.dmp
+```
+
+Two different EAM variants are available: `Cu_u3.eam` and `Cu_u6.eam`. The EAM
+potential is currently only available for verletlist.
+
+### Lennard-Jones potential for argon gas
+
+Call MD-Bench as follows:
+
+```shell=
+./MDBench-<TAG> -i ./data/argon/input.gro -p ./data/argon/mdbench_params.conf
+```
 
 ## Citations
 
-Rafael Ravedutti Lucio Machado, Jan Eitzinger, Jan Laukemann, Georg Hager, Harald
-Köstler and Gerhard Wellein: MD-Bench: A performance-focused prototyping harness for
-state-of-the-art short-range molecular dynamics algorithms. Future Generation
-Computer Systems ([FGCS](https://www.sciencedirect.com/journal/future-generation-computer-systems)), Volume 149, 2023, Pages 25-38, ISSN 0167-739X, DOI:
+Rafael Ravedutti Lucio Machado, Jan Eitzinger, Jan Laukemann, Georg Hager,
+Harald Köstler and Gerhard Wellein: MD-Bench: A performance-focused prototyping
+harness for state-of-the-art short-range molecular dynamics algorithms. Future
+Generation Computer Systems
+([FGCS](https://www.sciencedirect.com/journal/future-generation-computer-systems)),
+Volume 149, 2023, Pages 25-38, ISSN 0167-739X, DOI:
 [https://doi.org/10.1016/j.future.2023.06.023](https://doi.org/10.1016/j.future.2023.06.023)
 
 Rafael Ravedutti Lucio Machado, Jan Eitzinger, Harald Köstler, and Gerhard