tools/autotune_grid/README

INTRODUCTION
============

The files / programs in this directory can be used to automatically tune CP2K/Quickstep's
computational kernel routines (the Gaussian to Plane wave transformations)
for a particular architecture/compiler version.

The kernel generation is based on timing several (automatically generated, but equivalent) source variants of
of the integrate_fast.F collocate_fast.F files, with dedicated routines for products of Gaussians of a given l-quantum number.
The variant of the code that gives the correct results (guard against compiler bugs) in the shortest time
is retained to generate 'optimal' routines.

It also generates dedicated routines for the xyz_to_vab routine for a given l-quantum number.
All routines are glued together into a library (libgrid.a)

SIMPLE USAGE
=============================

optimal routines change from compiler to compiler (brands, versions, and options) and computer architectures. If you don't want to
autogenerate these files now, just use the defaults, they will do reasonable (currently within about 20% of the the best variants)
througout. The full process can take several hours depending on your hard- and software, and the selected options.

0) Unzip files all files that might be gzipped in this distribution

tar -xzvf data.tgz

1) Make sure the machine you use for timing is idle (i.e. other running jobs perturb the timings).

2) Edit the config.in and specify 'FC_comp', 'FCFLAGS_OPT' and 'FCFLAGS_NATIVE', where FC_comp is the compiler together with any option required for parsing Fortran free form file, FCFLAGS_OPT is the set of optimisation flags which will be used for building the optimised grid kernels, and FCFLAGS_NATIVE is the flags used to build helper/generator programs - if you are using a cross-compiler, these flags should be used to target the local 'native' architecture e.g.

FC_comp="gfortran -ffree-form"
FCFLAGS_OPT=" -O3 -ftree-vectorize"
FCFLAGS_NATIVE="-march=native"

The highest optimization levels do not always give better results, use reasonable settings, and include the flags needed to
generate code for your target processor (e.g. -xT) especially if this can lead to vectorization (e.g. SSE). The compiler flags must 
be such that the compiler preprocesses files and interpretes them as free form Fortran95)

3) You can optionally edit the file config.in to specify the maximum l-quantum number for which to generate optimized routines (first line), 
and the number of times a test is repeated in order to get accurate timings (second line).
Recommended are 7 (for l, products of f functions and their derivative) and 3 (for the number of times to run the test).

4) Generate the makefile for the tests by typing:

./generate_makefile.sh

4) Start the process by typing:

make 

5) Once the job completed (might take several hours) you'll have generated the optimal kernel source files in the out_best directory. 

integrate_fast.F
collocate_fast.F
call_integrate.f90
call_collocate.f90

6) You can pack them into libgrid.a by typing:

make libgrid.a

7) When compiling CP2K, you can specify this libgrid.a file in the linking stage and define the macro -D__HAS_LIBGRID to use the routines included in that file instead of the default ones


ADVANCED USAGE
========================================

1) It is possible to speedup compilation by performing a parallel makefile. Just type:

make -j $NUM_PARALLEL_TASKS

where $NUM_PARALLEL_TASKS is usually the number of available cores. 
Be careful when running parallel executions. System noise might produce inaccurate results in some cases.
The safest way is to do the compilation stage in parallel but test each case separately.

make -j $NUM_PARALLEL_TASKS all_gen
make all_run
make gen_best

2) It is possible also to split the generation in two phases, a compilation phase which generates the code for all options, and a benchmarking phase that checks which one is the best combination for each problem size and generates the combination.

make all_gen        # Generates all testcases, can be run in a login node
make all_run        # Run all test cases, better run in a compute node
make gen_best       # Generate the best combination and store it on out_best
make libgrid.a      # Generates the library from the best combination


DETAILED INFORMATION AND TROUBLESHOUTING
========================================

1) Compiler problems .... 
a) some compilers with some options require hours to compile the generated files (despite the fact that they are
   rarely more than a few thousand lines). If you observe this, you'll need to reduce (or sometimes increase)
   the optimization level (e.g. -O3 -> -O2), as hundreds of compilation are required in total.
b) some compilers might generate really slow code for some source variants, making them run 100 times slower than the default, this can slow down
   the tuning process more than what is acceptable.  You'll need to avoid generating these code variants by modifying the sources. 
   See the file 'options.f90' near the lines 36-39.
c) some compilers might miscompile some code variants for certain compiler options. We try to guard against this by comparing against reference results.
   If you observe that a compiler miscompiles the code (mentioned in the output), we suggest you to use different options to compile CP2K and these routines.


2) What the source files are:
a) the program 'test.x' is basically a driver to call Quickstep's kernel routines with realistic data. 
   Most files are just simplified version of what you'll find in QS as well
b) generate.x is based on options.f90 and generate.f90, for each of the valid options (as determined by options.f90) generate.f90 will write a variant of the
   kernel routines. Currently, for the three main loops of the kernel, it will try explicit unrolling and to use vector notation. The latter tends to trigger
   vectorization with e.g. ifort.

3) In order to add source variants to be checked, you'll need to add it to the available options in options.f90 and implement it in generate.f90, all other things should follow automatically.  The total number of options has to be updated in the config.in and in the options.f90 module (value of the variable total_nopt).