
Loop macros #39

Merged
merged 33 commits into from
Jan 12, 2022

Conversation

johnomotani
Collaborator

@johnomotani johnomotani commented Jan 10, 2022

This PR introduces the loop macros discussed in #35.

This new implementation of parallelization is more flexible - it allows any number of processes to be used, regardless of the number of species or spatial/velocity-space grid dimensions. The dimensions are divided up to give the 'best possible' load balance, with the restriction that each dimension is divided up independently (i.e. the ranges for inner loops are not allowed to depend on the indices of the outer loops).
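The 'best possible' load balance with each dimension divided independently can be pictured with a small sketch. This is a hypothetical helper, not the actual moment_kinetics implementation: it splits one dimension of size n as evenly as possible among nproc processes.

```julia
# Hypothetical sketch (not the real moment_kinetics code) of the kind of
# load balancing described above: divide a dimension of size n as evenly
# as possible among nproc processes, returning the range owned by process
# irank (0-based). Each dimension would be split independently like this.
function get_local_range(irank, nproc, n)
    n_per_proc, remainder = divrem(n, nproc)
    # The first `remainder` processes each take one extra point
    if irank < remainder
        first_ind = irank * (n_per_proc + 1) + 1
        last_ind = first_ind + n_per_proc
    else
        first_ind = remainder * (n_per_proc + 1) +
                    (irank - remainder) * n_per_proc + 1
        last_ind = first_ind + n_per_proc - 1
    end
    return first_ind:last_ind
end
```

For example, splitting 10 points over 3 processes gives ranges of 4, 3 and 3 points, so no process is idle regardless of whether nproc divides n.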

There is a slight performance drop when evolve_upar == false. The previous implementation had a special case for looping just over the ion species, which I have not implemented here, to keep things as simple as possible and because the main target is the case where upar and ppar are evolved anyway. That special case was used in vpa_advection!(); in the new implementation, when evolve_upar == false, processes that are handling neutrals simply continue.

The implementation in looping.jl is a bit of a nightmare of code-generation, although it's only 375 lines. It was designed to be flexible enough that the only thing we ever have to change is the Tuple of dimensions const all_dimensions = (:s, :z, :vpa) near the top. Sets of loop ranges and macros for loops over any combination of those dimensions are then automatically generated. So hopefully we do not have to fight with it often in future.
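The combinatorics behind that code generation can be illustrated with a self-contained sketch (hypothetical, not the real looping.jl): enumerating every non-empty, order-preserving combination of all_dimensions, one combination per family of generated ranges and macros.

```julia
# Hypothetical sketch of the combination generation implied above: list
# every non-empty subset of the dimension tuple, preserving dimension
# order, so that a set of loop ranges/macros could be generated for each.
const all_dimensions = (:s, :z, :vpa)

function dimension_combinations(dims)
    combos = Vector{Vector{Symbol}}()
    n = length(dims)
    for bits in 1:(2^n - 1)
        # Treat `bits` as a bitmask selecting which dimensions are included
        push!(combos, [dims[i] for i in 1:n if ((bits >> (i - 1)) & 1) == 1])
    end
    return combos
end
```

With (:s, :z, :vpa) this yields seven combinations, which would correspond to seven families of loop macros (e.g. one family containing @s_z_loop and its per-dimension variants). Adding a dimension to the tuple automatically extends the set of combinations.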

Description (copied from README.md)

  • There are two types of loop macros:
    • 1. If there is just a single nested loop, then for example
      @s_z_loop is iz begin
          for ivpa in 1:vpa.n
              f[ivpa,iz,is] = ...
          end
      end
      The dimensions in the prefix before _loop give the dimensions that are looped over by the macro, the arguments before begin are the names of the loop variables, in the same order as the dimensions in the prefix; the first dimension/loop-variable corresponds to the outermost nested loop, etc.
    • 2. For more complex loops, a separate macro can be used for each level, for example
      @s_z_loop_s is begin
          some_setup(is)
          @s_z_loop_z iz begin
              @views do_something(f[:,iz,is])
          end
          @s_z_loop_z iz begin
              @views do_something_else(f[:,iz,is])
          end
      end
      The dimensions in the prefix before _loop_ again give the dimensions that are looped over in the nested loop. The dimension in the suffix after _loop_ indicates which particular dimension the macro loops over. The argument before begin is the name of the loop variable.
  • The ranges used are stored in a LoopRanges struct in the Ref variable loop_ranges (which is exported by the looping module). Occasionally it is useful to access the range directly. For example the range looped over by the macro @s_z_loop_s is loop_ranges[].s_z_range_s (same prefix/suffix meanings as the macro).
    • The square brackets [] are needed because loop_ranges is a reference to a LoopRanges object Ref{LoopRanges} (a bit like a pointer) - it allows loop_ranges to be a const variable, so its type is always known at compile time, but the actual LoopRanges can be set/modified at run-time.
  • It is also possible to run a block of code in serial (on just the rank-0 member of each block of processes) by wrapping it in a @serial_region macro. This is mostly useful for initialization or file I/O where performance is not critical. For example
    @serial_region begin
        # Do some initialization
        f .= 0.0
    end
  • In any loops with the same prefix (whether type 1 or type 2) the same points belong to each process, so several loops can be executed without synchronizing the different processes. It is (mostly) only when changing the 'type' of loop (i.e. which dimensions it loops over) that synchronization is necessary, or when changing from 'serial region(s)' to parallel loops. To aid clarity and to allow some debugging routines to be added, the synchronization is done with functions labelled with the loop type. For example begin_s_z_region() should be called before @s_z_loop or @s_z_loop_* is called, and after any @serial_region or other type of @*_loop* macro. begin_serial_region() should be called before @serial_region.
  • Internally, the begin_*_region() functions call _block_synchronize(), which calls MPI.Barrier(). When all debugging is disabled they are equivalent to MPI.Barrier(). Having different functions allows extra consistency checks to be done when debugging is enabled, see debug_test/README.md.
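The const-Ref pattern used for loop_ranges can be sketched in isolation. LoopRangesSketch below is a made-up stand-in for the real LoopRanges struct, just to show why the pattern works:

```julia
# Hypothetical stand-in for the LoopRanges pattern described above: a
# `const` Ref has a fixed type known at compile time, while the object it
# holds can still be replaced at run time (accessed with the [] syntax).
struct LoopRangesSketch
    s_range::UnitRange{Int64}
end

const loop_ranges_sketch = Ref(LoopRangesSketch(1:1))

# Replace the contents at run time, e.g. once ranges have been assigned
loop_ranges_sketch[] = LoopRangesSketch(2:4)
```

Because loop_ranges_sketch is const, code that reads loop_ranges_sketch[].s_range is type-stable, yet the ranges themselves can still be swapped out after MPI setup.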

Closes #35, closes #38.

Generate macros for loops over any combination of dimensions, which get
pre-generated ranges so that each type of loop is parallelized over the
shared MPI arrays with optimal load balance.
The only functional difference between the different 'regions' is in
debugging code, but regions provide structure showing where
synchronization calls are needed.
Now replaced by the new implementation in looping.jl.
If we are precompiling, we presumably want an optimized, production build,
so always pass `-O3 --check-bounds=no` to the build process.
Parsing command line options in command_line_options.__init__() caused
problems (command line arguments being ignored) when moment_kinetics was
compiled into a static system image. [My guess is that __init__() was
called too early in that case, before ARGS was set up properly.]

Instead, define a getter function get_options(), which parses the
command line arguments when it is called. This is slightly less
efficient, but will only be called a few times, and should be more
robust. Also allows changing the options, e.g. adding/removing "--long"
from ARGS during a REPL session to change which tests are run.
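The getter pattern can be sketched as follows; this is a minimal hypothetical stand-in, not the actual command_line_options code, and recognizes only a made-up "--long" flag for illustration:

```julia
# Hypothetical sketch of lazy option parsing: parse the argument list only
# when get_options() is called, so parsing happens after ARGS is fully set
# up, and re-parsing picks up run-time changes to ARGS (e.g. in a REPL).
function get_options(args=ARGS)
    options = Dict{String,Any}("long" => false)
    for arg in args
        if arg == "--long"
            options["long"] = true
        end
    end
    return options
end
```

Because parsing is deferred to each call, nothing depends on module initialization order, and adding or removing "--long" from ARGS between calls changes the result.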
Also test with 3 processes, as this is now supported by moment_kinetics.
@johnomotani johnomotani added the enhancement New feature or request label Jan 10, 2022
@mabarnes mabarnes merged commit b6f2082 into master Jan 12, 2022
@johnomotani johnomotani deleted the loop-macros branch January 26, 2022 16:44