Running ChaNGa
ChaNGa accepts Tipsy files as initial conditions. The running of the program is controlled by either a parameter file or command line switches, in the style of PKDGRAV. See the testcosmo or the teststep subdirectories for example parameter files. ChaNGa --help will give a list of all available options. Their meaning is described in ChaNGa Options. ChaNGa can be run in parallel or in serial. Generally (depending on the architecture), running in parallel requires starting ChaNGa with the charmrun program. For example
charmrun +p4 ./ChaNGa cube300.param
will start ChaNGa on four processors using the cube300.param parameter file.
Here is a more complicated example:
charmrun +p 4 ++local ./ChaNGa -wall 60 +balancer MultistepLB_notopo cube300.param
++local means run all processes locally and ignore the network. -wall 60 means run for 60 minutes before checkpointing and stopping. +balancer MultistepLB_notopo specifies a load balancer.
SMP refers to Symmetric Multi-Processing, which means many cores on each compute node have access to the same memory space. The charm run time can take advantage of this access and use fewer messages, but the start command needs to be modified to tell ChaNGa about the processor configuration.
If charm is built with the smp option to take advantage of SMP, then when ChaNGa is compiled the executables charmrun.smp and ChaNGa.smp are produced to indicate that SMP execution is compiled in. An example command line to run on 2 nodes with 48 SMP cores each will now look like:
charmrun.smp +p 94 ChaNGa.smp ++ppn 47 +setcpuaffinity +commap 0 +pemap 1-47 test.param
Although there are a total of 96 cores available in this example, each node needs one core for communication, so only 94 cores (the +p 94 argument) are available as "workers", 47 per node (the ++ppn 47 argument). Specifying the layout of the communication threads and worker threads on the cores frequently helps performance. Here the +setcpuaffinity +commap 0 +pemap 1-47 arguments specify a layout with a communication thread on core 0 and worker threads on cores 1 to 47.
Sometimes more than one communication thread is needed per node. In the following example, each of two nodes has two sockets, with each socket containing 64 cores. With this many cores, more than one communication thread is likely to be needed so the command would be:
charmrun.smp +p 252 ChaNGa.smp ++ppn 63 +setcpuaffinity +commap 0,64 +pemap 1-63,65-127 test.param
In this case, a total of 4 threads (2 per node) are being used for communication, so only 252 of the 256 cores are available for computing. The ++ppn 63 indicates there will be 63 workers for each communication thread, and they will be laid out such that communication threads will be on cores 0 and 64 while the rest of the cores will be used for workers. Laying them out in this order allows each socket to have a communication thread and 63 associated workers. Note that the numbering of cores across sockets can vary by machine, so check the machine's documentation for the best layout.
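The arithmetic behind the two SMP command lines above can be sketched in a few lines of Python. This is an illustrative helper, not part of ChaNGa or charm: the function name smp_layout is hypothetical, and it assumes that cores on a node are numbered contiguously and split into equal groups, one group per communication thread, with the first core of each group reserved for communication. Machines that number cores differently across sockets will need a different map.

```python
def smp_layout(nodes, cores_per_node, comm_threads_per_node=1):
    """Return SMP launch arguments for a contiguous core numbering.

    Assumption (not from ChaNGa itself): cores on each node are numbered
    0..cores_per_node-1 and are divided into equal contiguous groups, one
    group per communication thread.
    """
    if cores_per_node % comm_threads_per_node != 0:
        raise ValueError("cores must divide evenly among communication threads")
    group = cores_per_node // comm_threads_per_node
    if group < 2:
        raise ValueError("each group needs a communication core and a worker core")

    ppn = group - 1  # workers per communication thread
    total_workers = nodes * comm_threads_per_node * ppn

    # The first core of each group hosts the communication thread.
    comm_cores = [g * group for g in range(comm_threads_per_node)]
    # The remaining cores of each group host the workers.
    worker_ranges = [f"{c + 1}-{c + ppn}" for c in comm_cores]

    return [
        "+p", str(total_workers),
        "++ppn", str(ppn),
        "+setcpuaffinity",
        "+commap", ",".join(str(c) for c in comm_cores),
        "+pemap", ",".join(worker_ranges),
    ]
```

Calling smp_layout(2, 48) gives the +p 94, ++ppn 47, +commap 0, +pemap 1-47 values of the first example, and smp_layout(2, 128, 2) gives the +p 252, ++ppn 63, +commap 0,64, +pemap 1-63,65-127 values of the second.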
The "net" version of charm starts multiple processes by invoking ssh; therefore an ssh server needs to be installed on the target machine. For example, on Red Hat/Fedora machines the openssh-server package needs to be installed. yum install openssh-server will accomplish this. If you are using the "net" version to run on a single machine with multiple cores, the use of ssh can be avoided by using the ++local charmrun option. Also, by default, ssh requires you to enter your password. This can be avoided by setting up your ssh keys correctly. See the SSH with keys HOWTO for information on how to do this.
The GPU version is experimental.
The GPU version of ChaNGa offloads computation to the GPU in chunks called work requests (WR). The interaction of one bucket of particles with a node or another bucket of particles constitutes one unit of computation. Each WR can hold a certain, specified, number of force computations. An appropriate value for the WR size can be specified by the user.
There are several kinds of WR in ChaNGa. WRs that represent the computation between local buckets and local data (either nodes or other buckets) are referred to as 'local'. Similarly, WRs that specify computation of local buckets with remote prefetched data are termed 'remote'. Finally, WRs that specify interaction between local buckets and remote data that haven't been prefetched are termed 'remote-resume'.
ChaNGa provides the following parameters to assign a value for each type of WR:
Local WRs:
- -localnodes: bucket - local node computations to offload per WR
- -localparts: bucket - local bucket computations to offload per WR
Remote WRs:
- -remotenodes: bucket - remote node computations to offload per WR
- -remoteparts: bucket - remote bucket computations to offload per WR
Remote-resume WRs:
- -remoteresumenodes: bucket - remote-resume node computations to offload per WR
- -remoteresumeparts: bucket - remote-resume bucket computations to offload per WR
Appropriate values can be obtained by the following mechanism:
- Recompile the ChaNGa CUDA version with -DCUDA_STATS in addition to the other CUDA-specific flags.
- This gives the per-iteration count of each type of interaction (localnodes, localparts, remotenodes, remoteparts, remoteresumenodes, remoteresumeparts).
- These values can be used to split the total number of interactions into as many pieces (WRs) as deemed appropriate. Some effort might be required to determine appropriate values in this fashion.
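The splitting step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name wr_size_flags is hypothetical, the interaction counts in any call are placeholders rather than measured values, and it assumes that each option takes its per-WR size as the following command line argument (in the same style as -wall 60). Splitting every interaction type into the same number of pieces is a starting point only; some experimentation may still be needed.

```python
import math

# Option names as listed in this section, in the same order.
WR_FLAGS = (
    "localnodes", "localparts",
    "remotenodes", "remoteparts",
    "remoteresumenodes", "remoteresumeparts",
)


def wr_size_flags(counts, pieces):
    """Split per-iteration interaction counts into `pieces` work requests.

    counts: dict mapping an option name (e.g. "localnodes") to the
            per-iteration count reported by a -DCUDA_STATS build.
    pieces: how many WRs each interaction type should be split into.
    Returns a flat list of command line arguments.
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    args = []
    for name in WR_FLAGS:
        if name not in counts:
            continue
        # Round up so that `pieces` WRs always cover the full count.
        size = max(1, math.ceil(counts[name] / pieces))
        args += [f"-{name}", str(size)]
    return args
```

For instance, with a hypothetical count of 1000 local node interactions and 4 pieces, the helper yields -localnodes 250.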
On MPI architectures, you have the option of building the MPI version of charm, and then charmrun is just a shell script wrapper around whatever command is used to start MPI jobs (e.g., poe on IBM, mpirun on MPICH). A typical launch command for an MPI job would be
mpiexec ./ChaNGa -wall 600 +balancer MultistepLB_notopo simulation.param
where 600 refers to the minutes of wallclock time requested from the queuing system and MultistepLB_notopo is the specified load balancer.
Another option on many InfiniBand clusters is to use the native InfiniBand support. See the instructions at https://github.com/N-BodyShop/changa/wiki/Machine-Specific-Build-Instructions#Infiniband_Linux_cluster_lonestar_stampede_at_TACC_gordon_at_SDSC_Plieades_at_NAS for details.
Many Cray machines (XE, XK, and XC series) use aprun to start parallel jobs. Like mpirun, aprun takes the place of charmrun. See the aprun documentation for how to specify the number of nodes and the number of cores per node. An example is:
aprun -n 4 -N 1 -d 16 ChaNGa +ppn 15 cube300.param
This starts ChaNGa on 4 nodes with one SMP process with 16 threads (15 workers, 1 communication) per node.
See appendix C of the CHARM language manual for more information on parallel execution. Also see Research:ChaNGaPerformanceAnalysis to evaluate how these options affect the parallel performance.
Outputs are also in TIPSY format and are in files that end with the timestep. For example, to visualize the final output of the testcosmo simulation, fire up tipsy and type
openbinary cube300.000128
loadstandard 1.0
zall
This should display the clustering of galaxies on a 300 Mpc scale.
It is frequently the case that a simulation will take much more wall clock time than a batch queuing system will allow. In this case, ChaNGa can write checkpoints at regular step intervals (iCheckInterval), and the simulation can be restarted in a subsequent batch submission from one of these checkpoints. A simulation can be restarted from a checkpoint using the syntax:
charmrun +p4 ./ChaNGa +restart cube300.chk0
where cube300.chk0 is an example restart directory. As ChaNGa runs, it produces restart directories with suffixes alternating between .chk0 and .chk1.
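Because the suffix alternates, a batch script has to decide which of the two directories to restart from. The sketch below picks the more recently modified one. It is a heuristic written for illustration, not something ChaNGa provides: the function names are hypothetical, and it assumes that file modification times reflect which checkpoint was written last. It cannot detect a checkpoint that was interrupted while being written, so check the run log before relying on it.

```python
import os


def newest_mtime(path):
    """Newest modification time of a directory or its immediate entries."""
    times = [os.path.getmtime(path)]
    with os.scandir(path) as entries:
        times.extend(entry.stat().st_mtime for entry in entries)
    return max(times)


def latest_checkpoint(base, directory="."):
    """Return the newer of base.chk0 and base.chk1, or None if neither exists.

    Assumption: modification times indicate which checkpoint is most recent.
    """
    candidates = []
    for suffix in (".chk0", ".chk1"):
        path = os.path.join(directory, base + suffix)
        if os.path.isdir(path):
            candidates.append((newest_mtime(path), path))
    if not candidates:
        return None
    return max(candidates)[1]


def restart_command(checkpoint_dir, processors=4):
    """Build the restart command line shown above for a chosen directory."""
    return ["charmrun", f"+p{processors}", "./ChaNGa", "+restart", checkpoint_dir]
```

A job script could call latest_checkpoint("cube300") and, if the result is not None, pass it to restart_command; otherwise it would start the run from the parameter file.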
All parameters will be restored from the checkpoint directory. Only a small subset of the run parameters can be changed in a restart, and only by specifying the changes via command line arguments. These include the base timestep (-dt), the number of timesteps (-n), the wall clock time limit (-wall), the particles/bucket (-b), the output interval (-oi), and the checkpoint interval (-oc).
If a restart is needed that involves a substantial change in the run (e.g. changing the version of the code), this can be accomplished by restarting from an output file. In this case the parameter file should be edited such that the parameter achInFile is now the output file from which you wish to restart, and the parameter iStartStep is set to the step number of that file.
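Since output file names end with the timestep (as in cube300.000128), the two parameter values can be derived from the file name. The following is a minimal sketch: the helper names are hypothetical, and it assumes the step number is the purely numeric text after the last dot and that the parameter file uses the name = value form seen in bLiveViz = 1.

```python
def restart_params(output_file):
    """Derive achInFile and iStartStep from an output file name.

    Assumption: the name ends in ".<step>", where <step> is all digits,
    for example "cube300.000128".
    """
    _, dot, step = output_file.rpartition(".")
    if not dot or not step.isdigit():
        raise ValueError(f"no numeric step suffix in {output_file!r}")
    return {"achInFile": output_file, "iStartStep": int(step)}


def as_param_lines(params):
    """Format parameters as 'name = value' lines for a parameter file."""
    return [f"{name} = {value}" for name, value in params.items()]
```

The resulting lines replace the existing achInFile and iStartStep entries in the parameter file; every other parameter stays under your control, which is what makes this route suitable for substantial changes.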
ChaNGa now (as of 7/2009) has on-demand visualization capabilities via the liveViz module of CHARM++. To use it, set bLiveViz = 1 in the parameter file, and start ChaNGa with
charmrun +p4 ++server ++server-port NNNNN ./ChaNGa run.param
where NNNNN is an unused TCP port number. Images of the running simulation can be obtained by using the liveViz Java client from the CHARM++ distribution in java/bin/liveViz. The syntax is liveViz hostname NNNNN where hostname is the machine on which charmrun is running, and NNNNN is the port number given above. A window will pop up with an image that will continually be refreshed from the running program. The image view is controlled by the .director file. See Research:ChaNGaOptions#Movie_Making_options.
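One way to find an unused TCP port for NNNNN is to ask the operating system for one. The sketch below uses only the Python standard library; the helper names are hypothetical, and it must be run on the machine where charmrun will start. Another process could claim the port between the check and the launch, so treat the result as a good candidate rather than a guarantee.

```python
import socket


def find_unused_port(host=""):
    """Ask the OS for a currently free TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def liveviz_server_command(port, processors=4, param_file="run.param"):
    """Build the charmrun command line shown above for a given port."""
    return [
        "charmrun", f"+p{processors}",
        "++server", "++server-port", str(port),
        "./ChaNGa", param_file,
    ]
```

The same port number is then passed to the liveViz client together with the host name.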
See Research:ChaNGaPerformanceAnalysis for tools to measure and improve the performance of ChaNGa.
An email list has been set up at [email protected]. Please subscribe to the list before posting to it.
Bugs and feature requests can be submitted to the NChilada product of our Redmine server.
Also check out our list of Research:ChaNGa Issues for common errors when running ChaNGa.
Internal code documentation using doxygen is partially done.
While there is no comprehensive body of documentation detailing the ChaNGa code, the recent refactoring efforts are outlined and discussed here. The refactoring process unearthed the answers to some nuances of the existing code as well, so one would do well to look through these articles.
The development of ChaNGa was supported by a National Science Foundation ITR grant PHY-0205413 to the University of Washington, and NSF ITR grant NSF-0205611 to the University of Illinois. Contributors to the program include Graeme Lufkin, Tom Quinn, Rok Roskar, Filippo Gioachin, Sayantan Chakravorty, Amit Sharma, Pritish Jetley, Lukasz Wesolowski, Edgar Solomonik, Celso Mendes, Joachim Stadel, and James Wadsley.