
Building GridPACK on Cori


First, copy the gpbuild folder into the home directory:
rsync -rl --info=progress2 /global/common/software/m3363/gpbuild ~
There shouldn't be any need to edit any of the files in ~/gpbuild after copying, as they should already be set up to install into $HOME/gpbuild.
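
If you want to double-check this, a quick optional sanity check is to grep the copied scripts for the shared project path; the pattern below is just a heuristic:

grep -rn --include='*csh*' '/global/common/software/m3363' ~/gpbuild   # any hits would mean a script still points at the shared area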

Build process

First, configure the environment:

cd ~/gpbuild
source module_csh

module list should print something like this:

  1) modules/3.2.11.4                                  9) cray-libsci/19.06.1                              17) dvs/2.12_2.2.167-7.0.1.1_17.11__ge473d3a2
  2) altd/2.0                                         10) udreg/2.3.2-7.0.1.1_3.61__g8175d3d.ari           18) alps/6.6.58-7.0.1.1_6.30__g437d88db.ari
  3) darshan/3.2.1                                    11) ugni/6.0.14.0-7.0.1.1_7.63__ge78e5b0.ari         19) rca/2.2.20-7.0.1.1_4.74__g8e3fb5b.ari
  4) craype-network-aries                             12) pmi/5.0.14                                       20) atp/2.1.3
  5) craype-haswell                                   13) dmapp/7.1.1-7.0.1.1_4.72__g38cf134.ari           21) PrgEnv-gnu/6.0.5
  6) cray-mpich/7.7.10                                14) gni-headers/5.0.12.0-7.0.1.1_6.46__g3b1768f.ari  22) boost/1.69.0
  7) gcc/8.3.0                                        15) xpmem/2.2.20-7.0.1.1_4.28__g0475745.ari          23) cmake/3.21.3
  8) craype/2.6.2                                     16) job/2.2.4-7.0.1.1_3.55__g36b56f4.ari             24) git/2.21.0
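
For reference, module_csh essentially arranges for the modules listed above to be loaded. A rough sketch of what such a script might contain is shown below; the exact swap/load commands in the real file may differ:

# sketch only -- the actual module_csh on Cori may load or swap additional modules
module swap PrgEnv-intel PrgEnv-gnu   # select the GNU programming environment (PrgEnv-gnu/6.0.5)
module load gcc/8.3.0
module load boost/1.69.0
module load cmake/3.21.3
module load git/2.21.0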

1. GA build

cd ~/gpbuild/ga-5.8.1/
mkdir GA_shared
./configure --with-mpi-ts --disable-f77 --without-blas --enable-cxx --enable-i4 --prefix="$HOME/gpbuild/GA_shared" --enable-shared
make
make install
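
If the install succeeded, the GA libraries and headers should now be under the prefix given above; a quick check (the file names below are what GA typically installs with --enable-shared and --enable-cxx, so treat them as expectations rather than guarantees):

ls $HOME/gpbuild/GA_shared/lib       # expect shared libraries such as libga.so and libga++.so
ls $HOME/gpbuild/GA_shared/include   # expect GA headers such as ga.h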

2. PETSc

cd ~/gpbuild/petsc-3.7.6
make clean
chmod +x ./build_csh
./build_csh
sbatch --wait build.job # Produces reconfigure-cori-gnu-cxx-complex-opt.py
python2 reconfigure-cori-gnu-cxx-complex-opt.py
make all
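
The GridPACK configure step below points at this PETSc build through PETSC_DIR and PETSC_ARCH (the same values are exported again in the Tests section); to have them in your environment now:

export PETSC_DIR=$HOME/gpbuild/petsc-3.7.6
export PETSC_ARCH=cori-gnu-cxx-complex-opt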

Modify the build_csh script in the GridPACK build directory ($HOME/gpbuild/GridPACK/src/build) to update the library paths. The version used for this build is shown below:

rm -rf CMake*

export BOOST_DIR=/usr/common/software/boost/1.69.0/gnu/haswell
export LD_LIBRARY_PATH=$BOOST_DIR/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$BOOST_DIR/:$CMAKE_PREFIX_PATH
export BOOST_ROOT=$BOOST_DIR
export BOOST_INC=$BOOST_DIR/include
export BOOST_LIB=$BOOST_DIR/lib

CFLAGS='-L/opt/cray/xpmem/default/lib64 -lxpmem -L/opt/cray/ugni/default/lib64 -lugni -L/opt/cray/udreg/default/lib64 -ludreg -L/opt/cray/pe/pmi/default/lib64 -lpmi' \
cmake -Wdev \
-D BOOST_ROOT:STRING='/usr/common/software/boost/1.69.0/gnu/haswell' \
-D CMAKE_TOOLCHAIN_FILE:STRING=$HOME/gpbuild/GridPACK/src/build/ToolChain.cmake \
-D PETSC_DIR:STRING="$HOME/gpbuild/petsc-3.7.6" \
-D PETSC_ARCH:STRING='cori-gnu-cxx-complex-opt' \
-D BUILD_GA:BOOL=ON \
-D GA_INFINIBAND:BOOL=ON \
-D CMAKE_INSTALL_PREFIX:PATH="$HOME/gpbuild/GridPACK/src/gridpack-install" \
-D BUILD_SHARED_LIBS:BOOL=ON \
-D MPI_CXX_COMPILER:STRING='CC' \
-D MPI_C_COMPILER:STRING='cc' \
-D MPIEXEC:STRING='srun' \
-D CHECK_COMPILATION_ONLY:BOOL=true \
-D ENABLE_CRAY_BUILD:BOOL=true \
-D USE_PROGRESS_RANKS:BOOL=false \
-D CMAKE_BUILD_TYPE:STRING='RELWITHDEBINFO' \
-D CMAKE_VERBOSE_MAKEFILE:STRING=TRUE \
..

3. GridPACK

cd $HOME/gpbuild/GridPACK/src/build
rm -rf ../gridpack-install
mkdir ../gridpack-install
make clean
rm -rf CMake*
./build_csh
make; make install

This will install GridPACK to $HOME/gpbuild/GridPACK/src/gridpack-install.
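
A quick way to confirm the install landed where expected (the exact subdirectories depend on the GridPACK version, but lib and bin are both referenced in the wrapper test section below):

ls $HOME/gpbuild/GridPACK/src/gridpack-install   # expect at least lib/ and bin/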

4. GridPACK Python Wrapper

cd ~/gpbuild/GridPACK/python
GRIDPACK_DIR=$HOME/gpbuild/GridPACK/src/gridpack-install
export GRIDPACK_DIR
unset RHEL_OPENMPI_HACK
rm -rf build/*
rm -rf *.so
rm -rf gridpack_hadrec.egg-info/
python setup.py build
mkdir $GRIDPACK_DIR/lib/python
PYTHONPATH="${GRIDPACK_DIR}/lib/python:${PYTHONPATH}"
export PYTHONPATH
python setup.py install --home="$GRIDPACK_DIR"
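
After the install, the wrapper should appear as an egg under $GRIDPACK_DIR/lib/python; the exact egg name depends on the Python version used (the test section below assumes a Python 3.9 build):

ls $GRIDPACK_DIR/lib/python   # expect something like gridpack_hadrec-0.0.1-py3.X-linux-x86_64.egg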

Tests

PETSc (seems to be working)

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell -A m3363
cd ~/gpbuild/petsc-3.7.6/src/snes/examples/tutorials
export PETSC_DIR=$HOME/gpbuild/petsc-3.7.6
export PETSC_ARCH=cori-gnu-cxx-complex-opt 
make ex1
make ex3

Output of srun -n 1 ./ex1 -ksp_gmres_cgs_refinement_type refine_always -snes_monitor_short

  0 SNES Function norm 6.04152 
  1 SNES Function norm 4.78676 
  2 SNES Function norm 2.98646 
  3 SNES Function norm 0.230624 
  4 SNES Function norm 0.00193631 
  5 SNES Function norm 1.43559e-07 
  6 SNES Function norm < 1.e-11
Number of SNES iterations = 6

Output of srun -n 3 ./ex3 -nox -pc_type asm -mat_type mpiaij -snes_monitor_cancel -snes_monitor_short -ksp_gmres_cgs_refinement_type refine_always:

atol=1e-50, rtol=1e-08, stol=1e-08, maxit=50, maxf=10000
  0 SNES Function norm 5.41468 
  1 SNES Function norm 0.295258 
  2 SNES Function norm 0.000450229 
  3 SNES Function norm 1.38967e-09 
Number of SNES iterations = 3
Norm of error 1.49751e-10 Iterations 3

GridPACK (seems to be working)

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell -A m3363
cd ~/gpbuild/GridPACK/src/build/applications/powerflow
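
The exact run command was not recorded here; judging from the output below it was an srun launch of the powerflow example on 3 ranks, roughly of the form shown (powerflow.x and the input file name are assumptions, not the actual arguments used):

srun -n 3 ./powerflow.x <input.xml>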
Output:


GridPACK math module configured on 3 processors
                                                                               
Maximum number of iterations: 50                                               
                                                                               
Convergence tolerance: 0.000001                                                
0: I have 46 buses and 59 branches                                             
1: I have 49 buses and 68 branches                                             
2: I have 45 buses and 63 branches                                             
 repeat time = 1                                                               
                                                                               
----------test Iteration 0, before PF solve, Tol: 9.638903e-02                 
  Residual norms for nrs_ solve.                                               
  0 KSP Residual norm 7.190983016757e-03                                       
  1 KSP Residual norm 1.347944633023e-03                                       
  2 KSP Residual norm 4.535626750145e-04                                       
  3 KSP Residual norm 2.917341728681e-04                                       
  4 KSP Residual norm 1.647921238149e-04                                       
  5 KSP Residual norm 7.897452726535e-05                                       
  6 KSP Residual norm 6.607054072768e-05                                       
  7 KSP Residual norm 5.371399518110e-05                                       
  8 KSP Residual norm 2.813124046284e-05                                       
  9 KSP Residual norm 1.594051825451e-05                                       
 10 KSP Residual norm 1.064548544847e-05                                       
 11 KSP Residual norm 8.490960299617e-06                                       
 12 KSP Residual norm 6.180701866024e-06                                       
 13 KSP Residual norm 4.020683912845e-06                                       
 14 KSP Residual norm 1.398606474076e-06                                       
 15 KSP Residual norm 8.626097626373e-07                                       
 16 KSP Residual norm 7.364346571727e-07                                       
 17 KSP Residual norm 6.848053732285e-07                                       
 18 KSP Residual norm 6.630520485578e-07                                       
 19 KSP Residual norm 6.560101273664e-07                                       
 20 KSP Residual norm 6.408068771526e-07                                       
 21 KSP Residual norm 5.938572278848e-07                                       
 22 KSP Residual norm 4.248234917688e-07                                       
 23 KSP Residual norm 2.594885010890e-07                                       
 24 KSP Residual norm 1.456851302217e-07                                       
 25 KSP Residual norm 7.026809525820e-08                                       
 26 KSP Residual norm 3.712644895932e-08                                       
 27 KSP Residual norm 1.436436895894e-08                                       
 28 KSP Residual norm 7.357363353615e-09                                       
KSP Object:(nrs_) 3 MPI processes                                              
  type: gmres                                                                  
    GMRES: restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    GMRES: happy breakdown tolerance 1e-30                                     
  maximum iterations=50                                                        
  tolerances:  relative=1e-12, absolute=1e-08, divergence=10000.               
  left preconditioning                                                         
  using nonzero initial guess                                                  
  using PRECONDITIONED norm type for convergence test                          
PC Object:(nrs_) 3 MPI processes                                               
  type: bjacobi                                                                
    block Jacobi: number of blocks = 3                                         
    Local solve is same for all blocks, in the following KSP and PC objects:   
  KSP Object:  (nrs_sub_)   1 MPI processes                                    
    type: preonly                                                              
    maximum iterations=10000, initial guess is zero                            
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.             
    left preconditioning                                                       
    using NONE norm type for convergence test                                  
  PC Object:  (nrs_sub_)   1 MPI processes                                     
    type: ilu                                                                  
      ILU: out-of-place factorization                                          
      0 levels of fill                                                         
      tolerance for zero pivot 2.22045e-14                                     
      matrix ordering: natural                                                 
      factor fill ratio given 1., needed 1.                                    
        Factored matrix follows:                                               
          Mat Object:           1 MPI processes                                
            type: seqaij                                                       
            rows=71, cols=71                                                   
            package used to perform factorization: petsc                       
            total: nonzeros=453, allocated nonzeros=453                        
            total number of mallocs used during MatSetValues calls =0          
              using I-node routines: found 40 nodes, limit used is 5           
    linear system matrix = precond matrix:                                     
    Mat Object:     1 MPI processes                                            
      type: seqaij                                                             
      rows=71, cols=71                                                         
      total: nonzeros=453, allocated nonzeros=1065                             
      total number of mallocs used during MatSetValues calls =0                
        using I-node routines: found 40 nodes, limit used is 5                 
  linear system matrix = precond matrix:                                       
  Mat Object:   3 MPI processes                                                
    type: mpiaij                                                               
    rows=201, cols=201                                                         
    total: nonzeros=1347, allocated nonzeros=5908                              
    total number of mallocs used during MatSetValues calls =0                  
      using I-node (on process 0) routines: found 40 nodes, limit used is 5    
                                                                               
Iteration 1 Tol: 9.638903e-02                                                  
  Residual norms for nrs_ solve.                                               
  0 KSP Residual norm 3.313863268360e-05                                       
  1 KSP Residual norm 1.604771965118e-05                                       
  2 KSP Residual norm 1.199185979790e-05                                       
  3 KSP Residual norm 7.960609461082e-06                                       
  4 KSP Residual norm 6.033927451473e-06                                       
  5 KSP Residual norm 4.482037948986e-06                                       
  6 KSP Residual norm 1.262973288967e-06                                       
  7 KSP Residual norm 9.394684496620e-07                                       
  8 KSP Residual norm 6.955684736309e-07                                       
  9 KSP Residual norm 4.966270714777e-07                                       
 10 KSP Residual norm 3.801378639361e-07                                       
 11 KSP Residual norm 3.159321420023e-07

GridPACK Python Wrapper (seems to be working)

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell -A m3363
cd ~/gpbuild/GridPACK/python

GRIDPACK_DIR=$HOME/gpbuild/GridPACK/src/gridpack-install
export GRIDPACK_DIR
PYTHONPATH="${GRIDPACK_DIR}/lib/python/gridpack_hadrec-0.0.1-py3.9-linux-x86_64.egg:${PYTHONPATH}"   # add the egg's path to PYTHONPATH
export PYTHONPATH
export PATH=$HOME/gpbuild/GridPACK/src/gridpack-install/bin:$PATH

Output of srun -n 3 python src/hello.py: (seems to be working)

hello.py: hello from process 0 of 3
hello.py: hello from process 1 of 3
hello.py: hello from process 2 of 3

GridPACK math module configured on 3 processors

Output of srun -n 3 python src/task_manager.py: (seems to be working)

process 0 of 3 executing task 0
process 0 of 3 executing task 3
process 0 of 3 executing task 6
process 0 of 3 executing task 9
process 0 of 3 executing task 12
process 0 of 3 executing task 15
process 0 of 3 executing task 18
process 0 of 3 executing task 21
process 0 of 3 executing task 24
process 0 of 3 executing task 27
process 0 of 3 executing task 30
process 0 of 3 executing task 33
process 0 of 3 executing task 36
process 0 of 3 executing task 39
process 0 of 3 executing task 42
process 0 of 3 executing task 45
process 0 of 3 executing task 48
process 0 of 3 executing task 51
process 0 of 3 executing task 54
process 0 of 3 executing task 57
process 0 of 3 executing task 60
process 0 of 3 executing task 63
process 0 of 3 executing task 66
process 0 of 3 executing task 69
process 0 of 3 executing task 72
process 0 of 3 executing task 75
process 0 of 3 executing task 78
process 0 of 3 executing task 81
process 0 of 3 executing task 84
process 0 of 3 executing task 87
process 0 of 3 executing task 90
process 0 of 3 executing task 93
process 0 of 3 executing task 96
process 0 of 3 executing task 99
process 1 of 3 executing task 2
process 1 of 3 executing task 4
process 1 of 3 executing task 8
process 1 of 3 executing task 11
process 1 of 3 executing task 14
process 1 of 3 executing task 17
process 1 of 3 executing task 20
process 1 of 3 executing task 23
process 1 of 3 executing task 26
process 1 of 3 executing task 29
process 1 of 3 executing task 32
process 1 of 3 executing task 35
process 1 of 3 executing task 38
process 1 of 3 executing task 41
process 1 of 3 executing task 44
process 1 of 3 executing task 47
process 1 of 3 executing task 49
process 1 of 3 executing task 52
process 1 of 3 executing task 55
process 1 of 3 executing task 58
process 1 of 3 executing task 61
process 1 of 3 executing task 64
process 1 of 3 executing task 67
process 1 of 3 executing task 70
process 1 of 3 executing task 73
process 1 of 3 executing task 76
process 1 of 3 executing task 80
process 1 of 3 executing task 83
process 1 of 3 executing task 86
process 1 of 3 executing task 89
process 1 of 3 executing task 92
process 1 of 3 executing task 95
process 1 of 3 executing task 98
process 2 of 3 executing task 1
process 2 of 3 executing task 5
process 2 of 3 executing task 7
process 2 of 3 executing task 10
process 2 of 3 executing task 13
process 2 of 3 executing task 16
process 2 of 3 executing task 19
process 2 of 3 executing task 22
process 2 of 3 executing task 25
process 2 of 3 executing task 28
process 2 of 3 executing task 31
process 2 of 3 executing task 34
process 2 of 3 executing task 37
process 2 of 3 executing task 40
process 2 of 3 executing task 43
process 2 of 3 executing task 46
process 2 of 3 executing task 50
process 2 of 3 executing task 53
process 2 of 3 executing task 56
process 2 of 3 executing task 59
process 2 of 3 executing task 62
process 2 of 3 executing task 65
process 2 of 3 executing task 68
process 2 of 3 executing task 71
process 2 of 3 executing task 74
process 2 of 3 executing task 77
process 2 of 3 executing task 79
process 2 of 3 executing task 82
process 2 of 3 executing task 85
process 2 of 3 executing task 88
process 2 of 3 executing task 91
process 2 of 3 executing task 94
process 2 of 3 executing task 97

GridPACK math module configured on 3 processors

Output of srun python setup.py test (seems to be working)

running test
Searching for nose
Best match: nose 1.3.7
Processing nose-1.3.7-py3.7.egg

Using /global/u2/t/tflynn/gpbuild/GridPACK/python/.eggs/nose-1.3.7-py3.7.egg
running egg_info
writing gridpack_hadrec.egg-info/PKG-INFO
writing dependency_links to gridpack_hadrec.egg-info/dependency_links.txt
writing top-level names to gridpack_hadrec.egg-info/top_level.txt
reading manifest file 'gridpack_hadrec.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'gridpack_hadrec.egg-info/SOURCES.txt'
running build_ext
-- Cray Programming Environment 2.6.2 C
-- NOTE: LOADEDMODULES changed since initial config!
-- NOTE: this may cause unexpected build errors.
-- Cray Programming Environment 2.6.2 CXX
-- pybind11 v2.4.3
statusGRIDPACK_HAVE_GOSS: OFF
statusGRIDPACK_GOSS_LIBRARY:
-- Configuring done
-- Generating done
-- Build files have been written to: /global/u2/t/tflynn/gpbuild/GridPACK/python/build/temp.linux-x86_64-3.7
[  0%] Built target parallel_scripts
Consolidate compiler generated dependencies of target gridpack
[ 50%] Linking CXX shared module ../../../gridpack.cpython-37m-x86_64-linux-gnu.so
[100%] Built target gridpack
hadrec_test (tests.gridpack_test.GridPACKTester) ... ok
hello_test (tests.gridpack_test.GridPACKTester) ... ok
task_test (tests.gridpack_test.GridPACKTester) ... ok

----------------------------------------------------------------------
Ran 3 tests in 4.924s

OK

GridPACK math module configured on 1 processors

powerGridEnv

Make a new conda environment:

conda create --name gpenv --clone base
conda activate gpenv
conda install -c conda-forge gym 
pip install xmltodict
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell -A m3363
cd ~/powerGridEnv/src 
conda activate gpenv
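
These test scripts presumably still need the GRIDPACK_DIR and PYTHONPATH exports from the Python wrapper test section above so that the gridpack module can be found; a quick import check (a sketch):

python -c "import gridpack; print('gridpack imported OK')"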

Output of srun python test_gridpack_env_39bus.py: (seems to be working)

/global/homes/t/tflynn/.conda/envs/gpenv/lib/python3.7/site-packages/gym/spaces/box.py:142: UserWarning: WARN: Casting input x to numpy array.
  logger.warn("Casting input x to numpy array.")
[(0, 0, 1.0, 0.1), (0, 1, 1.0, 0.1), (0, 2, 1.0, 0.1), (0, 3, 1.0, 0.1), (0, 4, 1.0, 0.1), (0, 5, 1.0, 0.1), (0, 6, 1.0, 0.1), (0, 7, 1.0, 0.1), (0, 8, 1.0, 0.1)]
-----------------root path of the rlgc: /global/u2/t/tflynn/powerGridEnv
!!!!!!!!!-----------------start the env
(7,)
------------------- Total steps: 50, Episode total reward without any load shedding actions:  -4601.344304438557
------------------- Total steps: 80, Episode total reward with manually provided load shedding actions:  -1472.2004908672507
volt_ob_noact.shape:  (51, 4)
--------- GridPACK HADREC APP MODULE deallocated ----------
!!!!!!!!!-----------------finished gridpack env testing

Output of srun python test_gridpack_env_300bus.py: (seems to be working)

/global/homes/t/tflynn/.conda/envs/gpenv/lib/python3.7/site-packages/gym/spaces/box.py:142: UserWarning: WARN: Casting input x to numpy array.
  logger.warn("Casting input x to numpy array.")
test_gridpack_env_300bus.py:238: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  plt.subplot(121)
test_gridpack_env_300bus.py:247: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  plt.subplot(122)
[(0, 0, 1.0, 0.1), (0, 1, 1.0, 0.1), (0, 2, 1.0, 0.1), (0, 3, 1.0, 0.1), (0, 4, 1.0, 0.1), (0, 5, 1.0, 0.1), (0, 6, 1.0, 0.1), (0, 7, 1.0, 0.1), (0, 8, 1.0, 0.1)]
-----------------root path of the rlgc: /global/u2/t/tflynn/powerGridEnv
!!!!!!!!!-----------------start the env
finished loading npz file
ob_dim:  142 ac_dim:  34
finished loading the weights to the policy network
Fault tuple is  (0, 0, 1.0, 0.08)
-----one episode testing finished without AI-provided actions, total steps: 80 total reward:  -3623.717146067013
------------------- Episode total reward with AI provided actions:  -3623.717146067013
Fault tuple is  (0, 0, 1.0, 0.08)
-----one episode testing finished without any load shedding, total steps: 61 total reward:  -14496.355926569628
------------------- Episode total reward without any action:  -14496.355926569628
--------- GridPACK HADREC APP MODULE deallocated ----------
!!!!!!!!!-----------------finished gridpack env testing

Running the RL code on one node

Note that to get the script ars_rand_faultandpf_cases_LSTM_gridpack_general.py working on Cori, we had to make a few changes. See the Cori/powerGridEnv branch of this repo for a working version. These changes appear to have been needed for compatibility with the version of Ray that was available on Cori at the time this code was run.

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint haswell -A m3363
conda activate gpenv
srun python ars_rand_faultandpf_cases_LSTM_gridpack_general.py --cores 6 --n_iter 20

Partial output:

2022-01-27 19:41:13,564 INFO services.py:1173 -- View the Ray dashboard at http://127.0.0.1:8265
-----------------root path of the rlgc: /global/u2/t/tflynn/powerGridEnv
Trying to init with ip None and pw None...
Did the init!
This cluster consists of
        1 nodes in total
        64.0 CPU resources in total

Logging data to outputs_training/ars_39_bussys_1_pf_5_faultbus_1_dur_lstm_gridpack_v6/log.txt
Creating deltas table.
Created deltas table.
Initializing multidirection workers.
just set self.num_workers to 1

------------!!!! workers allocation:

------------!!!! total cores: 6  , total directions: 32 ,  onedirection_numofcasestorun:  5
------------!!!! num_workers: 1 ,  repeat:  32 , remain: 0

------------!!!!

Initializing policy.
Initializing optimizer.
Initialization of ARS complete.
Total Time Initialize ARS: 11.715039491653442
select_faultbuses_id:   [4 3 0 2 1]
select_pfcases_id:   [0]
select_fault_cases_tuples:   [(0, 4, 1.0, 0.08), (0, 3, 1.0, 0.08), (0, 0, 1.0, 0.08), (0, 2, 1.0, 0.08), (0, 1, 1.0, 0.08)]
rollout_rewards shape: (32, 2)
deltas_idx shape: (32,)
Maximum reward of collected rollouts: -1674.5928014215574
---Time to print rollouts results:  0.00016307830810546875
---Full Time to generate rollouts:  25.364495038986206

----time to aggregate rollouts:  0.018156766891479492
Euclidean norm of update step: 33.69602337952102
g_hat shape, w_policy shape: (6112,) (6112,)
total time of one step 25.386450052261353
iter  0  done
[('cores', 6), ('decay', 0.9985), ('delta_std', 2), ('deltas_used', 16), ('dir_path', 'outputs_training/ars_39_bussys_1_pf_5_faultbus_1_dur_lstm_gridpack'), ('n_directions', 32), ('n_iter', 20), ('onedirection_numofcasestorun', 5), ('policy_file', ''), ('policy_network_size', [32, 32]), ('policy_type', 'LSTM'), ('rollout_length', 90), ('save_per_iter', 10), ('seed', 589), ('step_size', 1), ('tol_p', 0.001), ('tol_steps', 100)]
-------------------------------------
|            Time |            25.6 |
|       Iteration |               1 |
|   AverageReward |       -3.87e+03 |
|       reward 0: |       -3.94e+03 |
|       reward 1: |       -3.77e+03 |
|       reward 2: |       -3.93e+03 |
|       reward 3: |       -3.93e+03 |
|       reward 4: |       -3.76e+03 |
|       timesteps |        5.12e+03 |
-------------------------------------
total time of save: 0.20161938667297363
select_faultbuses_id:   [2 3 4 0 1]
select_pfcases_id:   [0]
select_fault_cases_tuples:   [(0, 2, 1.0, 0.08), (0, 3, 1.0, 0.08), (0, 4, 1.0, 0.08), (0, 0, 1.0, 0.08), (0, 1, 1.0, 0.08)]
rollout_rewards shape: (32, 2)
deltas_idx shape: (32,)
Maximum reward of collected rollouts: -1380.3128771101492
---Time to print rollouts results:  0.00010728836059570312
---Full Time to generate rollouts:  11.247551679611206

----time to aggregate rollouts:  0.0005881786346435547
Euclidean norm of update step: 35.93837568283732
g_hat shape, w_policy shape: (6112,) (6112,)
total time of one step 11.251289129257202
iter  1  done
total time of save: 1.430511474609375e-06
select_faultbuses_id:   [3 1 0 4 2]
select_pfcases_id:   [0]
select_fault_cases_tuples:   [(0, 3, 1.0, 0.08), (0, 1, 1.0, 0.08), (0, 0, 1.0, 0.08), (0, 4, 1.0, 0.08), (0, 2, 1.0, 0.08)]
rollout_rewards shape: (32, 2)
deltas_idx shape: (32,)
Maximum reward of collected rollouts: -1046.8336988165634
---Time to print rollouts results:  0.00015091896057128906
---Full Time to generate rollouts:  12.794855833053589

----time to aggregate rollouts:  0.0006389617919921875
Euclidean norm of update step: 36.19825363812717
g_hat shape, w_policy shape: (6112,) (6112,)
total time of one step 12.798492431640625
iter  2  done
total time of save: 9.5367431640625e-07
select_faultbuses_id:   [3 4 2 1 0]
select_pfcases_id:   [0]
select_fault_cases_tuples:   [(0, 3, 1.0, 0.08), (0, 4, 1.0, 0.08), (0, 2, 1.0, 0.08), (0, 1, 1.0, 0.08), (0, 0, 1.0, 0.08)]
rollout_rewards shape: (32, 2)
deltas_idx shape: (32,)
Maximum reward of collected rollouts: -1800.67744519864
---Time to print rollouts results:  0.0001068115234375
---Full Time to generate rollouts:  11.743767976760864

----time to aggregate rollouts:  0.0005624294281005859
Euclidean norm of update step: 34.23786550032807
g_hat shape, w_policy shape: (6112,) (6112,)
total time of one step 11.747586965560913
iter  3  done
total time of save: 9.5367431640625e-07
select_faultbuses_id:   [0 3 2 4 1]
select_pfcases_id:   [0]
select_fault_cases_tuples:   [(0, 0, 1.0, 0.08), (0, 3, 1.0, 0.08), (0, 2, 1.0, 0.08), (0, 4, 1.0, 0.08), (0, 1, 1.0, 0.08)]
rollout_rewards shape: (32, 2)
...

Running the RL code on multiple nodes

Our scripts for running on multiple nodes are in the ./cori folder of this repo. They are based on the example Ray submission scripts at https://github.com/NERSC/slurm-ray-cluster. The only change to the start-worker.sh and start-head.sh scripts from that example is that we need to run conda activate gpenv before executing the ray commands. See commit 66bcbcc7a2f1777dcc308d966cc6d35677e09efd of this repository for a working configuration.
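
Concretely, the change to those two scripts amounts to one extra line each (a sketch; the surrounding ray commands come from the NERSC example and may differ in detail):

# in start-head.sh and start-worker.sh, before the existing "ray start ..." lines:
conda activate gpenv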

cd ~/powerGridEnv/cori
sbatch test_32.sh

References

  1. Official GridPACK docs: https://www.gridpack.org/wiki/index.php/How_to_Build_GridPACK
  2. Software Dependencies: https://www.gridpack.org/wiki/index.php/Software_Required_to_Build_GridPACK