esm_runscripts does not stop on model crashes #179

seb-wahl · 2021-08-19T11:37:55Z

If the model (e.g. echam or any component of the coupled setup) crashes in the Fortran code, for example:

569: echam6             0000000000AD8051  MAIN__                    270  echam6.f90
 569: echam6             00000000004178E2  Unknown               Unknown  Unknown
 569: libc-2.17.so       00002AAAAE4AC555  __libc_start_main     Unknown  Unknown
 569: echam6             00000000004177E9  Unknown               Unknown  Unknown
srun: error: gcn2459: task 473: Exited with exit code 66
srun: error: gcn2443: tasks 12,36,60,84,108,132,156,180,204,228,276,372,396,420,444,468,564: Exited with exit code 66
srun: error: gcn2445: tasks 13,37,61,85,109,133,157,181,205,229,277,373,397,421,445,469,493,565: Exited with exit code 66
srun: error: gcn2447: tasks 14,38,62,86,110,134,158,182,206,230,254,398,410,422,434,446,458,470,494,566: Exited with exit code 66
srun: error: gcn2449: tasks 15,27,39,63,87,111,135,147,159,171,183,207,219,231,255,387,399,411,423,435,447,471,495,567: Exited with exit code 66
srun: error: gcn2455: tasks 16,40,64,76,88,100,112,124,136,148,160,172,184,196,208,220,232,244,256,364,388,400,412,424,436,448,472,484,496,568: Exited with exit code 66
srun: error: gcn2469: tasks 10,22,34,46,58,70,82,94,106,130,142,154,166,178,190,202,214,226,238,370,394,418,442,454,466,478,490,502,526,562,574: Exited with exit code 66
srun: error: gcn2471: tasks 11,23,35,47,59,71,83,95,107,131,155,167,179,191,203,215,227,239,275,371,395,419,443,455,467,479,491,503,527,563: Exited with exit code 66
srun: error: gcn2465: tasks 8,20,32,44,56,68,80,92,104,128,140,152,164,176,188,200,212,224,236,248,320,344,368,392,416,440,452,464,476,500,524,548,560,572: Exited with exit code 66
srun: error: gcn2463: tasks 7,19,31,43,55,67,79,91,103,115,127,139,151,163,175,187,199,211,223,235,247,271,319,343,367,391,415,427,439,451,463,475,499,523,571: Exited with exit code 66
srun: error: gcn2461: tasks 6,18,30,42,54,66,78,90,102,114,126,138,150,162,174,186,198,210,222,234,246,270,294,318,342,366,390,414,426,438,450,462,474,486,498,522,534,570: Exited with exit code 66
   0: slurmstepd: error: *** STEP 2965369.0 ON gcn2443 CANCELLED AT 2021-08-19T12:23:10 ***
 691: forrtl: error (78): process killed (SIGTERM)
 691: Image              PC                Routine            Line        Source
 691: oceanx             00000000017C6574  Unknown               Unknown  Unknown
 691: libpthread-2.17.s  00002AAAADE18630  Unknown               Unknown  Unknown

esm_runscripts continues and tries to move files, sets up the next leg of the run, etc. In bash, something like

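if [[ $? -ne 0 ]]; then
    # Report the failure and abort instead of preparing the next leg
    echo "ERROR: the model exited with a non-zero return code; stopping." >&2
    exit 1
fi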

(of course we need to handle echam's possible return code of 127) would do the trick. This has been an issue for us for a long time and can be annoying at times.
Is there a way we can solve this in esm_runscripts?
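
For illustration, the check described above could look roughly like the following sketch. The helper name check_model_exit is hypothetical and not part of esm_runscripts, and treating 127 as success is an assumption based on the echam behaviour mentioned above.

```shell
# Sketch only: check_model_exit is a hypothetical helper, not part of esm_runscripts.
# Returns 0 when the exit code should be treated as success, 1 otherwise.
check_model_exit() {
    rc=$1
    case "$rc" in
        0)   return 0 ;;
        127) # Assumption: echam's known exit code 127 is tolerated as success.
             return 0 ;;
        *)   echo "ERROR: model exited with code ${rc}; stopping the run." >&2
             return 1 ;;
    esac
}

# Intended use directly after the model launch command, for example:
#   srun <model command>
#   check_model_exit "$?" || exit 1
```

Keeping the tolerated codes in one place would make it easy to drop the 127 exception once the model itself returns 0 on success.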


mandresm · 2021-08-19T11:42:47Z

I thought this was solved quite some time ago. Can you please provide the versions and the machine?

denizural · 2021-08-19T12:41:13Z

I am not sure, but esm_runscripts probably does not see that the model has crashed and thinks that it needs to start a new simulation.

I actually developed a better runtime scheduler for this, which checks the Slurm output and decides whether it needs to submit the new job or kill everything, i.e. think of it as echo $?.

Unfortunately, some system admins block access to Slurm's sacct command (e.g. on ollie), so it will not be available on all systems.
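
As a rough sketch of what such a check might look like (not the actual scheduler code), the final state of a finished job could be read from sacct and mapped to a pass/fail decision. The helper names state_is_failure and job_failed are hypothetical, and treating only COMPLETED as success is a simplifying assumption.

```shell
# Sketch only: state_is_failure and job_failed are hypothetical helpers.

# Decide from a Slurm job state string whether the job failed.
# Assumption: only COMPLETED counts as success; an empty state (e.g. sacct
# unavailable or the job not yet in the accounting database) counts as failure.
state_is_failure() {
    case "$1" in
        COMPLETED) return 1 ;;
        *)         return 0 ;;
    esac
}

# Query the state of a finished job via sacct. This requires that sacct is
# accessible on the machine, which is not the case everywhere.
job_failed() {
    jobid=$1
    # -X: allocation only, -n: no header, -P: '|'-delimited output
    state=$(sacct -j "$jobid" -X -n -P --format=State | head -n 1)
    # States such as "CANCELLED by 1234" carry a suffix; keep the first word.
    state=${state%% *}
    state_is_failure "$state"
}
```

A caller could then decide between submitting the next job and stopping, for example with job_failed "$SLURM_JOB_ID" && exit 1, on systems where sacct is permitted.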

@seb-wahl, do you think your problem is related to this issue? If so, I can bring it to the table at Thursday's meeting.

seb-wahl · 2021-08-20T06:40:22Z

Yes, that's exactly what I'm looking for: an equivalent of if [[ $? -ne 0 ]] ; then that stops esm_runscripts and tells the user that the run has crashed. We can discuss the details next week.

pgierz · 2021-08-20T07:08:47Z

We solved the case of 127 exits in ECHAM by just changing it to 0. While I am not a fan of changing the model to fit the infrastructure (in my view, you shouldn't need to reprogram your model to make the rest work), it was the fastest solution.

mandresm · 2021-08-20T07:37:04Z

I am still surprised that ESM-Tools moves forward after the model crashes. If you look at the following lines in slurm.yaml, you'll see that srun: error: is covered there, which means it should be stopping.
https://github.com/esm-tools/esm_tools/blob/bbfed3ff5c63135a02bfe216e3e4881d1d15a360/configs/other_software/batch_system/slurm.yaml#L98-L109

I think it is worth investigating, but I need the versions for that.

seb-wahl · 2021-08-20T08:17:56Z

I guess it's because of frequency: 600.
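
To make the idea of a periodic log check concrete, here is a minimal sketch. It is not how esm_runscripts implements the check: log_has_error and watch_log are hypothetical helpers, the pattern is taken from the srun messages quoted earlier in this thread, and reading the frequency value as an interval in seconds is an assumption.

```shell
# Sketch only: log_has_error and watch_log are hypothetical helpers.

# Return 0 if the log file contains the error pattern, 1 otherwise.
log_has_error() {
    logfile=$1
    pattern=${2:-"srun: error:"}
    # -q: quiet, -F: treat the pattern as a fixed string
    grep -qF -- "$pattern" "$logfile"
}

# Check the log up to max_checks times, sleeping interval seconds in between.
# Returns 1 as soon as the error pattern is seen, 0 if it never appears.
watch_log() {
    logfile=$1
    interval=$2
    max_checks=$3
    i=0
    while [ "$i" -lt "$max_checks" ]; do
        if log_has_error "$logfile"; then
            return 1
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 0
}
```

With a long interval, an error written shortly after one check is only noticed at the next check, so any step that runs in between would not see it.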
