esm_runscripts does not stop on model crashes #179

Open
seb-wahl opened this issue Aug 19, 2021 · 6 comments

Comments

@seb-wahl
Contributor

If the model (e.g. echam or any component of the coupled setup) crashes in the fortran code, e.g.

569: echam6             0000000000AD8051  MAIN__                    270  echam6.f90
 569: echam6             00000000004178E2  Unknown               Unknown  Unknown
 569: libc-2.17.so       00002AAAAE4AC555  __libc_start_main     Unknown  Unknown
 569: echam6             00000000004177E9  Unknown               Unknown  Unknown
srun: error: gcn2459: task 473: Exited with exit code 66
srun: error: gcn2443: tasks 12,36,60,84,108,132,156,180,204,228,276,372,396,420,444,468,564: Exited with exit code 66
srun: error: gcn2445: tasks 13,37,61,85,109,133,157,181,205,229,277,373,397,421,445,469,493,565: Exited with exit code 66
srun: error: gcn2447: tasks 14,38,62,86,110,134,158,182,206,230,254,398,410,422,434,446,458,470,494,566: Exited with exit code 66
srun: error: gcn2449: tasks 15,27,39,63,87,111,135,147,159,171,183,207,219,231,255,387,399,411,423,435,447,471,495,567: Exited with exit code 66
srun: error: gcn2455: tasks 16,40,64,76,88,100,112,124,136,148,160,172,184,196,208,220,232,244,256,364,388,400,412,424,436,448,472,484,496,568: Exited with exit code 66
srun: error: gcn2469: tasks 10,22,34,46,58,70,82,94,106,130,142,154,166,178,190,202,214,226,238,370,394,418,442,454,466,478,490,502,526,562,574: Exited with exit code 66
srun: error: gcn2471: tasks 11,23,35,47,59,71,83,95,107,131,155,167,179,191,203,215,227,239,275,371,395,419,443,455,467,479,491,503,527,563: Exited with exit code 66
srun: error: gcn2465: tasks 8,20,32,44,56,68,80,92,104,128,140,152,164,176,188,200,212,224,236,248,320,344,368,392,416,440,452,464,476,500,524,548,560,572: Exited with exit code 66
srun: error: gcn2463: tasks 7,19,31,43,55,67,79,91,103,115,127,139,151,163,175,187,199,211,223,235,247,271,319,343,367,391,415,427,439,451,463,475,499,523,571: Exited with exit code 66
srun: error: gcn2461: tasks 6,18,30,42,54,66,78,90,102,114,126,138,150,162,174,186,198,210,222,234,246,270,294,318,342,366,390,414,426,438,450,462,474,486,498,522,534,570: Exited with exit code 66
   0: slurmstepd: error: *** STEP 2965369.0 ON gcn2443 CANCELLED AT 2021-08-19T12:23:10 ***
 691: forrtl: error (78): process killed (SIGTERM)
 691: Image              PC                Routine            Line        Source
 691: oceanx             00000000017C6574  Unknown               Unknown  Unknown
 691: libpthread-2.17.s  00002AAAADE18630  Unknown               Unknown  Unknown

esm_runscripts continues and tries to move files, sets up the next leg of the run, etc. In bash, something like

if [[ $? -ne 0 ]] ; then
   echo "ERROR: the model crashed, stopping here" >&2
   exit 1
fi

(of course we need to handle echam's possible return code of 127) would do the trick. This has been an issue for us all along and can be annoying at times.
Is there a way we can solve this in esm_runscripts?
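
A minimal Python sketch of the check asked for above, assuming the model is launched as a subprocess and that certain nonzero codes (such as the 127 mentioned for echam) should count as a clean finish. The names `run_model_step`, `ModelCrashed` and `TOLERATED_EXIT_CODES` are hypothetical and not part of esm_runscripts.

```python
import subprocess

# Assumption: exit codes that count as a clean finish.
# 127 is the echam case mentioned above; adjust per model.
TOLERATED_EXIT_CODES = {0, 127}


class ModelCrashed(RuntimeError):
    """Raised when the launched command ends with an unexpected exit code."""


def run_model_step(command, tolerated=TOLERATED_EXIT_CODES):
    """Run `command` (a list of arguments) and return its exit code.

    Raises ModelCrashed when the code is not tolerated, so the caller
    cannot silently continue with file moves or the next leg of the run.
    """
    result = subprocess.run(command)
    if result.returncode not in tolerated:
        raise ModelCrashed(
            f"{command[0]} exited with code {result.returncode}; stopping."
        )
    return result.returncode
```

Raising an exception rather than returning a flag means the post-processing steps are skipped unless the caller explicitly decides to handle the crash.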

@mandresm
Contributor

I thought this was solved quite some time ago. Can you please provide the versions and the machine?

@denizural
Contributor

I am not sure, but esm_runscripts probably does not see that the model has crashed and thinks that it needs to start a new simulation.

I actually developed a better runtime scheduler for that, which checks the Slurm output and decides whether to submit the new job or to kill everything, i.e. think of it as echo $?

Unfortunately, some system admins block access to Slurm's sacct command (e.g. on ollie), so it will not be available on all systems.

@seb-wahl, do you think that your problem is related to this issue? If so, I can bring this to the table at Thursday's meeting.
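
A rough Python sketch of the kind of sacct-based check described above. It is not the scheduler mentioned in this comment; `parse_sacct`, `job_failed` and `query_job` are hypothetical names, and the list of failure states is an assumption that is not exhaustive. The sacct options used (`-j`, `--parsable2`, `--noheader`, `--format`) are standard, but the call only works where sacct is accessible.

```python
import subprocess

# Slurm job states treated as failures in this sketch (not exhaustive).
FAILED_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def parse_sacct(text):
    """Parse `sacct --parsable2 --noheader --format=JobID,State,ExitCode`.

    Returns a list of (job_id, state, exit_code) tuples, one per job step.
    """
    steps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state, exit_field = line.split("|")
        # State can read "CANCELLED by <uid>"; keep only the first word.
        state = state.split()[0]
        # ExitCode is reported as "<exit code>:<signal>".
        exit_code = int(exit_field.split(":")[0])
        steps.append((job_id, state, exit_code))
    return steps


def job_failed(steps):
    """True if any step ended in a failure state or with a nonzero code."""
    return any(state in FAILED_STATES or code != 0 for _, state, code in steps)


def query_job(job_id):
    """Ask Slurm about a job; raises where sacct is blocked or missing."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
         "--format=JobID,State,ExitCode"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sacct(out)
```

For example, `job_failed(query_job(job_id))` could gate the decision to resubmit. On systems where sacct is blocked, `query_job` raises, so a fallback check would still be needed there.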

@seb-wahl
Contributor Author

Yes, that's exactly what I'm looking for: an equivalent of if [[ $? -ne 0 ]] that stops esm_runscripts and tells the user that their run has crashed. We can discuss the details next week.

@pgierz
Member

pgierz commented Aug 20, 2021

We solved the case of 127 exits in ECHAM by just changing it to 0. While I am not a fan of changing the model to fit the infrastructure (in my view, you shouldn't need to reprogram your model to make the rest work), it was the fastest solution.

@mandresm
Contributor

I am still surprised that ESM-Tools moves forward after the model crashes. If you look at the following lines in slurm.yaml, you'll see that srun: error: is accounted for there, which means it should be stopping.
https://github.com/esm-tools/esm_tools/blob/bbfed3ff5c63135a02bfe216e3e4881d1d15a360/configs/other_software/batch_system/slurm.yaml#L98-L109

I think it is worth investigating, but I need the versions for that.
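
To illustrate the idea of catching srun: error: in the job log, here is a minimal Python sketch. It does not reproduce the logic configured in slurm.yaml; `find_errors` and `ERROR_PATTERNS` are hypothetical names, and the patterns are simply taken from the log excerpt at the top of this issue.

```python
import re

# Assumption: strings that signal a crash, taken from the log excerpt above.
ERROR_PATTERNS = [
    re.compile(r"srun: error:"),
    re.compile(r"slurmstepd: error:"),
    re.compile(r"forrtl: error"),
]


def find_errors(log_text, patterns=ERROR_PATTERNS):
    """Return (line_number, line) for each log line matching any pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in patterns):
            hits.append((number, line.strip()))
    return hits
```

A non-empty result from `find_errors` would be the signal to stop rather than move on to the next run.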

@seb-wahl
Contributor Author

I guess it's because of frequency: 600.
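
If frequency: 600 is a polling interval in seconds (an assumption; the thread does not confirm it), a crash that happens between two polls could go unnoticed when monitoring ends together with the job. The sketch below shows one way around that: a final check after the job has ended. `watch`, `is_running` and `check_log` are hypothetical names, not esm_runscripts functions.

```python
import time


def watch(is_running, check_log, interval, sleep=time.sleep):
    """Poll check_log() every `interval` seconds while is_running() is true.

    One last check runs after the job has ended, so errors written between
    the final poll and job exit are still seen.
    Returns True if check_log() ever reported an error.
    """
    while is_running():
        if check_log():
            return True
        sleep(interval)
    # Final check: without it, a crash shortly before exit would be missed.
    return check_log()
```

The `sleep` argument is injectable so the loop can be exercised with a simulated clock instead of real waiting.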
