ALBEDO FESOM2 writing 3d output stream issue #396
Comments
OK?! It also seems to be mesh dependent: |
Hi Patrick, I'll look into this. Do you have a template compile script somewhere if I need to play around with it? |
Never mind, already posted. Sorry for not seeing that ;-) |
You can basically use the branch refactoring_albedo_env; there I played around (compiler options, environment variables, albedo job scripts ...) to make FESOM2 work on albedo. |
Are you sure everything is checked in, Patrick? I'd like to directly reproduce your problem on my account, but your job script links in a |
Ahh, no, I did not edit the default namelists for albedo!!! But you can take them directly from my albedo directory /albedo/home/pscholz/fesom2_refactoring/work_dart_pc0. I only edited the environment files and the job scripts for albedo use. |
OK, I found the log files through slurm. Can you please try this also using From
|
At first glance this also seems to point to an out-of-memory or invalid memory access error, if it works in the smaller case but not in the larger one...? |
OK, that's different. I always used the full debug options -g -traceback -check all,noarg_temp_created,bounds,uninit, and with those the run got stuck at a much earlier point that I think is not related to this problem, because here I get stuck after the model has already written data to file ...
|
@pgierz I just tried with only -g -traceback and I can't trigger any error message. Is the error message you found simply from exceeding the wall clock time limit and the node killing off everything? |
No, I was just trying to get to the root of this |
@patrickscholz: Then maybe it is indeed some kind of an allocation error. So if I understand correctly: Case 1: you compile with full debug flags for FESOM, and then get a complaint:
Here's the variable it is complaining about: Line 25 in 00df069
It also seems to be allocated here: Line 747 in 00df069
Case 2: you compile with NEC settings, and then run into a segmentation fault. Maybe something is messing up in the allocation if you have a lot more MPI tasks? |
I'm not sure ... this allocation should happen before the write and not afterwards, since the data are already stored in the file when the model hangs. I have the impression this is another issue. |
@hegish & @pgierz I narrowed the issue down further; it is here: Line 826 in 0d7d80d
Lines 808 to 830 in 0d7d80d
Every time the model writes a 2d slice of the 3d data via call assert_nf(...), it needs longer and longer to write that slice, until it looks like the model has hung. No idea what the exact cause is ...
... But it looks like this problem can be solved by also using Jan's ALEPH workaround on ALBEDO: Lines 813 to 816 in 0d7d80d
When applying Jan's ALEPH workaround, the write time for each 2d data slice drops to ...
... and the model finishes properly!!! |
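One way to confirm that the per-slice write time really grows from level to level is to time each slice write separately and compare the first few against the last few. The following is only a minimal sketch in Python, not FESOM2 code: time_slices and growth_ratio are hypothetical helpers, and write_slice is a placeholder for whatever routine writes one 2d level of a 3d field.

```python
import time
from typing import Callable, List


def time_slices(write_slice: Callable[[int], None], n_levels: int) -> List[float]:
    """Call write_slice(level) once per level and return each call's wall time in seconds."""
    durations = []
    for level in range(n_levels):
        start = time.perf_counter()
        write_slice(level)  # placeholder for the real 2d-slice write
        durations.append(time.perf_counter() - start)
    return durations


def growth_ratio(durations: List[float], window: int = 3) -> float:
    """Mean of the last `window` timings divided by the mean of the first `window`."""
    if len(durations) < 2 * window:
        raise ValueError("need at least 2 * window timings")
    head = sum(durations[:window]) / window
    tail = sum(durations[-window:]) / window
    return tail / head if head > 0 else float("inf")


if __name__ == "__main__":
    # Dummy writer that does nothing; replace with a real slice write to measure I/O.
    timings = time_slices(lambda level: None, n_levels=10)
    print("per-slice seconds:", timings)
    print("late/early ratio:", growth_ratio(timings))
```

A ratio close to 1 means the slices take roughly constant time; a ratio that keeps increasing with the number of levels would match the slowdown described above.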
If the ENABLE_ALEPH_CRAYMPICH_WORKAROUNDS solves things, then there is an issue with MPI on Albedo. Which output precision do you set in the namelist.io? |
@hegish I played around with single precision output. For granting access to albedo, I think Malte Thoma and maybe Paul Gierz are responsible, but I'm not sure they will be available over the holidays. |
The weird thing on Aleph was: the slowdown at each level was much worse for single precision (real4) output. real8 was much faster. |
There seems to be an issue on ALBEDO when writing 3D output variables! The 3d output seems to be written to file; at least the written 3d files are full:
-rw-r--r-- 1 pscholz hpc_user 574M Dec 15 17:17 u.fesom.1958.nc
... but afterwards the stream seems to get stuck at ...
This only happens when defining output streams for 3d variables; as long as I only define 2d variables, the model output seems to be fine.
With these compiler options and environment variables, we reach the same performance on albedo (neglecting the output) as on levante.
Things compile fine; no error message can be triggered!
I tried with and without asynchronous I/O (DISABLE_MULTITHREADING ON/OFF); both times the same problem.
Does anybody (@hegish) have an idea what could be the problem?