compiling on GAEA failure #1666
Replies: 17 comments 14 replies
-
@jiandewang let me check. @natalie-perlin FYI, in case this is an hpc-stack issue on Gaea.
-
Could be a permission-related issue. @natalie-perlin, can you confirm that the stack modules on Gaea are available for all to use?
-
@jiandewang @zach1221 @jkbk2004 Lmod is likely not initialized properly. Running module purge is not the right thing to do on Gaea, so this could be the cause. Login to Gaea is closed right now, apparently due to maintenance.
-
@natalie-perlin Gaea c4 was unavailable for most of last week. It came back online over the weekend, I believe. We need to make sure the current hpc-stack is functional.
-
Maintenance is ongoing. We will test on Gaea once it is done.
-
@natalie-perlin GAEA is back now, so I just tried again and still got the same issue. Apparently Lmod is not initialized here, so my question is: is there any module that I need to pre-load before running rt.sh?
-
I think the default shell is causing the trouble. I asked Bin.Li, Denise, and now Jong to test; none of them had trouble, and all of them use bash as their default shell. Mine is tcsh. Previously I had to switch to bash manually, then do module purge, then run rt.sh, and that worked fine. But this method stopped working 2 months ago.
-
@jiandewang @jkbk2004 @JessicaMeixner-NOAA - A simple test is to compile the ufs-weather-model on Gaea, without running RTs, using the build.sh script approach:
Below is some additional information that may give insight into how and why the modules are set up on Gaea in this way, and how the solution was found for both login nodes and compute nodes. This is what happens when you run /lustre/f2/dev/role.epic/contrib/Lmod_init.sh:
1) The list of currently loaded modules is stored in an array variable.
2) Modules are purged.
3) Lmod is initialized by sourcing the init script for the user's shell from the installed Lmod library.
4) The default system module environment, modules/3.2.11.4, is loaded.
5) The remaining modules determined in step 1 are loaded in hierarchical order.
6) The environment variable MODULESHOME is set to the path corresponding to Lmod.
@jiandewang - please test it again with tcsh. I've made a small adjustment to the Lmod_init.sh script in step 3: instead of sourcing the init script for the user's default shell (tcsh in your case) from the Lmod library, the bash one is always sourced. (It may or may not be a solution.)
There is also a reason why we cannot explicitly specify (i.e., hard-code) the modules to be loaded in step 5 and instead need a flexible way of getting the list of modules in step 1: the list of default modules differs between login nodes and compute nodes, so there is no one-size-fits-all list.
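The sequence described above (remember the loaded modules, purge, re-initialize Lmod from its bash init script, load modules/3.2.11.4, reload the rest, set MODULESHOME) could be sketched roughly as below. This is not the contents of Lmod_init.sh, which is not shown in this thread. The function names and both path arguments are hypothetical; the sketch only assumes the standard `module purge` / `module load` commands and the `LOADEDMODULES` variable, a colon-separated list that both Environment Modules and Lmod maintain.

```shell
# Illustrative sketch only; NOT the actual Lmod_init.sh.

# Print loaded modules one per line, in load order.
# Reads $1 if given, otherwise the LOADEDMODULES environment variable.
saved_modules() {
  _list="${1-${LOADEDMODULES-}}"
  if [ -n "$_list" ]; then
    printf '%s\n' "$_list" | tr ':' '\n'
  fi
}

# Re-initialize Lmod and restore the previously loaded modules.
#   $1: path to Lmod's bash init script (hypothetical, site-specific)
#   $2: Lmod installation directory for MODULESHOME (hypothetical)
reinit_lmod() {
  _init="$1"
  _home="$2"
  _saved="$(saved_modules)"        # remember what was loaded

  module purge                     # drop everything
  . "$_init"                       # always the bash flavor, whatever the login shell
  module load modules/3.2.11.4     # default system module environment (from the comment above)

  for _m in $_saved; do            # reload the rest in the original order
    case "$_m" in
      modules/*) continue ;;       # already loaded just above
    esac
    module load "$_m"
  done

  MODULESHOME="$_home"
  export MODULESHOME
}
```

Reading the list at run time rather than hard-coding it is what lets the same script work on login and compute nodes, whose default module sets differ. A call would look like `reinit_lmod /path/to/lmod/init/bash /path/to/lmod`, where both paths are placeholders.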
-
@jiandewang @jkbk2004 -
-
@natalie-perlin the problem remains that the ufs_gaea.intel modules will not load on the c4 compute nodes (although they will on the front ends). That means the runtime libraries aren't available and the model will fail. I believe this is because only PrgEnv-intel/6.0.10 is available on the c4 compute nodes (PrgEnv-intel/6.0.5, which is what is used for compilation, is not). I have filed a Gaea help ticket on this, but so far there is no resolution.
-
@jswhit - thank you for testing! Yes, there is a difference in PrgEnv modules on C4. But the main issue I see is that the module intel/2021.3.0, which was used to build the hpc-stack, is not available on C4 either (or on C5, for that matter). What is available on all three systems is intel/2022.0.2. The first solution that comes to mind is to rebuild the software stack with intel/2022.0.2. The PrgEnv-intel module is also loaded during the build, so the differences in PrgEnv-intel versions between C3 and C4/C5 would remain, but that may or may not be a dealbreaker.
-
An update on the software stacks built for Gaea: an updated stack has been prepared for the upgraded C3 and C4 partitions on Gaea. The ufs_gaea.intel.lua module loads the stack as follows:
A subset of the regression tests (from the # ATM tests line until the end of the list in rt.conf) has completed successfully.
-
@jkbk2004 @jiandewang -
-
@natalie-perlin sure, you can label it as answered.
-
@jkbk2004 - could somebody who has permissions mark this question as "answered"?
-
answered
-
I got an error compiling UFS on GAEA (using the head of UFS as of today):
Lmod has detected the following error: These module(s) or extension(s) exist
but cannot be loaded as requested: "cmake/3.20.1"
Try: "module spider cmake/3.20.1" to see how to load the module(s).
I ran module purge before launching the job. Is there any step that I missed?
Error log: /lustre/f2/scratch/Jiande.Wang/FV3_RT/rt_42430/compile_001
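Lmod typically prints the "exist but cannot be loaded as requested" message when a module is known to `module spider` but is not reachable from the current MODULEPATH. A quick look at the state of the module system in the failing shell can help narrow that down. The helper below is a diagnostic sketch with a made-up name (`lmod_report`); it is not part of rt.sh or the Gaea setup, and it only reads `LMOD_CMD`, `LMOD_VERSION`, and `MODULEPATH`, which Lmod's init scripts normally export.

```shell
# Diagnostic sketch: report whether Lmod looks initialized in this shell.
lmod_report() {
  # In sh/bash, Lmod defines `module` as a shell function.
  if command -v module >/dev/null 2>&1; then
    echo "module command: defined"
  else
    echo "module command: not defined"
  fi

  # Exported by Lmod's init scripts; "unset" suggests Lmod was never sourced.
  echo "LMOD_CMD: ${LMOD_CMD:-unset}"
  echo "LMOD_VERSION: ${LMOD_VERSION:-unset}"

  # Directories searched by `module load` / `module avail`, one per line.
  echo "MODULEPATH entries:"
  if [ -n "${MODULEPATH:-}" ]; then
    printf '%s\n' "$MODULEPATH" | tr ':' '\n' | sed 's/^/  /'
  fi
}
```

Running `lmod_report` right after `module purge`, followed by the `module spider cmake/3.20.1` command suggested in the error text, shows which modules would have to be loaded first for cmake/3.20.1 to become visible.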