
Gaea C6 support for UFSWM #2448

Draft · wants to merge 25 commits into develop

Conversation

@BrianCurtis-NOAA (Collaborator) commented Oct 2, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA (Collaborator Author)

cpld_control_p8 intel fails by timing out, so there's work to do to tweak the configs to better match the C6 hardware.

I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add/adjust/fix any issues, etc.
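To give a sense of the kind of tweak involved, here is a hypothetical sketch of a per-machine stanza plus a per-test wall-clock bump; TPN and WLCLK follow the naming style of the existing test configs, but the values are guesses and not validated C6 settings:

    # hypothetical per-machine block (default_vars.sh style); values illustrative
    MACHINE_ID=${MACHINE_ID:-gaeac6}
    case ${MACHINE_ID} in
      gaeac6)
        export TPN=192   # tasks per node on the C6 AMD nodes (guess, needs tuning)
        ;;
    esac
    # and, in a slow coupled test's config, a longer wall-clock limit:
    export WLCLK=40      # minutes; illustrative bump over the default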

@BrianCurtis-NOAA (Collaborator Author)

Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.

@RatkoVasic-NOAA (Collaborator)

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@sanAkel commented Oct 4, 2024

@BrianCurtis-NOAA Shall I re-try building with these modulefiles/ufs_gaeac6.intel.lua in this PR?

tests/compile.sh Outdated
@@ -95,7 +98,7 @@ export SUITES
set -ex

# Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
(Collaborator)

Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?

(Collaborator Author)

Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?

(Collaborator)

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?


Hi! I'm new to the UFS, but AFAIK, nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.

@sanAkel, Oct 4, 2024

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

(Collaborator)

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

It was added here many years ago and we never tried this SOLO build on any platform. My understanding is that with the nuopc_cap it has to be coupled with something.

(Collaborator)

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Yes, we use MOM6-examples to do standalone testing when debug work is needed (to help GFDL narrow down issues when one of their big PRs is not working as expected in UWM).

(Collaborator)

So you do not use tests/compile.sh to build standalone test, is that correct?

(Collaborator)

So you do not use tests/compile.sh to build standalone test, is that correct?

correct

@DusanJovic-NOAA (Collaborator), Oct 4, 2024

Then we should remove it from compile.sh

@BrianCurtis-NOAA (Collaborator Author)

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************

@BrianCurtis-NOAA (Collaborator Author)

@DusanJovic-NOAA, does this look OK?

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

@DusanJovic-NOAA (Collaborator)


Yes.

@ulmononian (Collaborator) commented Oct 16, 2024

@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA I believe EPIC now has full access to the bil-fire8 project (disk space and compute resources). I was able to run a control_c48 test using this allocation in /gpfs/f6/bil-fire8/scratch/role.epic/ufs-wm_2448 with the run_dir at /gpfs/f6/bil-fire8/scratch/role.epic/RT_RUNDIRS/role.epic/FV3_RT/rt_1552059, but I had to create new baselines since they are not yet staged on C6. It seems like rocoto should be installed on C6 as well (@natalie-perlin).

@jkbk2004 (Collaborator)

@BrianCurtis-NOAA can you sync up the branch? I think I am able to create a baseline on C6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.

@jkbk2004 (Collaborator)

Continuing to see failures in various cases.

atmaero_control_p8_intel failed in run_test
cpld_bmark_p8_intel failed in run_test
cpld_control_ciceC_p8_intel failed in run_test
cpld_control_p8_faster_intel failed in run_test
cpld_control_p8_intel failed in run_test
cpld_control_p8_mixedmode_intel failed in run_test
cpld_control_p8.v2.sfc_intel failed in run_test
cpld_debug_p8_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
regional_atmaq_debug_intel failed in run_test

There are about three different behaviors and error messages:

- cpld_bmark_p8_intel:
 769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657<warn> c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required
- hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel:
592: PE 592: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
592: 0: slurmstepd: error: *** STEP 207205202.0 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 207205202 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192: forrtl: error (78): process killed (SIGTERM)
- regional_atmaq_debug_intel:
srun: error: c6n0014: tasks 0-191: Killed
srun: Terminating StepId=207205194.0
327: forrtl: error (78): process killed (SIGTERM)
327: Image              PC                Routine            Line        Source
327: libpthread-2.31.s  00007F643D216910  Unknown               Unknown  Unknown
327: libc-2.31.so       00007F643A43EB57  __sched_yield         Unknown  Unknown
327: libmpi_intel.so.1  00007F643BECB44F  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643BF5C4B6  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643A7DE41D  MPI_Bcast             Unknown  Unknown
- all other failed cases :
 16: MPICH ERROR [Rank 16] [job id 207205189.0] [Wed Oct 16 21:12:57 2024] [c6n0220] - Abort(1009925903) (rank 16 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
 16: PMPI_Win_create(294)................: MPI_Win_create(base=0x7ffce7fce7a0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc400002a, win=0x7ffce7fce8fc) failed

@ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.
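For reference, a minimal sketch of the libfabric/CXI environment settings these messages point to, exported before the executable is launched; the variable names come directly from the warnings above, but the values are illustrative and have not been validated on C6:

    # from the cxip flow-control warning: move off pure hardware matching
    export FI_CXI_RX_MATCH_MODE=hybrid
    # from the MPICH rendezvous warning (Slingshot 2.1 or later); size is a guess
    export FI_CXI_DEFAULT_TX_SIZE=4096
    # pre-2.1 alternative mentioned in the same warning (value is a guess)
    # export FI_CXI_RDZV_THRESHOLD=65536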

@aerorahul (Contributor) commented Oct 17, 2024

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA
Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6.
Having no delimiter would be even better, as in gaeac5 and gaeac6. Most MACHINE_IDs are formed that way, so it retains that consistency and makes operations such as cut -d. -f unambiguous.
Thanks for your consideration.
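For illustration, assuming the historical machine.compiler form of these IDs (as used in log and baseline file names), dot-delimited field extraction stays trivial when the machine name itself carries no extra punctuation; the string below is hypothetical:

    # hypothetical ID in the machine.compiler form
    id="gaeac6.intel"
    machine=$(cut -d. -f1 <<< "${id}")   # gaeac6
    compiler=$(cut -d. -f2 <<< "${id}")  # intel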

@RatkoVasic-NOAA (Collaborator)


Any combination is OK, as long as they are the same length.

@ulmononian (Collaborator)


@MichaelLueken just FYI regarding the C5/C6 naming conventions. I recall there was a desire to sync the SRW CI/CD pipeline with certain Gaea C5/C6 naming conventions.

@BrianCurtis-NOAA (Collaborator Author)


I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.

@RatkoVasic-NOAA (Collaborator) commented Oct 17, 2024

@BrianCurtis-NOAA @ulmononian @jkbk2004
Since Gaea C5 and Gaea C6 are almost identical, I suggest you expand this PR to include changes for C5 as well.

Changes in rt.sh:
    export LD_PRELOAD=/usr/lib64/libstdc++.so.6
    module load PrgEnv-intel/8.5.0
    module load intel-classic/2023.2.0
    module load cray-mpich/8.1.28
    module load python/3.9.12
Change in ./modulefiles/ufs_gaea.intel.lua:
    stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
    load(pathJoin("stack-intel", stack_intel_ver))
    stack_cray_mpich_ver=os.getenv("stack_cray_mpich_ver") or "8.1.28"
    load(pathJoin("stack-cray-mpich", stack_cray_mpich_ver))
Change in ./tests/run_test.sh:
-    module load stack-intel/2023.1.0 stack-cray-mpich/8.1.25
+    module load stack-intel/2023.2.0 stack-cray-mpich/8.1.28

Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
export FI_VERBS_PREFER_XRC=0

@ulmononian (Collaborator)

Continuing to see failures in various cases. [...] @ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.

Please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0.

This is a known issue inherent to the C5 system; it may also be worth trying on C6.
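A minimal sketch of where that export might sit in a Slurm job card, before the model is launched; the directives, task count, and launch line are illustrative and not the actual fv3_slurm.IN_gaea contents:

    #!/bin/bash
    #SBATCH --job-name=cpld_control_p8   # illustrative
    #SBATCH --nodes=8                    # illustrative

    # workaround discussed above; set before fv3.exe is run
    export FI_VERBS_PREFER_XRC=0

    srun --label -n 192 ./fv3.exe        # illustrative launch line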

@RatkoVasic-NOAA (Collaborator)

@jkbk2004 @BrianCurtis-NOAA
I just ran one of the tests that was failing on C6 (atmaero_control_p8_intel) and used export FI_VERBS_PREFER_XRC=0 in the job card. It passed on C5 (/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_3061724/atmaero_control_p8_intel/)
Can you try it on C6 as well?
It was due to the new system installation, and @ulmononian found the fix in the admins' notes.

@RatkoVasic-NOAA (Collaborator)

@BrianCurtis-NOAA @jkbk2004 @ulmononian
All tests passed on Gaea C5:

/gpfs/f5/epic/scratch/Ratko.Vasic/WM-1.6.0/ufs-weather-model/tests
/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_432914
ECFLOW Tasks Remaining: 0/231
rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

Performing Cleanup...
REGRESSION TEST RESULT: SUCCESS
******Regression Testing Script Completed******

If more work is needed on Gaea C6, I can make a PR now. There are only 4 files that needed changes, provided here.
Did you have time to try the same fix for C6?

@BrianCurtis-NOAA (Collaborator Author)

Let me put all of this together and update this PR.

@DeniseWorthen (Collaborator)

@jkbk2004 What issue are you seeing in the bmark case? Did the SAs suggest this new export variable?

@jkbk2004 (Collaborator)

769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657 c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required


It looks like FI_CXI_RX_MATCH_MODE=hardware by default and the MPI message counters jam up. With hybrid, it runs OK without crashing. I am not sure how MPI communication performance changes with the different parameters. I think the other crashing cases (hafs, atmaq) are related to the MPI/OFI layers, which may indicate an issue with the MPI package installation (some conflict with network hardware support).

@jkbk2004 (Collaborator)

@DeniseWorthen I am not sure whether increasing resources would help the crashing cases (win_allgather(246): OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)). Maybe ulimit -s unlimited would help as well.

@ulmononian (Collaborator)

@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen What's the status of testing on this PR? Are there specific tests failing that we can help look into?

@jkbk2004 (Collaborator) commented Nov 4, 2024

@ulmononian Pretty much the same situation: the hafs, cpld control, and regional atmaq cases are crashing with the MPICH issue. Since the main issue is with MPI/OFI, there is not much we can do on the code manager side; it's a library issue.

@DeniseWorthen (Collaborator)

@ulmononian I was not testing the RTs on C6. I was testing for my WW3+PIO PR.

@ulmononian (Collaborator)

@DeniseWorthen Thanks for clarifying.

@jkbk2004 Is there a recent rundir or log showing which specific tests failed? Otherwise, we will need to run the full suite again.

CoryMartin-NOAA pushed a commit to NOAA-EMC/GDASApp that referenced this pull request Nov 6, 2024
…al-workflow (#1361)

After the recent Gaea-C5 OS upgrade, GDASApp fails to build.
This issue corrects Gaea-C5 build and updates the build scripts to
conform to ufs-wx-model (following ufs-wx-model
ufs-community/ufs-weather-model#2448) and
eventual global-workflow updates.

Refs NOAA-EMC/global-workflow 3011
NOAA-EMC/global-workflow#3011
Refs NOAA-EMC/global-workflow 3032
NOAA-EMC/global-workflow#3032
Resolves #1360
@ulmononian (Collaborator) commented Nov 7, 2024

Seems like there are significant issues with the rocoto scheduler on C6. Jobs that compile/run fine serially are failing to compile when using rocoto...

Based on @RatkoVasic-NOAA's suggestion, I changed the rocoto stanza in rt.sh to:

  gaeac6)
    echo "rt.sh: Setting up gaea c6..."
    if [[ "${ROCOTO:-false}" == true ]] ; then
      module use /ncrc/proj/epic/c6/modulefiles
      module load rocoto/1.3.7
      ROCOTO_SCHEDULER="slurm"
    fi

Something is up.
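One way to sanity-check the rocoto install on C6 independently of rt.sh, using the module path from the stanza above (the workflow and database file names are illustrative; point them at whatever rt.sh generates in the run area):

    module use /ncrc/proj/epic/c6/modulefiles
    module load rocoto/1.3.7
    # advance the workflow once, then inspect task states
    rocotorun  -w rocoto_workflow.xml -d rocoto_workflow.db
    rocotostat -w rocoto_workflow.xml -d rocoto_workflow.db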

@jswhit2 (Collaborator) commented Nov 14, 2024

The submodule pointer for upp needs to be updated to get the c5/c6 support that was recently merged into develop.

@junwang-noaa (Collaborator)

@WenMeng-NOAA already created a PR #2489 to update UPP. Maybe we can combine Wen's PR with this one if this is ready to commit.

@jkbk2004 (Collaborator)

@WenMeng-NOAA already created a PR #2489 to update UPP. Maybe we can combine Wen's PR with this one if this is ready to commit.

Since #2489 needs a baseline update, I think we can schedule the UPP update commit for sometime later next week.

@BrianCurtis-NOAA (Collaborator Author)

@RatkoVasic-NOAA @ulmononian Do either of you have a timeline on when things might come together for this?

@RatkoVasic-NOAA (Collaborator) commented Nov 18, 2024

Hi @BrianCurtis-NOAA, I'm now running rt.sh with the -c option. I already see some failures:

-rw-r--r-- 1 Ratko.Vasic bil-fire8 44 Nov 18 2024 19:35:11 fail_test_atmaero_control_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 47 Nov 18 2024 19:06:13 fail_test_cpld_control_ciceC_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 48 Nov 18 2024 19:13:18 fail_test_cpld_control_p8_faster_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 41 Nov 18 2024 19:06:14 fail_test_cpld_control_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 51 Nov 18 2024 19:07:27 fail_test_cpld_control_p8_mixedmode_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 48 Nov 18 2024 19:06:05 fail_test_cpld_control_p8.v2.sfc_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 39 Nov 18 2024 19:04:29 fail_test_cpld_debug_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 46 Nov 18 2024 19:38:05 fail_test_regional_atmaq_debug_intel

After it finishes, we can see what the errors are.
My work directory is:
/gpfs/f6/bil-fire8/world-shared/Ratko.Vasic/ufs-weather-model/tests
Only 3 tasks left to go. I don't expect more failures.

@RatkoVasic-NOAA (Collaborator) commented Nov 18, 2024

I spoke too early; here is a message from running the hafs model:
594: PE 594: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
This error is fixed here.

@RatkoVasic-NOAA (Collaborator)

In coupled cases there is an MPI error like this:

  9: MPICH ERROR [Rank 9] [job id 207293865.0] [Mon Nov 18 19:04:10 2024] [c6n0294] - Abort(605699855) (rank 9 in comm 0): Fatal error in PMPI_Win_create: Other MPI erro
r, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)
  9:
  9: aborting job:
  9: Fatal error in PMPI_Win_create: Other MPI error, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

Has anyone seen this kind of error before? It is related to Cray/MPI. (@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen)


Successfully merging this pull request may close these issues.

Enable ufs-weather-model on Gaea-C6