
Gaea C6 support for UFSWM #2448

Draft · wants to merge 25 commits into develop

Conversation

@BrianCurtis-NOAA (Collaborator) commented Oct 2, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA (Collaborator Author)

cpld_control_p8 intel fails by timing out, so there's work to do to tweak the configs to better match the C6 hardware.

I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add/adjust/fix any issues, etc.
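To give a sense of the kind of tweak involved, here is a hypothetical sketch of a per-machine stanza plus a per-test wall-clock bump; TPN and WLCLK follow the naming style of the existing test configs, but the values are guesses and not validated C6 settings:

    # hypothetical per-machine block (default_vars.sh style); values illustrative
    MACHINE_ID=${MACHINE_ID:-gaeac6}
    case ${MACHINE_ID} in
      gaeac6)
        export TPN=192   # tasks per node on the C6 AMD nodes (guess, needs tuning)
        ;;
    esac
    # and, in a slow coupled test's config, a longer wall-clock limit:
    export WLCLK=40      # minutes; illustrative bump over the default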

@BrianCurtis-NOAA (Collaborator Author)

Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.

@RatkoVasic-NOAA (Collaborator)

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@sanAkel commented Oct 4, 2024

@BrianCurtis-NOAA Shall I re-try building with these modulefiles/ufs_gaeac6.intel.lua in this PR?

tests/compile.sh Outdated
@@ -95,7 +98,7 @@ export SUITES
set -ex

# Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
(Collaborator)

Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?

(Collaborator Author)

Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?

(Collaborator)

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?


Hi! I'm new to the UFS, but AFAIK, nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.

@sanAkel, Oct 4, 2024

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

(Collaborator)

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

It was added here many years ago and we never tried this SOLO build on any platform. My understanding is that with the nuopc_cap it has to be coupled with something.

(Collaborator)

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Yes, we use MOM6-examples to do standalone testing when debug work is needed (to help GFDL narrow down issues when one of their big PRs is not working as expected in UWM).

(Collaborator)

So you do not use tests/compile.sh to build standalone test, is that correct?

(Collaborator)

So you do not use tests/compile.sh to build standalone test, is that correct?

correct

@DusanJovic-NOAA (Collaborator), Oct 4, 2024

Then we should remove it from compile.sh

@BrianCurtis-NOAA (Collaborator Author)

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************

@BrianCurtis-NOAA (Collaborator Author)

@DusanJovic-NOAA, does this look OK?

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

@DusanJovic-NOAA (Collaborator)


Yes.

@ulmononian (Collaborator) commented Oct 16, 2024

@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA I believe EPIC now has full access to the bil-fire8 project (disk space and compute resources). I was able to run a control_c48 test using this allocation in /gpfs/f6/bil-fire8/scratch/role.epic/ufs-wm_2448 with the run_dir at /gpfs/f6/bil-fire8/scratch/role.epic/RT_RUNDIRS/role.epic/FV3_RT/rt_1552059, but I had to create new baselines since they are not yet staged on C6. It seems like rocoto should be installed on C6 as well (@natalie-perlin).

@jkbk2004 (Collaborator)

@BrianCurtis-NOAA can you sync up the branch? I think I am able to create a baseline on C6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs.

@jkbk2004 (Collaborator)

Continuing to see failures in various cases.

atmaero_control_p8_intel failed in run_test
cpld_bmark_p8_intel failed in run_test
cpld_control_ciceC_p8_intel failed in run_test
cpld_control_p8_faster_intel failed in run_test
cpld_control_p8_intel failed in run_test
cpld_control_p8_mixedmode_intel failed in run_test
cpld_control_p8.v2.sfc_intel failed in run_test
cpld_debug_p8_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_intel failed in run_test
hafs_regional_storm_following_1nest_atm_ocn_wav_mom6_intel failed in run_test
regional_atmaq_debug_intel failed in run_test

There are about three different behaviors and error messages:

- cpld_bmark_p8_intel:
 769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657<warn> c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required
- hafs_regional_storm_following_1nest_atm_ocn_wav_inline_intel:
592: PE 592: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.]  OFI retry continuing...
592: 0: slurmstepd: error: *** STEP 207205202.0 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 207205202 ON c6n0220 CANCELLED AT 2024-10-16T17:59:54 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
192: forrtl: error (78): process killed (SIGTERM)
- regional_atmaq_debug_intel:
srun: error: c6n0014: tasks 0-191: Killed
srun: Terminating StepId=207205194.0
327: forrtl: error (78): process killed (SIGTERM)
327: Image              PC                Routine            Line        Source
327: libpthread-2.31.s  00007F643D216910  Unknown               Unknown  Unknown
327: libc-2.31.so       00007F643A43EB57  __sched_yield         Unknown  Unknown
327: libmpi_intel.so.1  00007F643BECB44F  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643BF5C4B6  Unknown               Unknown  Unknown
327: libmpi_intel.so.1  00007F643A7DE41D  MPI_Bcast             Unknown  Unknown
- all other failed cases :
 16: MPICH ERROR [Rank 16] [job id 207205189.0] [Wed Oct 16 21:12:57 2024] [c6n0220] - Abort(1009925903) (rank 16 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
 16: PMPI_Win_create(294)................: MPI_Win_create(base=0x7ffce7fce7a0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc400002a, win=0x7ffce7fce8fc) failed

@ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.
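For reference, a minimal sketch of the libfabric/CXI environment settings these messages point to, exported before the executable is launched; the variable names come directly from the warnings above, but the values are illustrative and have not been validated on C6:

    # from the cxip flow-control warning: move off pure hardware matching
    export FI_CXI_RX_MATCH_MODE=hybrid
    # from the MPICH rendezvous warning (Slingshot 2.1 or later); size is a guess
    export FI_CXI_DEFAULT_TX_SIZE=4096
    # pre-2.1 alternative mentioned in the same warning (value is a guess)
    # export FI_CXI_RDZV_THRESHOLD=65536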

@aerorahul (Contributor) commented Oct 17, 2024

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6

@RatkoVasic-NOAA @BrianCurtis-NOAA
Would using _ be amenable instead of -? As in gaea_c5 and gaea_c6.
Having no delimiter would be even better, as in gaeac5 and gaeac6. Most MACHINE_IDs are formed that way, so it retains that consistency and makes operations such as cut -d. -f unambiguous.
Thanks for your consideration.
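For illustration, assuming the historical machine.compiler form of these IDs (as used in log and baseline file names), dot-delimited field extraction stays trivial when the machine name itself carries no extra punctuation; the string below is hypothetical:

    # hypothetical ID in the machine.compiler form
    id="gaeac6.intel"
    machine=$(cut -d. -f1 <<< "${id}")   # gaeac6
    compiler=$(cut -d. -f2 <<< "${id}")  # intel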

@RatkoVasic-NOAA (Collaborator)


Any combination is OK, as long as they are the same length.

@ulmononian (Collaborator)


@MichaelLueken just FYI regarding the C5/C6 naming conventions. I recall there was a desire to sync the SRW CI/CD pipeline with certain Gaea C5/C6 naming conventions.

@BrianCurtis-NOAA (Collaborator Author)


I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow.

@RatkoVasic-NOAA (Collaborator) commented Oct 17, 2024

@BrianCurtis-NOAA @ulmononian @jkbk2004
Since Gaea C5 and Gaea C6 are almost identical, I suggest you expand this PR to include changes for C5 as well.

Changes in rt.sh:
    export LD_PRELOAD=/usr/lib64/libstdc++.so.6
    module load PrgEnv-intel/8.5.0
    module load intel-classic/2023.2.0
    module load cray-mpich/8.1.28
    module load python/3.9.12
Change in ./modulefiles/ufs_gaea.intel.lua:
    stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
    load(pathJoin("stack-intel", stack_intel_ver))
    stack_cray_mpich_ver=os.getenv("stack_cray_mpich_ver") or "8.1.28"
    load(pathJoin("stack-cray-mpich", stack_cray_mpich_ver))
Change in ./tests/run_test.sh:
-    module load stack-intel/2023.1.0 stack-cray-mpich/8.1.25
+    module load stack-intel/2023.2.0 stack-cray-mpich/8.1.28

Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea:
export FI_VERBS_PREFER_XRC=0

@ulmononian (Collaborator)

Continuing to see failures in various cases. [...] @ulmononian @RatkoVasic-NOAA we need troubleshooting from the library side.

Please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0.

This is a known issue inherent to the C5 system; it may also be worth trying on C6.
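A minimal sketch of where that export might sit in a Slurm job card, before the model is launched; the directives, task count, and launch line are illustrative and not the actual fv3_slurm.IN_gaea contents:

    #!/bin/bash
    #SBATCH --job-name=cpld_control_p8   # illustrative
    #SBATCH --nodes=8                    # illustrative

    # workaround discussed above; set before fv3.exe is run
    export FI_VERBS_PREFER_XRC=0

    srun --label -n 192 ./fv3.exe        # illustrative launch line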

@RatkoVasic-NOAA (Collaborator)

@jkbk2004 @BrianCurtis-NOAA
I just ran one of the tests that was failing on C6 (atmaero_control_p8_intel) and used export FI_VERBS_PREFER_XRC=0 in the job card. It passed on C5 (/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_3061724/atmaero_control_p8_intel/)
Can you try it on C6 as well?
It was due to the new system installation, and @ulmononian found the fix in the admins' notes.

@RatkoVasic-NOAA (Collaborator)

@BrianCurtis-NOAA @jkbk2004 @ulmononian
All tests passed on Gaea C5:

/gpfs/f5/epic/scratch/Ratko.Vasic/WM-1.6.0/ufs-weather-model/tests
/gpfs/f5/epic/scratch/Ratko.Vasic/RT_RUNDIRS/Ratko.Vasic/FV3_RT/rt_432914
ECFLOW Tasks Remaining: 0/231
rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

Performing Cleanup...
REGRESSION TEST RESULT: SUCCESS
******Regression Testing Script Completed******

If more work is needed on Gaea C6, I can make a PR now. There are only 4 files that needed changes, provided here.
Did you have time to try the same fix for C6?

@BrianCurtis-NOAA (Collaborator Author)

Let me put all of this together and update this PR.

@DeniseWorthen (Collaborator)

@jkbk2004 What issue are you seeing in the bmark case? Did the SAs suggest this new export variable?

@jkbk2004 (Collaborator)

769: libfabric:2470915:1729115914::cxi:core:cxip_ux_onload_cb():2657 c6n0025: RXC (0x8b2:1) PtlTE 495:[Fatal] LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required


It looks like FI_CXI_RX_MATCH_MODE=hardware by default and the MPI message counters jam up. With hybrid, it runs OK without crashing. I am not sure how MPI communication performance changes with the different parameters. I think the other crashing cases (hafs, atmaq) are related to the MPI/OFI layers, which may indicate an issue with the MPI package installation (some conflict with network hardware support).

@jkbk2004 (Collaborator)

@DeniseWorthen I am not sure whether increasing resources would help the crashing cases (win_allgather(246): OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)). Maybe ulimit -s unlimited would help as well.

@ulmononian (Collaborator)

@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen What's the status of testing on this PR? Are there specific tests failing that we can help look into?

@jkbk2004 (Collaborator) commented Nov 4, 2024

@ulmononian Pretty much the same situation: the hafs, cpld control, and regional atmaq cases are crashing with the MPICH issue. Since the main issue is with MPI/OFI, there is not much we can do on the code manager side; it's a library issue.

@DeniseWorthen (Collaborator)

@ulmononian I was not testing the RTs on C6. I was testing for my WW3+PIO PR.

@ulmononian (Collaborator)

@DeniseWorthen Thanks for clarifying.

@jkbk2004 Is there a recent rundir or log showing which specific tests failed? Otherwise, we will need to run the full suite again.

CoryMartin-NOAA pushed a commit to NOAA-EMC/GDASApp that referenced this pull request Nov 6, 2024
…al-workflow (#1361)

After the recent Gaea-C5 OS upgrade, GDASApp fails to build.
This issue corrects Gaea-C5 build and updates the build scripts to
conform to ufs-wx-model (following ufs-wx-model
ufs-community/ufs-weather-model#2448) and
eventual global-workflow updates.

Refs NOAA-EMC/global-workflow 3011
NOAA-EMC/global-workflow#3011
Refs NOAA-EMC/global-workflow 3032
NOAA-EMC/global-workflow#3032
Resolves #1360
@ulmononian (Collaborator) commented Nov 7, 2024

Seems like there are significant issues with the rocoto scheduler on C6. Jobs that compile/run fine serially are failing to compile when using rocoto...

Based on @RatkoVasic-NOAA's suggestion, I changed the rocoto stanza in rt.sh to:

  gaeac6)
    echo "rt.sh: Setting up gaea c6..."
    if [[ "${ROCOTO:-false}" == true ]] ; then
      module use /ncrc/proj/epic/c6/modulefiles
      module load rocoto/1.3.7
      ROCOTO_SCHEDULER="slurm"
    fi

Something is up.
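One way to sanity-check the rocoto install on C6 independently of rt.sh, using the module path from the stanza above (the workflow and database file names are illustrative; point them at whatever rt.sh generates in the run area):

    module use /ncrc/proj/epic/c6/modulefiles
    module load rocoto/1.3.7
    # advance the workflow once, then inspect task states
    rocotorun  -w rocoto_workflow.xml -d rocoto_workflow.db
    rocotostat -w rocoto_workflow.xml -d rocoto_workflow.db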

@jswhit2 (Collaborator) commented Nov 14, 2024

The submodule pointer for upp needs to be updated to get the c5/c6 support that was recently merged into develop.

@junwang-noaa (Collaborator)

@WenMeng-NOAA already created a PR #2489 to update UPP. Maybe we can combine Wen's PR with this one if this is ready to commit.

@jkbk2004 (Collaborator)

@WenMeng-NOAA already created a PR #2489 to update UPP. Maybe we can combine Wen's PR with this one if this is ready to commit.

Since #2489 needs a baseline update, I think we can schedule the UPP update commit for sometime later next week.

@BrianCurtis-NOAA (Collaborator Author)

@RatkoVasic-NOAA @ulmononian Do either of you have a timeline on when things might come together for this?

@RatkoVasic-NOAA (Collaborator) commented Nov 18, 2024

Hi @BrianCurtis-NOAA, I'm now running rt.sh with the -c option. I already see some failures:

-rw-r--r-- 1 Ratko.Vasic bil-fire8 44 Nov 18 2024 19:35:11 fail_test_atmaero_control_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 47 Nov 18 2024 19:06:13 fail_test_cpld_control_ciceC_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 48 Nov 18 2024 19:13:18 fail_test_cpld_control_p8_faster_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 41 Nov 18 2024 19:06:14 fail_test_cpld_control_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 51 Nov 18 2024 19:07:27 fail_test_cpld_control_p8_mixedmode_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 48 Nov 18 2024 19:06:05 fail_test_cpld_control_p8.v2.sfc_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 39 Nov 18 2024 19:04:29 fail_test_cpld_debug_p8_intel
-rw-r--r-- 1 Ratko.Vasic bil-fire8 46 Nov 18 2024 19:38:05 fail_test_regional_atmaq_debug_intel

After it finishes, we can see what the errors are.
My work directory is:
/gpfs/f6/bil-fire8/world-shared/Ratko.Vasic/ufs-weather-model/tests
Only 3 tasks left to go. I don't expect more failures.

@RatkoVasic-NOAA (Collaborator) commented Nov 18, 2024

I spoke too early; here is a message from running the hafs model:
594: PE 594: MPICH WARNING: OFI is failing to make progress on posting a send. MPICH suspects a hang due to rendezvous message resource exhaustion. If running Slingshot 2.1 or later, setting environment variable FI_CXI_DEFAULT_TX_SIZE large enough to handle the maximum number of outstanding rendezvous messages per rank should prevent this scenario. [ If running on a Slingshot release prior to version 2.1, setting environment variable FI_CXI_RDZV_THRESHOLD to a larger value may circumvent this scenario by sending more messages via the eager path.] OFI retry continuing...
This error is fixed here.

@RatkoVasic-NOAA (Collaborator)

In coupled cases there is an MPI error like this:

  9: MPICH ERROR [Rank 9] [job id 207293865.0] [Mon Nov 18 19:04:10 2024] [c6n0294] - Abort(605699855) (rank 9 in comm 0): Fatal error in PMPI_Win_create: Other MPI erro
r, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)
  9:
  9: aborting job:
  9: Fatal error in PMPI_Win_create: Other MPI error, error stack:
  9: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffd68b76660, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffd68b767bc) failed
  9: MPID_Win_create(89).......:
  9: MPIDIG_mpi_win_create(872):
  9: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

Has anyone seen this kind of error before? It is related to Cray/MPI. (@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen)


Successfully merging this pull request may close these issues.

Enable ufs-weather-model on Gaea-C6