Gaea C6 support for UFSWM #2448
base: develop
cpld_control_p8 intel fails by timing out, so the configs need tweaking to better match the C6 hardware. I think there are still lots of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add, adjust, or fix anything.
Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.
@BrianCurtis-NOAA, name change suggestion:
|
@BrianCurtis-NOAA Shall I re-try building with these |
tests/compile.sh
Outdated
@@ -95,7 +98,7 @@ export SUITES
 set -ex

 # Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?
Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?
If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?
Hi! I'm new to the UFS, but AFAIK nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.
@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples
> If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?
It was added here many years ago and we never tried this SOLO on any platform. My understanding is with nuopc_cap it has to be coupled with something.
> @junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Yes, we use MOM6-examples for standalone tests when they are needed for debugging (to help GFDL narrow down an issue when one of their big PRs is not working as expected in UWM).
So you do not use tests/compile.sh to build standalone test, is that correct?
> So you do not use tests/compile.sh to build standalone test, is that correct?
correct
Then we should remove it from compile.sh
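For reference while deciding whether to drop the block, here is a minimal sketch of what the condition in the diff evaluates to. It is written in POSIX sh rather than with the `[[ ... ]]` form used in tests/compile.sh, and the function name `mom6solo_enabled` is hypothetical (it does not exist in the repository). It assumes, based on the comments above, that the guarded branch is the one that adds -DMOM6SOLO=ON.

```shell
#!/bin/sh
# Hypothetical helper (not part of tests/compile.sh).
# Returns 0 (true) when the guarded branch would run, i.e. when
# MACHINE_ID is not gaea-c5/gaea-c6 OR RT_COMPILER is not intelllvm.
mom6solo_enabled() {
  # $1 = MACHINE_ID, $2 = RT_COMPILER
  case "$1" in
    gaea-c5|gaea-c6)
      # On Gaea the branch runs only for compilers other than intelllvm.
      [ "$2" != "intelllvm" ]
      ;;
    *)
      # Every other machine takes the branch regardless of compiler.
      return 0
      ;;
  esac
}

# Example: the one combination the condition is meant to skip.
if mom6solo_enabled "gaea-c6" "intelllvm"; then
  echo "MOM6SOLO branch: taken"
else
  echo "MOM6SOLO branch: skipped"
fi
```

If nobody builds standalone MOM6 through tests/compile.sh, as the thread suggests, removing the block would make this special case unnecessary.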
cpld_control_p8 fails with:
and control_p8 runs to completion:
|
@DusanJovic-NOAA this look ok?:
|
Yes. |
@BrianCurtis-NOAA @jkbk2004 @FernandoAndrade-NOAA i believe EPIC now has full access to the |
@BrianCurtis-NOAA can you sync up branch? I think I am able to create baseline on c6: /gpfs/f6/bil-fire8/world-shared/role.epic/UFS-WM_RT/NEMSfv3gfs. |
Continue to see failures with various cases.
About 3 different behaviors and error messages:
@ulmononian @RatkoVasic-NOAA we need troubleshooting from lib side. |
@RatkoVasic-NOAA @BrianCurtis-NOAA |
Any combination is OK, as long as the names are the same length.
@MichaelLueken just fyi regarding c5/c6 naming conventions. i recall there was a desire to sync the srw ci/cd pipeline w/ certain gaea c5/c6 naming conventions. |
I'll be going with gaeac6 and gaeac5, FYI. I'll make those changes at some point tomorrow. |
@BrianCurtis-NOAA @ulmononian @jkbk2004
Also adding in ./tests/fv3_conf/fv3_slurm.IN_gaea: |
please try what @RatkoVasic-NOAA has suggested in your job cards, before fv3.exe is run: export FI_VERBS_PREFER_XRC=0. this is a known issue inherent to the c5 system. may also try for c6. |
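A minimal sketch of that suggestion, assuming an sh/bash job card. The function name `set_fabric_env` is hypothetical, the machine names follow the gaeac5/gaeac6 convention mentioned earlier in this thread, and the launch line is left as a comment because the actual launcher command is not shown here. Applying the setting on C6 is only something to try, not a confirmed fix.

```shell
#!/bin/sh
# Hypothetical helper for a job card: export the libfabric setting
# before the model executable is launched.
set_fabric_env() {
  # $1 = machine name (assumed to be gaeac5 or gaeac6 on Gaea)
  case "$1" in
    gaeac5|gaeac6)
      # Known issue on C5 per the comment above; on C6 this is experimental.
      export FI_VERBS_PREFER_XRC=0
      ;;
  esac
}

set_fabric_env "gaeac5"
echo "FI_VERBS_PREFER_XRC=${FI_VERBS_PREFER_XRC:-unset}"

# The model would be launched after this point, for example:
# <launcher and options> ./fv3.exe
```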
@jkbk2004 @BrianCurtis-NOAA |
@BrianCurtis-NOAA @jkbk2004 @ulmononian
If more work is needed on Gaea C6, I can make a PR now. Only 4 files needed changes; they are provided here.
Let me put all of this together and update this PR. |
@jkbk2004 What issue are you seeing in the bmark case? Did SAs suggest this new export variable? |
It looks like FI_CXI_RX_MATCH_MODE=hardware is the default and the MPI message counters jam up. With hybrid, it runs OK without crashing. I am not sure how MPI communication performance changes with the different parameters. I think the other crashing cases (hafs, atmaq) are related to the MPI/OFI layers, which may indicate an issue with the MPI package installation (some conflict with network hardware support).
@DeniseWorthen I am not sure whether increasing resources would help with the crashing cases (win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)). Maybe try ulimit -s unlimited as well.
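The two settings mentioned in the comments above (the hybrid match mode and the stack limit) could be tried together in a job-card preamble along these lines. This is only a sketch under the assumption of an sh/bash job card; whether either setting resolves the hafs or atmaq crashes has not been established in this thread.

```shell
#!/bin/sh
# Assumed job-card preamble, placed before the model launch line.

# Switch the CXI provider from the default hardware matching to hybrid,
# which was reported above to run without crashing.
export FI_CXI_RX_MATCH_MODE=hybrid

# Try to lift the stack size limit; keep going if the hard limit forbids it.
if ! ulimit -s unlimited 2>/dev/null; then
  echo "warning: could not raise stack limit; current value: $(ulimit -s)" >&2
fi

echo "FI_CXI_RX_MATCH_MODE=${FI_CXI_RX_MATCH_MODE}"
```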
@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen what's the status of testing on this PR? are there specific tests failing that we can help look into? |
@ulmononian pretty much the same situation: the hafs, cpld control, and regional atmaq cases are crashing with an mpich issue. Since the main issue is with MPI/OFI, there is not much we can do on the code manager side. It's a library issue.
@ulmononian I was not testing the RTs on C6. I was testing for my WW3+PIO PR. |
@DeniseWorthen thanks for clarifying. @jkbk2004 is there a recent rundir or log of which specific tests failed? otherwise, we will need to run the full suite again. |
…al-workflow (#1361) After the recent Gaea-C5 OS upgrade, GDASApp fails to build. This issue corrects Gaea-C5 build and updates the build scripts to conform to ufs-wx-model (following ufs-wx-model ufs-community/ufs-weather-model#2448) and eventual global-workflow updates. Refs NOAA-EMC/global-workflow 3011 NOAA-EMC/global-workflow#3011 Refs NOAA-EMC/global-workflow 3032 NOAA-EMC/global-workflow#3032 Resolves #1360
seems like significant issues with rocoto scheduler on c6. jobs that compile/run fine serially are failing to compile when using rocoto.... based on @RatkoVasic-NOAA's suggestion, i changed the stanza for rocoto in rt.sh to
something is up |
The submodule pointer for upp needs to be updated to get the c5/c6 support that was recently merged into develop. |
@WenMeng-NOAA already created a PR #2489 to update UPP. Maybe we can combine Wen's PR with this one if this is ready to commit. |
As #2489 needs baseline update, I think we can schedule to commit UPP update sometime later next week. |
@RatkoVasic-NOAA @ulmononian Do either of you have a timeline on when things might come together for this? |
Hi @BrianCurtis-NOAA , I'm now running rt.sh with
Once it finishes, we can see what the errors are.
I spoke too soon; here is the message from running the hafs model:
In coupled cases there is an MPI error like this:
Has anyone seen this kind of error before? It is related to Cray/MPI. (@BrianCurtis-NOAA @jkbk2004 @DeniseWorthen ) |
Commit Queue Requirements:
Description:
This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Log: