feature test issues for rrfs_smoke_conus13km_hrrr_warm #1222

Open
junwang-noaa opened this issue May 17, 2022 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@junwang-noaa
Collaborator

junwang-noaa commented May 17, 2022

Description

PR #1195 added a feature test, rrfs_smoke_conus13km_hrrr_warm, using the suite file FV3_HRRR_smoke. The test owner needs to confirm that the feature test reproduces results with different threading, decomposition, and MPI task counts, and in restart mode, and that it also runs in debug mode. Currently the test fails in the decomposition and debug tests.

To Reproduce:

Check out the branch in PR #1195, then run rrfs_smoke_conus13km_hrrr_warm with different threading, decomposition, and MPI task counts, in restart mode, and in debug mode.

Additional context

Add any other context about the problem here.
Directly reference any issues or PRs in this or other repositories that this is related to, and describe how they are related. Example:

Output

@junwang-noaa
Collaborator Author

The issue was fixed in PR#1257. The issue will be closed.

@SamuelTrahanNOAA
Collaborator

@junwang-noaa This was NOT fixed in #1257. Please re-open this issue so I don't have to make a new one.

@junwang-noaa junwang-noaa reopened this Jul 13, 2022
@junwang-noaa
Collaborator Author

junwang-noaa commented Jul 13, 2022

Sorry, I see that PR #1257 fixed the reproducibility for hrrr_control, not for rrfs_smoke_conus13km_hrrr_warm.

@SamuelTrahanNOAA
Collaborator

Actually, the hrrr_control variants already worked, they just weren't enabled. The reproducibility fix in that PR was for the rap_decomp.

@DeniseWorthen
Collaborator

Can this issue be closed @junwang-noaa @SamuelTrahanNOAA ?

@SamuelTrahanNOAA
Collaborator

No. This problem is not resolved.

@SamuelTrahanNOAA
Collaborator

I can fix the debug and 2threads variants in this PR: #1437. Sadly, I do not yet have a fix for the restart or decomp variants.

However, I suspect the bug in #1436 may be breaking decomp, if it is using data from halo regions. I have no way to fix that bug, or even to confirm my suspicions, since that code goes well beyond my understanding of the boundary generation.

@zach1221
Collaborator

zach1221 commented Jun 2, 2023

I decided to test rrfs_smoke_conus13km_hrrr_warm with the decomposition, restart, and MPI features (I know debug and 2threads should now be passing since #1437 was merged), and it seems everything passed. @SamuelTrahanNOAA, have you had the opportunity to test again recently?

@SamuelTrahanNOAA
Collaborator

They fail for me. How did you test?

You need to use the tests/tests files, not just change environment variables. The RRFS tests ignore several environment variables, and they're always warm starts.

@SamuelTrahanNOAA
Collaborator

The RRFS has hard-coded values for some variables. If you're using an automated tool that tweaks variables, it won't test anything.

These values are hard-coded:

export INPES=12
export JNPES=12
export WARM_START=.true.

All RRFS runs are warm starts.

To do a restart test, you need to set RRFS_RESTART=YES.
For a decomposition test, you need a different tests/tests file with different values for INPES and JNPES.
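
To illustrate why tweaking environment variables does not change these runs, here is a minimal sketch. It assumes the tool sets its variables before the test settings are loaded; the function `rrfs_test_settings` is a hypothetical stand-in for a tests/tests file, not code from the repository.

```shell
# Hypothetical stand-in for a tests/tests file with hard-coded values.
rrfs_test_settings() {
  export INPES=12
  export JNPES=12
  export WARM_START=.true.
}

# An automated tool tweaks the environment first (illustrative values)...
export INPES=6
export JNPES=24
export WARM_START=.false.

# ...but loading the test settings afterwards overwrites every tweak.
rrfs_test_settings

echo "INPES=${INPES} JNPES=${JNPES} WARM_START=${WARM_START}"
```

Under that assumption the run still uses the 12x12 layout and a warm start, so a "decomposition" or "cold start" variant produced this way would test nothing new.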

@SamuelTrahanNOAA
Collaborator

I just retested hera.gnu and I can confirm the situation is unchanged. I'd like to know how @zach1221 ran the tests. This is not the first time someone has configured the RRFS tests incorrectly and falsely reported that the restart and decomp variants work. Is the tool "opnReqTest"? If so, I'll add an "if" statement to rrfs_warm_run.IN to abort the test if that tool is enabled.
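
A rough sketch of what such an "if" guard could look like follows. The variable name `OPNREQ_TEST` and the function `abort_if_opnreqtest` are assumptions made for illustration; the actual flag the tool sets, and the form of the check in rrfs_warm_run.IN, may differ.

```shell
# OPNREQ_TEST is a hypothetical flag name assumed to be "true" when the
# automated tool is driving the run; it is not confirmed by this thread.
abort_if_opnreqtest() {
  if [ "${OPNREQ_TEST:-false}" = "true" ]; then
    echo "ERROR: RRFS tests ignore tool-set variables; use the tests/tests files." >&2
    return 1
  fi
  return 0
}

# Example: with the flag unset, the guard lets the test proceed.
if abort_if_opnreqtest; then
  echo "guard passed"
fi
```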

@zach1221
Collaborator

zach1221 commented Jun 2, 2023

@SamuelTrahanNOAA I see. Well I guess I tested incorrectly. I was just running the tests sequentially out of rt.conf in tests/.
Like, ./rt.sh -a nems -n rrfs_smoke_conus13km_hrrr_warm_debug_decomp intel or ./rt.sh -a nems rrfs_smoke_conus13km_hrrr_warm_restart, etc.

I'll try again with the steps you provided to reproduce. Thank you!

@SamuelTrahanNOAA
Collaborator

I haven't tried that before.

@SamuelTrahanNOAA
Collaborator

SamuelTrahanNOAA commented Jun 2, 2023

Use this:

COMPILE | 13 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_flake,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON | | fv3 |

RUN | rrfs_smoke_conus13km_hrrr_warm                    |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_2threads           |                            |          |
RUN | rrfs_conus13km_hrrr_warm                          |                            | baseline |
RUN | rrfs_smoke_conus13km_radar_tten_warm              |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_decomp             |                            |          |
RUN | rrfs_smoke_conus13km_hrrr_warm_restart            |                            |          | rrfs_smoke_conus13km_hrrr_warm
RUN | rrfs_conus13km_hrrr_warm_restart_mismatch         |                            | baseline | rrfs_conus13km_hrrr_warm
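
For readers unfamiliar with the layout of these lines, the sketch below splits one RUN entry on `|` and trims each field. It assumes the second field is the test name and the fifth field, when present, names the test the run depends on; that reading is inferred from the restart lines above, not from documentation.

```shell
# One RUN entry in the same pipe-separated layout as the list above.
line='RUN | rrfs_smoke_conus13km_hrrr_warm_restart | | | rrfs_smoke_conus13km_hrrr_warm'

# Print field number $2 of a pipe-separated line, with surrounding blanks removed.
field() {
  printf '%s\n' "$1" | awk -F'|' -v n="$2" '{ f = $n; gsub(/^[ \t]+|[ \t]+$/, "", f); print f }'
}

test_name=$(field "$line" 2)
depends_on=$(field "$line" 5)

echo "test:       ${test_name}"
echo "depends on: ${depends_on:-<none>}"
```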

@zach1221
Collaborator

zach1221 commented Jun 2, 2023

@SamuelTrahanNOAA thanks, again. Let me try that now.

@SamuelTrahanNOAA
Collaborator

My branch was not up-to-date with develop, so that test didn't check if the latest version works. It seems the regression test system has changed substantially. I'll have to check if it's even running those tests correctly.

@SamuelTrahanNOAA
Collaborator

The 2threads test doesn't use 2 threads anymore, but the decomp test still changes the decomposition.

@SamuelTrahanNOAA
Collaborator

The restart and decomp do not match the control, but they are executed correctly.

It looks like the 2threads is using ESMF to turn on threading, without providing the mandatory OMP_NUM_THREADS variable that sets the maximum number of threads available to ESMF. I will try correcting this and see if it still passes.

@SamuelTrahanNOAA
Collaborator

The 2threads test still passes if I set OMP_NUM_THREADS (THRD) to 2.
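
As a minimal sketch of that correction, the snippet below exports OMP_NUM_THREADS from THRD and adds a sanity check. The function `check_thread_budget` is hypothetical and only illustrates the idea of failing early when the variable is missing or too small; it is not part of the regression test scripts.

```shell
# THRD is the thread count the test asks for (2 for the 2threads variant).
THRD=2
export OMP_NUM_THREADS="${THRD}"

# Hypothetical guard: fail when OMP_NUM_THREADS is unset or smaller
# than the requested thread count.
check_thread_budget() {
  requested=$1
  available=${OMP_NUM_THREADS:-0}
  if [ "$available" -lt "$requested" ]; then
    echo "ERROR: OMP_NUM_THREADS=${available}, but ${requested} threads requested" >&2
    return 1
  fi
  echo "OK: OMP_NUM_THREADS=${available} covers ${requested} threads"
}

check_thread_budget "${THRD}"
```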

@SamuelTrahanNOAA
Collaborator

The debug_decomp test (rrfs_smoke_conus13km_hrrr_warm_debug_decomp_intel) also fails.
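
The failures reported here are mismatches against the control run's output. The following sketch shows a bit-for-bit comparison using `cmp` on placeholder files; the file names and contents are made up for illustration, and this is not necessarily the comparison the regression test system itself performs.

```shell
# Placeholder outputs standing in for real model files.
workdir=$(mktemp -d)
printf 'field data v1\n' > "${workdir}/control.out"
printf 'field data v1\n' > "${workdir}/restart.out"
printf 'field data v2\n' > "${workdir}/decomp.out"

# Report whether a variant's output is byte-identical to the control.
compare_to_control() {
  if cmp -s "${workdir}/control.out" "${workdir}/$1.out"; then
    echo "$1: matches control"
  else
    echo "$1: DIFFERS from control"
  fi
}

restart_result=$(compare_to_control restart)
decomp_result=$(compare_to_control decomp)
echo "${restart_result}"
echo "${decomp_result}"

rm -rf "${workdir}"
```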

@zach1221
Collaborator

Hi, @SamuelTrahanNOAA. This issue is still under investigation; we'll attempt to keep you updated regularly going forward.

Projects
Status: In Progress
Status: No status
Status: No status
Status: 📋 Backlog
4 participants