WCOSS2: pio install does not seem to support pnetcdf #2232
@HangLei-NOAA would you please check the library on wcoss2 and install a test version of netcdf with pio on acorn for us to test? Thanks |
Apologies for forgetting to post here, but there is an install Hang has made at: /lfs/h2/emc/eib/save/Hang.Lei/forgdit/nco_wcoss2/install |
@BrianCurtis-NOAA Are you testing Hang's install? |
Yes. I will do more today once I get today's PR started. |
using the updated libraries:
|
@BrianCurtis-NOAA Please let me know if a specific version of UFS should be used for the testing. I just finished the GSI library task. I will start the UFS test. |
@Hang-Lei-NOAA the develop branch of ufs-weather-model has the issue, use:
but first remove (or comment out): ufs-weather-model/tests/default_vars.sh Lines 781 to 784 in 47c0099
|
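A quick way to disable those lines without deleting them, assuming GNU sed and the line range quoted above, is a sketch like this:

```shell
# Sketch: comment out lines 781-784 of tests/default_vars.sh in place (GNU
# sed), keeping a .bak backup. The line range comes from the comment above
# and may drift as the file changes, so verify it before running.
f=tests/default_vars.sh
if [ -f "$f" ]; then
  sed -i.bak '781,784 s/^/# /' "$f"
fi
```

The backup suffix makes it easy to restore the original file once testing is done.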
@BrianCurtis-NOAA Please use: Please copy it soon; I will do more sensitivity tests on the UFS using system-installed libs this afternoon after 3pm. Thanks. |
@Hang-Lei-NOAA I can confirm that your lua file works for that test. Please proceed with getting these adjustments made on WCOSS2 Dev. |
@junwang-noaa @BrianCurtis-NOAA You can list them with: $ module -t avail 2>&1 | grep -- "-C/." Please test them and let me know if any issues are found. Thanks |
I've been able to load those modules and build/compile a test case OK. I am running the full suite now on Acorn using WCOSS2 setup. I will pass along the results as soon as I can. |
@Hang-Lei-NOAA Bongi needed Acorn for other things today, so I was only able to run a subset of tests with the -C libraries, but they included tests for cpld, control, regional, 2threads, mpi, restarts, p8, gfsv17, decomp (the problem case from before), all with success (PASS). I'm comfortable saying the -C libraries are OK to use on WCOSS2.
|
Okay, let's push forward. |
@Hang-Lei-NOAA Do you have an estimate of when you think this might be resolved? I'm asking in the context of efforts to update the global-workflow (NOAA-EMC/global-workflow#2505) and just trying to figure out the fastest path to updating the model in the global-workflow. Currently the workflow cannot update because HDF5 usage with CICE means you cannot use linked files. I confirmed this is the same behavior with hdf5 on hera as well. While there are plans to move away from linked files in the global-workflow, it will take some time, so I'm curious if this will be available relatively soon. |
@JessicaMeixner-NOAA After modifying netcdf, pio, and esmf, we delivered the installs and have been working closely with GDIT. As of my recent check, they said it will be ready on wcoss2 cactus for testing this Thursday; it has been very fast. These updates are already available on acorn, so you can start testing on acorn. |
Thanks for the information @Hang-Lei-NOAA |
@BrianCurtis-NOAA @junwang-noaa The lib-C series is available on CACTUS for testing. |
@Hang-Lei-NOAA apologies if I missed this information elsewhere, but can you share where exactly this new module file is on Cactus for testing? |
I have a modulefile i'm testing, i'll pass it along if all goes well. |
@BrianCurtis-NOAA I think it would be worthwhile to be able to confirm that the G-W, using linked files, is functional. I presume that is the testing that @JessicaMeixner-NOAA could do in parallel with yours. |
@JessicaMeixner-NOAA It is on prod. It is best to follow Brian's test; he is testing the whole UFS. Just log in to the system and you will see:
module load PrgEnv-intel
module load craype
module load intel
module load cray-mpich
module avail
------------------------------------------------------------------------------------
WCOSS2 Intel Compiled MPI Libraries and Tools
------------------------------------------------------------------------------------
adcirc/v55.10 esmf/7.1.0r fms-C/2023.04
hdf5-C/1.14.0 ncdiag-A/1.1.2 nemsio/2.5.2
netcdf/4.7.4 (D) pio/2.5.10 upp/8.3.0
adcirc/v55.12 (D) esmf/8.0.1 fms/2022.03
hdf5/1.10.6 (D) ncdiag/1.0.0 nemsio/2.5.4 (D)
netcdf/4.9.0 pnetcdf-C/1.12.2 upp/10.0.8 (D)
cdo/1.9.8 (D) esmf/8.1.0 fms/2022.04 (D)
hdf5/1.12.2 ncdiag/1.1.1 (D) nemsiogfs/2.5.3
pio-A/2.5.10 pnetcdf/1.12.2 w3emc/2.7.3
esmf-A/8.4.2 esmf/8.1.1 (D) fms/2023.02.01
mapl-A/2.35.2-esmf-8.4.2 ncio-A/1.1.2 netcdf-A/4.9.2
pio-B/2.5.10 schism/5.11.0 wgrib2/2.0.8_mpi
esmf-B/8.5.0 esmf/8.4.1 hdf5-A/1.14.0
mapl-B/2.40.3 ncio/1.0.0 netcdf-B/4.9.2
pio-C/2.5.10 scotch/7.0.4 wrf_io/1.1.1
esmf-C/8.6.0 fms-A/2023.01 hdf5-B/1.14.0
mapl-C/2.40.3
ncio/1.1.2 (D) netcdf-C/4.9.2 pio/2.5.3 (D) upp/8.2.0 wrf_io/1.2.0 (D)
|
Here's what I am using for testing. /lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/modulefiles/ufs_wcoss2.intel.lua |
Thanks @BrianCurtis-NOAA @DeniseWorthen and @Hang-Lei-NOAA. I will test in the g-w this afternoon using the modules from @BrianCurtis-NOAA and will report back how this goes. |
Here's what I've got from my UFSWM testing; this also includes FMS 2023.04 and ESMF 8.6.0, with MAPL built against that. In compile_atml_debug_intel: |
in cpld_control_gfsv17_iau_intel:
cpld_restart_pdlib_p8 intel (finished but interrupted?) @DeniseWorthen I believe the iceh file not reproducing is expected because it switched to pnetcdf this time, correct? The (finished but interrupted) issue I've seen before, but it's intermittent and not easily reproduced; rerunning is usually successful. @junwang-noaa should the p8 atmlnd (& sbs) tests be running out of wallclock? It almost seems like it hung somewhere and hit the wallclock limit rather than just not being able to complete in time. I recall a hang issue we've seen before, but I'm unsure if it would be remotely related. |
@BrianCurtis-NOAA Just let me know if you need anything from my side. We were not having wall-clock issues with the land tests in the past, right? So I am not sure why they have issues now. Is this particular to one platform? I could also look at those tests and reduce the simulation length or I/O if needed. |
@BrianCurtis-NOAA I would not be surprised if the history file was different. Does the nccmp log give any information about the difference? If you want to place the baseline file and the new run file on hera so I can use cprnc, I can do that. |
I can see in the global attributes that the file was created w/ |
@Hang-Lei-NOAA in test: datm_cdeps_lnd_gswp3_intel
|
@BrianCurtis-NOAA I have been testing it too. |
May I ask what's the reason for switching from hdf5 to pnetcdf? Is it faster, or does it produce smaller files? |
@DusanJovic-NOAA From my CICE I/O tests, pnetcdf was much faster. See https://docs.google.com/spreadsheets/d/1xD0-gvbfI2Nwhf-ys_JdQEHR4Wibb0U5hGyVBBEqNUg/edit#gid=93260697 Here, hdf5 is the namelist setting, which writes through the pio hdf5/netcdf4 interface. |
Do you know whether hdf5 was configured to use compression/shuffling and/or chunking? How big are the output files? |
When I was testing the CICE IO, I did not use any chunking or compression. The namelist allows you to set these, for example
CICE output files are not that large ~500MB for a history and ~2G for a restart at 1/4deg. |
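Independent of a file's global attributes or tools like nccmp, the on-disk container format (HDF5-backed netCDF-4 versus the classic/CDF-5 formats that pnetcdf writes) can be read directly from the file's magic bytes. A minimal sketch (`nc_kind` and the default filename `iceh.nc` are hypothetical, not from the thread):

```shell
# Hypothetical helper: report the container format of a NetCDF file from its
# first four bytes. HDF5-backed files (what the pio hdf5/netcdf4 path writes)
# start with \x89 H D F; classic/CDF-2/CDF-5 files (what pnetcdf writes)
# start with "CDF" followed by a version byte (\x01, \x02, or \x05).
nc_kind() {
  magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' ')
  case "$magic" in
    89484446) echo "netCDF-4 (HDF5-based)";;
    43444601) echo "classic (CDF-1)";;
    43444602) echo "64-bit offset (CDF-2)";;
    43444605) echo "CDF-5 (pnetcdf 64-bit data)";;
    *)        echo "unknown";;
  esac
}
# Example invocation on a history file:
nc_kind "${1:-iceh.nc}"
```

This gives a quick answer to "did this run actually write through pnetcdf?" without needing the netcdf tools loaded.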
As recorded in the GDIT helpdesk ticket [Ticket#2024040310000021], Bongi has |
@Hang-Lei-NOAA |
Bongi provided the recent installation on cactus.
append_path("MODULEPATH",
"/apps/test/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__p-2.5.10__e-8.6.0_pnetcdf/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12")
I have been testing the UFS build based on the version provided by Brian.
Testing still shows runs going over walltime; the reason needs to be diagnosed.
Although the model runs fine with my builds, Bongi said that he is using my script in the recent build. It looks like all files are there now.
I am starting to suspect that some setting in the model may be causing this issue.
(In reply to Rahul Mahajan, May 28: "@Hang-Lei-NOAA Is there an update on this issue? This is really backing up a lot of work.")
|
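For anyone following along, the test stack path quoted above can be picked up with `module use` before loading; a sketch using the -C module names from the earlier `module avail` listing in this thread (the exact set of modules to load is an assumption, not a confirmed recipe):

```shell
# Sketch (not a confirmed recipe): make the test hpc-stack visible, then load
# the -C series libraries shown in the earlier module listing.
module use /apps/test/hpc-stack/i-19.1.3.304__m-8.1.12__h-1.14.0__n-4.9.2__p-2.5.10__e-8.6.0_pnetcdf/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12
module load hdf5-C/1.14.0 netcdf-C/4.9.2 pnetcdf-C/1.12.2 pio-C/2.5.10 esmf-C/8.6.0
```

This is an environment-configuration fragment; it only works on a system with Lmod and that test install present.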
First, the issue that I suspect is:
Besides what ufs_wcoss2.intel.lua defines, some component models separately load other versions of libraries.
@BrianCurtis-NOAA Could you please use Bongi's installations, and further debug what caused the over-walltime runs in UFS.
|
@Hang-Lei-NOAA is this separate from the C-libs? |
@BrianCurtis-NOAA I am using your c-libs version of UFS in testing. The over-walltime run occurs during:
./rt.sh -a GFS-DEV -n "control_p8_atmlnd intel"
|
@Hang-Lei-NOAA would you please list your code directory so that we can take a look at the modules files and the compile job? Thanks |
@junwang-noaa My model directory on cactus is: /lfs/h2/emc/eib/noscrub/hang.lei/c-libs
The modulefile that works using my installation is: /lfs/h2/emc/eib/noscrub/hang.lei/c-libs/modulefiles/ufs_wcoss2.intel.lua.bak
I am still testing with ufs_wcoss2.intel.lua and am not using it yet.
|
I just tried running the
|
Thanks, Denise. @Hang-Lei-NOAA I want to clarify that the ufs_wcoss2.intel.lua is the one with libraries Bongi installed that we are expected to use, right? |
@junwang-noaa It is not; it's my testing one. (2) Bongi's libraries have been gradually converging to my script. I have resolved the netcdf and pio errors, but esmf still has a problem. I insist on using my script, and he is getting close. I could not tell him exactly what the UFS error is, but fully matching my script will solve the issue. He updated it yesterday, but it was wrong; I am waiting for his corrected update. My conversation with Bongi is recorded in the helpdesk ticket. |
Thanks @Hang-Lei-NOAA. There is a separate email chain with ESMF and PIO developers also ongoing. I believe the fact that we can run this test on wcoss2 using your modules shows this is not related to the input file type or esmf functionality, which makes sense since wcoss2 is the only platform showing this issue. |
@Hang-Lei-NOAA To clarify the issues in 1): in the current ufs-weather-model develop branch, ufs_common.lua is copied to modulefiles/ under the run directory, but it is not loaded at runtime. In the job_card, we have:
The modules.fv3 is the same as the ufs_wcoss2.intel.lua. |
@junwang-noaa If you set ufs_common.lua to load esmf/8.5.0 or 8.4.0, but only use ufs_wcoss2.intel.lua to load libraries, your run will crash because esmf/8.5.0 is not available. |
Or if you just remove ufs_common.lua, the runtime error will say there is no ufs_common.lua to source. |
@Hang-Lei-NOAA Are you running with the latest develop branch on wcoss2? |
I run Brian's copy. |
@BrianCurtis-NOAA do you see the issue 1) Hang mentioned? |
No. |
@junwang-noaa I did comparison experiments this morning. Although his esmf build used my settings, it still hits the over-walltime case. All other libs are using his installations. Please refer to the results: we need to find out the reason. Thanks |
I have added ldd in front of fv3.exe. The comparison does not reveal any difference other than the one I found in the esmf build log: diff /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_159983/control_p8_atmlnd_intel/out /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_73442/control_p8_atmlnd_intel/out > zzz.log. The err file shows the loading of libraries, with no difference in what is loaded. Please check if anything is different. |
Here is the esmf build log file difference: |
Description
PR #2145 brought in a change where CICE switched to use pnetcdf in PIO instead of hdf5. This worked on all machines except WCOSS2.
This leads us to believe that the PIO install on WCOSS2 was not built with proper pnetcdf support.
Efforts are ongoing to determine the specifics of any build differences between spack-stack and the hpc-stack on WCOSS2.
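One cheap way to probe that suspicion is to look for pnetcdf's entry points in the installed PIO library itself. This is a heuristic sketch; the `PIO_LIB` variable, default path, and library name `libpioc.a` are assumptions, not taken from the actual WCOSS2 install:

```shell
# Heuristic: a PIO build with pnetcdf support references pnetcdf's ncmpi_*
# entry points, so their absence in the PIO library suggests pnetcdf was
# never linked in. PIO_LIB and libpioc.a are assumed names.
lib="${PIO_LIB:-/path/to/pio/lib}/libpioc.a"
if grep -aq 'ncmpi_' "$lib" 2>/dev/null; then
  echo "ncmpi_* symbols present: PIO appears built with pnetcdf"
else
  echo "no ncmpi_* symbols: PIO likely built without pnetcdf support"
fi
```

On a system with binutils available, `nm` on the library would give a more precise answer than a raw binary grep.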
To Reproduce:
Run cpld_control_gfsv17 intel RT with develop branch of UFSWM (From PR #2145 ) on WCOSS2 dev machine
Also needed alongside solving this issue