HRv4 hangs on orion and hercules #2486

RuiyuSun · 2024-10-30T03:26:48Z

George V. noticed that The HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts. No relevant message in the log files about the hanging.

To Reproduce: Run an HRv4 experiment on Hercules or Orion

Additional context

Output

GeorgeVandenberghe-NOAA · 2024-10-31T18:34:59Z

This happens at high ATM resolution C1152.

RuiyuSun · 2024-11-04T11:38:57Z

I made a HRv4 test run on orion as well. As reported previously, it hung at the beginning of the run.

The log file is at /work2/noaa/stmp/rsun/ROTDIRS/HRv4

HOMEgfs=/work/noaa/global/rsun/git/global-workflow.hr.v4 (source)
EXPDIR=/work/noaa/global/rsun/para_gfs/HRv4
COMROOT=/work2/noaa/stmp/rsun/ROTDIRS
RUNDIRS=/work2/noaa/stmp/rsun/RUNDIRS

LarissaReames-NOAA · 2024-11-05T15:36:26Z

@RuiyuSun Denise reports that the privacy settings on your directories are preventing her from accessing them. Could you check on that and report back when it's fixed so others can look at your forecast?

RuiyuSun · 2024-11-05T16:54:31Z

@DeniseWorthen I made the changes. Please try again.

JessicaMeixner-NOAA · 2024-11-07T15:44:19Z

I've made a few test runs on my end and here are some observations:

This also fails at C768 S2SW
This fails at C1152 S2S (so I do not think this is wave-grid related).

Consistently all runs I have made, also the same as @RuiyuSun runs stall out here:

    0:  fcst_initialize total time:    200.367168849800
    0:  fv3_cap: field bundles in fcstComp export state, FBCount=            8
    0:  af allco wrtComp,write_groups=           4
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to    32768.
 9216:  &MPP_IO_NML
 9216:  HEADER_BUFFER_VAL       =       16384,
 9216:  GLOBAL_FIELD_ON_ROOT_PE = T,
 9216:  IO_CLOCKS_ON    = F,
 9216:  SHUFFLE =           0,
 9216:  DEFLATE_LEVEL   =          -1,
 9216:  CF_COMPLIANCE   = F
 9216:  /
 9216: NOTE from PE     0: MPP_IO_SET_STACK_SIZE: stack size set to     131072.
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 16000000.
 9216:  num_files=           2
 9216:  num_file=           1 filename_base= atm output_file= netcdf_parallel
 9216:  num_file=           2 filename_base= sfc output_file= netcdf_parallel
 9216:  grid_id=            1  output_grid= gaussian_grid
 9216:  imo=        4608 jmo=        2304
 9216:  ideflate=           1
 9216:  quantize_mode=quantize_bitround quantize_nsd=           5
 9216:  zstandard_level=           0
    0:  af wrtState reconcile, FBcount=           8
    0:  af get wrtfb=output_atm_bilinear rc=           0

With high resolution runs (C768 & C1152) for various machines we've had to use different number of write grid tasks. I've tried a few and all are stalling though. This is using ESMF managed threading, so one thing to try might be moving away from that?

To run a high res test case:

git clone --recursive https://github.com/NOAA-EMC/global-workflow
cd global-workflow/sorc
./build_all.sh
./link_workflow.sh
cd ../../
mkdir testdir 
cd testdir 
source ../global-workflow/workflow/gw_setup.sh 
HPC_ACCOUNT=marine-cpu pslot=C1152t02 RUNTESTS=`pwd` ../global-workflow/workflow/create_experiment.py --yaml ../global-workflow/ci/cases/hires/C1152_S2SW.yaml

Change C1152 to C768 to run that resolution and also change your HPC_ACCOUNT, pslot, as desired. Lastly, if you want to turn off waves, you change that in C1152_S2SW.yaml. If you want to change resources, look in global-workflow/parm/config/gfs/config.ufs in the C768/C1152 section.

If you want to run S2S only, change the app in global-workflow/ci/cases/hires/C1152_S2SW.yaml

My latest run log files can be found at:
/work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t0*/COMROOT/C1152t0*/logs/2019120300/gfs_fcst_seg0.log
(several runs are in progress, but they've all been running for over an hour an all hung on the same spot, despite changing write grid tasks).

JessicaMeixner-NOAA · 2024-11-08T13:51:55Z

@GeorgeVandenberghe-NOAA suggested trying 2 write groups with 240 tasks in them. I meant to try that but tried 2 write groups with 360 tasks per group unintentionally, but I did turn on all PET files as @LarissaReames-NOAA thought that might have helpful info.

The rundirectory is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800

The log file is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t06/COMROOT/C1152t06/logs/2019120300/gfs_fcst_seg0.log

The PET logs to me also point to write group issues. Any help with this would be greatly appreciated.

Tagging @aerorahul for awareness.

JacobCarley-NOAA · 2024-11-08T15:42:45Z

Thanks to everyone for the work on this. Has anyone tried this configuration with the write component off? That might help isolate where there problem is (hopefully) and then we can direct this accordingly for further debugging.

JessicaMeixner-NOAA · 2024-11-08T15:44:55Z

I have not tried this without the write component.

DusanJovic-NOAA · 2024-11-08T16:24:43Z

@JessicaMeixner-NOAA and others, I grabbed the run directory from the last experiment you ran (/work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800), changed it to run just ATM component and converted it to run with traditional threading. It is currently running in /work2/noaa/stmp/djovic/stmp/fcst.272800, and it passed the initialization phase and finished writing 000 and 003 hour outputs successfully. I submitted the job with just 30 min wall-clock time limit, so it will fail soon.

I suggest you try running full coupled version with traditional threading if it's easy to reconfigure it.

jiandewang · 2024-11-08T16:30:45Z

some good news:
I tried HR4 tag, the only thing I changed is WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 20 to 10 and model is running Note my run is S2S. See log file at
/work/noaa/marine/Jiande.Wang/HERCULES/HR4/work/HR4-20191203/COMROOT/2019120300/HR4-20191203/logs/2019120300/gfsfcst_seg0.log

jiandewang · 2024-11-08T18:57:05Z

my 48hr run finished

JessicaMeixner-NOAA · 2024-11-08T19:15:02Z

@DusanJovic-NOAA I tried running without ESMF threading - but am struggling to get it set-up correctly and go through. @aerorahul is it expected that turning off esmf managed threading in the workflow should work?

I'm also trying on hercules to replicated @jiandewang's success but with S2SW.

jiandewang · 2024-11-08T19:19:14Z

I also lanched one S2SW but it's still in pending status

JessicaMeixner-NOAA · 2024-11-08T19:22:34Z

WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 with S2S did not work on orion: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t03/COMROOT/C1152t03/logs/2019120300/gfs_fcst_seg0.log

jiandewang · 2024-11-08T19:24:03Z

mine is on hercules

jiandewang · 2024-11-08T19:26:16Z

@JessicaMeixner-NOAA my gut feeling is the issue is related to the memory/node, hercules has more than orion. Maybe you can try 5 on orion

aerorahul · 2024-11-08T19:29:41Z

@DusanJovic-NOAA I tried running without ESMF threading - but am struggling to get it set-up correctly and go through. @aerorahul is it expected that turning off esmf managed threading in the workflow should work?

I'm also trying on hercules to replicated @jiandewang's success but with S2SW.

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

@DusanJovic-NOAA
To run w/ traditional threading, what else did you update in the test case borrowed from @JessicaMeixner-NOAA?

DusanJovic-NOAA · 2024-11-08T19:45:40Z

I only changed ufs.configure:

remove all components except ATM
change globalResourceControl: from true to false
change ATM_petlist_bounds: to be 0 3023 - this numbers are lower and upper bounds of MPI ranks (0 based) used by the ATM model, in this case 24166 + 2360, where 24 and 16 are layout values from input.nml and 2360 are write comp values from model_configure
change ATM_omp_num_threads: from 4 to 1

And, I added job_card by copying one of the job_card from regression test run and changed:

export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads
srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads
#SBATCH --nodes=152
#SBATCH --ntasks-per-node=80

80 is then number of cores on hercules compute nodes
152 is the minimal number of nodes such that 152*80 >= 3024

aerorahul · 2024-11-08T19:51:54Z

I only changed ufs.configure:

remove all components except ATM

change globalResourceControl: from true to false

change ATM_petlist_bounds: to be 0 3023 - this numbers are lowe and upper bounds of MPI ranks used by the ATM model, in this case 24_16_6 + 2_360, where 24 and 16 are layout values from input.nml and 2_360 are write comp values from model_configure

And, I added job_card by copying one of the job_card from regression test run and changed:

export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads

srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads

#SBATCH --nodes=152
#SBATCH --ntasks-per-node=80

80 is then number of cores on hercules compute nodes 152 is the minimal number of nodes such that 152*80 >= 3024

Ok. Yes. That makes sense for the atm-only.
Does your ufs.configure have a line for

ATM_omp_num_threads:            @[atm_omp_num_threads]

@[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false?

The original value for ATM_petlist_bounds must have been 0 755 that you changed to 0 3023, I am assuming.

GeorgeVandenberghe-NOAA · 2024-11-08T19:56:13Z

OMP_NUM_THREADS performance is i*nconsistent and generally poor if* ATM_omp_num_threads: @[atm_omp_num_threads] is not removed when esmf managed threading is set to false.

…

On Fri, Nov 8, 2024 at 7:52 PM Rahul Mahajan ***@***.***> wrote: I only changed ufs.configure: 1. remove all components except ATM 2. change globalResourceControl: from true to false 3. change ATM_petlist_bounds: to be 0 3023 - this numbers are lowe and upper bounds of MPI ranks used by the ATM model, in this case 24_16_6 + 2_360, where 24 and 16 are layout values from input.nml and 2_360 are write comp values from model_configure And, I added job_card by copying one of the job_card from regression test run and changed: 1. export OMP_NUM_THREADS=4 - where 4 is a number of OMP threads 2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is a number of MPI ranks, 4 is a number of threads 3. #SBATCH --nodes=152 #SBATCH --ntasks-per-node=80 80 is then number of cores on hercules compute nodes 152 is the minimal number of nodes such that 152*80 >= 3024 Ok. Yes. That makes sense for the atm-only. Does your ufs.configure have a line for ATM_omp_num_threads: @[atm_omp_num_threads] @[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false? The original value for ATM_petlist_bounds must have been 0 755 that you changed to 0 3023, I am assuming. — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FR2UXPLHUID674GWZLZ7UI7BAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRVGY2DCMBSGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

DusanJovic-NOAA · 2024-11-08T19:56:28Z

I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4, I'm not sure if it's ignored when globalResourceControl is set to false

The original value for ATM_petlist_bounds was something like 12 thousand or something like that, that included MPI ranks times 4 threads.

GeorgeVandenberghe-NOAA · 2024-11-08T19:59:08Z

Yes ESMF managed threading requires several times more ranks and ESMF fails when rank count goes above 21000 or so. This is a VERY serious issue for resolution increases unless it is fixed.. reported in February.

…

On Fri, Nov 8, 2024 at 7:56 PM Dusan Jovic ***@***.***> wrote: I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4, I'm not sure if it's ignored when globalResourceControl is set to false The original value for ATM_petlist_bounds was something like 12 thousand or something like that, that included MPI ranks times 4 threads. — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FW3WOMQFATDADHXU53Z7UJQJAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRVGY2TAMRYGE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

aerorahul · 2024-11-08T20:01:57Z

@JessicaMeixner-NOAA
I think the global-workflow is coded to use the correct ufs_configure template and set the appropriate values for PETLIST_BOUNDS and OMP_NUM_THREADS in the ufs_configure file.
The default in the global-workflow is to use ESMF_THREADING = YES. I am pretty sure one could use traditional threading as well, but is an unconfirmed fact as there was still work being done to confirm traditional threading will work on WCOSS2 with the slignshot updates and whatnot. Details on that are fuzzy to me at the moment.

BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~

GeorgeVandenberghe-NOAA · 2024-11-08T20:06:11Z

I have MANY test cases that use traditional threading and have converted others from managed to traditional threading. It's generally needed at high resolution to get decent run rates.

…

On Fri, Nov 8, 2024 at 8:02 PM Rahul Mahajan ***@***.***> wrote: @JessicaMeixner-NOAA <https://github.com/JessicaMeixner-NOAA> I think the global-workflow is coded to use the correct ufs_configure template and set the appropriate values for PETLIST_BOUNDS and OMP_NUM_THREADS in the ufs_configure file. The default in the global-workflow is to use ESMF_THREADING = YES. I am pretty sure one could use traditional threading as well, but is an unconfirmed fact as there was still work being done to confirm traditional threading will work on WCOSS2 with the slignshot updates and whatnot. Details on that are fuzzy to me at the moment. BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~ — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FVGPKZCGQO7R37N6HLZ7UKE5AVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRVGY2TQMJYHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

aerorahul · 2024-11-08T20:10:32Z

Ok. @GeorgeVandenberghe-NOAA. Where do we employ traditional threading C768 and up? If so, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up.

GeorgeVandenberghe-NOAA · 2024-11-08T20:59:42Z

I don't know because I usually get CWD testcases from others and work from there but yes that's an excellent idea. We probably though should also use a multiple stanza MPI launcher for the different components to minimize core wastage for components that don't thread, particularly WAVE

…

On Fri, Nov 8, 2024 at 8:11 PM Rahul Mahajan ***@***.***> wrote: Ok. @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA>. Where do we employ traditional threading C768 and up? If so, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up. — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FQG5MHORVYQWBE3TY3Z7ULE7AVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRVGY3TANRTGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

JessicaMeixner-NOAA · 2024-11-12T14:23:40Z

Unfortunately I was unable to replicate @jiandewang hercules success for HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for orion either.

JessicaMeixner-NOAA · 2024-11-12T14:24:39Z

Unfortunately I was unable to replicate @jiandewang hercules success for HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for orion either.

Note this was with added waves - so this might have also failed for @jiandewang if he has used waves.

jiandewang · 2024-11-12T15:14:08Z

summary for more tests I did on HERCULES:
(1) S2S, fv3 layout=8x16, write task per group=10, runs fine, further repeated 3 cases, all fine
(2) same as (1) but layout=24x16, hang
(3) repeat (1) and (2) but S2SW, all hang

DeniseWorthen · 2024-11-16T13:15:27Z

I can confirm that with my DATM-S2SW test and using the points list in Dusan's run directory, I see a initialization time of ~27 minutes

20111001       0       0 WW3 InitializeRealize time:  1631.157

/work2/noaa/stmp/dworthen/stmp/dworthen/ww3pio/datm.hr4

My previous test had used the points list I got from Jiande, which contains only 611 points vs Dusan's 4264.

I had actually looked into this long-point finding time a bit, because George had mentioned it during the scalability meetings. It looks to me that every DE is searching the global domain for whether the global point list is contained w/in the grid. But each DE has a copy of the same ntri triangle list, as well as the global point list. So why does every DE need to do the search? And, for that matter, only the nappnt processor outputs the points, which are all retrieved from global arrays (either global input fields or va). So maybe only nappnt should do the search. I could be wrong, I haven't spent a lot of time on it.

EDIT: Also note that because of the 'negative longitude' problem, WW3 actually only ends up finding ~40% of the available points anyway:

Point output requested for  1624 points

JessicaMeixner-NOAA · 2024-11-16T13:25:05Z

@DeniseWorthen - We have two issues with the long point initialization. 1. some mis-match between grid and points that is not properly taken care of (NOAA-EMC/WW3#1273) and 2. The fact that the search algorithm is not the fastest and even slower if it has to go through every one b/c of the mismatch issue. (NOAA-EMC/WW3#1179). Once issue 1 is solved if there's still an issue my plan is to pre-process this part. This is an issue and is a top priority for me to get fixed ASAP. While the slow initialization needs to be avoided for this debugging work, I believe it's a separate, unrelated issue to the hanging we're seeing on orion/hercules.

Also not to muddy the waters here, but one of the things I'm trying to work on is to run a different grid for the wave model in the g-w and get everything set-up to do some testing - I'm running into issues with the other grid where things are segfaulting on the write grid component. This is a guess - but my guess is we found magic combinations of things that worked and now that we've slightly changed things those combinations no longer work and likely are related to the hanging issues but cannot prove that. I'm trying various combinations of the write grid component tasks to see if I can't find something that works. I am doing this today because I hadn't been able to make a successful run post cactus maintenance (didn't get enough chance to say anything with certainty or rule out user error) but wanted to get things in before the maintenance on dogwood in case wcoss2 runs also became an issue for c1152 post maintenance.

DeniseWorthen · 2024-11-16T13:34:57Z

@JessicaMeixner-NOAA I agree the "hang" is almost certainly due to the point output search.

For the point output search, it is the is_in_ungrid call you reference in your issue (https://github.com/NOAA-EMC/WW3/blob/7705171721e825d58e1e867e552e328fc812bfdd/model/src/w3triamd.F90#L1604) is the one which may need only to be called by the nappnt processor in w3init? Each DE has a copy of the ntri array that is being searched.

    IF ( FLOUT(2) ) CALL W3IOPP ( NPT, XPT, YPT, PNAMES, IMOD )
#ifdef W3_PDLIB
    CALL DEALLOCATE_PDLIB_GLOBAL(IMOD)
#endif

EDIT---oops, just re-read your post. You don't think the hang is related to the the point search. Hmmm....I guess it depends on how long people have waited for before declaring the job "hung"?

JessicaMeixner-NOAA · 2024-11-16T13:45:38Z

@DeniseWorthen - Let me rephrase a little. I agree that the hangs you are seeing are due to the wave initialization - which is in urgent need of a solution. However, we can get around the wave model initialization hang issue by reducing the number of points and ensuring the points we're looking for are from 0 to 360. If you do that, I still think we're going to get model hangs on orion/hercules (for example, we still have hangs on orion with S2S, we did find a combo that worked for S2S on hercules, but I think we'll still have that hang if we add waves back in but with a different point list like I had in my cases with different sets of points). That's why I want to make sure we update the ww3_shel.inp so that we are not having the known wave initialization issue cause problems.

JessicaMeixner-NOAA · 2024-11-16T14:01:44Z

Also - udpate on my WCOSS2 runs. Reducing the total number of write grid components seems to have helped. I'll post more details on Monday.

JessicaMeixner-NOAA · 2024-11-18T14:18:36Z

Also - udpate on my WCOSS2 runs. Reducing the total number of write grid components seems to have helped. I'll post more details on Monday.

On WCOSS2 running with a different wave grid, I got a segfault (full log file is on dogwood /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test03/COMROOT/Test03/logs/2020021300/gfs_fcst_seg0.log):

zeroing coupling accumulated fields at kdt=           12
nid001107.dogwood.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps =               11 time is         0.792372
nid001652.dogwood.wcoss2.ncep.noaa.gov 9216:   d3d_on= F
nid001652.dogwood.wcoss2.ncep.noaa.gov: rank 9280 died from signal 9
nid001553.dogwood.wcoss2.ncep.noaa.gov 2056: forrtl: error (78): process killed (SIGTERM)

Rank 9280 is a write grid component task as
ATM_petlist_bounds: 0 10175
ATM_omp_num_threads: 4
layout = 24,16

For this run I had:
write_groups: 4
write_tasks_per_group: 60

Changing this to:
write_groups: 2
write_tasks_per_group: 120

The successful log file is here: /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test04/COMROOT/Test04/logs/2020021300/gfs_fcst_seg0.log

I suspect the issues we see on WCOSS2 are similar to what we've seen on hercules/orion but manifesting in segfaults versus hanging, but I could be wrong.

GeorgeVandenberghe-NOAA · 2024-11-18T14:39:54Z

We should be able to figure out analytically what resources the write grid components require. I was talking to Jun about that. Model state is order ~120GB and UPP makes a second copy for.. reasons. THat gets us to 240GB and UPP scratch spaces may eat up another 100GB or so spread through the nodes used by the write grid component. WCOSS2 has 512GB so that should be easily enough and I am puzzled it's not. One think that helps is to make sure only one write grid component is on a node and to do that the ranks per I/O group should be an integral multiple of ppn which is typically 128/4 for these runs. 120 and 60 don't meet this requirement. 128 and 64 (and 32) do. On hercules/orion it should be 40/cpus-per-task, typically 20 or 10.

…

On Mon, Nov 18, 2024 at 9:19 AM Jessica Meixner ***@***.***> wrote: Also - udpate on my WCOSS2 runs. Reducing the total number of write grid components seems to have helped. I'll post more details on Monday. On WCOSS2 running with a different wave grid, I got a segfault (full log file is on dogwood /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test03/COMROOT/Test03/logs/2020021300/gfs_fcst_seg0.log): zeroing coupling accumulated fields at kdt= 12nid001107.dogwood.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 11 time is 0.792372nid001652.dogwood.wcoss2.ncep.noaa.gov 9216: d3d_on= Fnid001652.dogwood.wcoss2.ncep.noaa.gov: rank 9280 died from signal 9nid001553.dogwood.wcoss2.ncep.noaa.gov 2056: forrtl: error (78): process killed (SIGTERM) Rank 9280 is a write grid component task as ATM_petlist_bounds: 0 10175 ATM_omp_num_threads: 4 layout = 24,16 For this run I had: write_groups: 4 write_tasks_per_group: 60 Changing this to: write_groups: 2 write_tasks_per_group: 120 The successful log file is here: /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test04/COMROOT/Test04/logs/2020021300/gfs_fcst_seg0.log I suspect the issues we see on WCOSS2 are similar to what we've seen on hercules/orion but manifesting in segfaults versus hanging, but I could be wrong. — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FX2L53JUS2W6H2DR5T2BHZNHAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBTGE3TKNBXGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

DeniseWorthen · 2024-11-18T22:34:33Z

In the job cards I got from Rahul's sandboxes, the nodes are specified as either

#SBATCH --nodes=63-63

for C768 and

#SBATCH --nodes=82-82

for C1152.

I'm not familiar w/ this notation. What does 82-82 mean?

GeorgeVandenberghe-NOAA · 2024-11-18T23:18:06Z

This looks like a gaea6 job card. Maybe useful on gaea5 or maybe now slurm supports this everywhere and I just didn't know it

…

On Mon, Nov 18, 2024 at 5:35 PM Denise Worthen ***@***.***> wrote: In the job cards I got from Rahul's sandboxes, the nodes are specified as either #SBATCH --nodes=63-63 for C768 and #SBATCH --nodes=82-82 for C1152. I'm not familiar w/ this notation. What does 82-82 mean? — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FUBXA4PRAUKIE35B6L2BJTRDAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBUGI3TINZVGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

DeniseWorthen · 2024-11-18T23:28:53Z

@GeorgeVandenberghe-NOAA Since you seem to know, what does specifying the nodes like this mean?

GeorgeVandenberghe-NOAA · 2024-11-18T23:55:05Z

I have no idea.

…

On Mon, Nov 18, 2024 at 6:29 PM Denise Worthen ***@***.***> wrote: @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> Since you seem to know, what does specifying the nodes like this mean? — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FQNGG4AMPMVJSDZY4T2BJZ4XAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBUGM4DCMRRGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

JacobCarley-NOAA · 2024-11-19T18:52:26Z

In the job cards I got from Rahul's sandboxes, the nodes are specified as either
#SBATCH --nodes=63-63
for C768 and
#SBATCH --nodes=82-82
for C1152.

I'm not familiar w/ this notation. What does 82-82 mean?

Hi @DeniseWorthen. Here's the relevant snippet from the slurm documentation on sbatch:

-N, --nodes=[-maxnodes]|<size_string>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. Node count can be also specified as size_string. The size_string specification identifies what nodes values should be used. Multiple values may be specified using a comma separated list or with a step function by suffix containing a colon and number values with a "-" separator. For example, "--nodes=1-15:4" is equivalent to "--nodes=1,5,9,13". The partition's node limits supersede those of the job. If a job's node limits are outside of the range permitted for its associated partition, the job will be left in a PENDING state. This permits possible execution at a later time, when the partition limit is changed. If a job node limit exceeds the number of nodes configured in the partition, the job will be rejected. Note that the environment variable SLURM_JOB_NUM_NODES will be set to the count of nodes actually allocated to the job. See the ENVIRONMENT VARIABLES section for more information. If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requested resources as expressed by per-job specification options, e.g. -n, -c and --gpus. The job will be allocated as many nodes as possible within the range specified and without delaying the initiation of the job. The node count specification may include a numeric value followed by a suffix of "k" (multiplies numeric value by 1,024) or "m" (multiplies numeric value by 1,048,576).
NOTE: This option cannot be used in with arbitrary distribution.

So, I'm pretty sure it's just specifying the minimum and maximum number of nodes the job can run with. In this case they are the same.

DeniseWorthen · 2024-11-19T19:08:50Z

@JacobCarley-NOAA Thanks. That makes sense.

DeniseWorthen · 2024-11-20T13:35:25Z

I copied Rahul's C768 run directory (also created my own fix subdir) and compiled both top-develop and the HR4 tag in debug mode using

./compile.sh hercules "-DAPP=S2SW -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON -DDEBUG=ON" s2sw.dev.db intel  NO NO 2>&1 | tee s2sw.dev.db.log

When I try c768s2sw_gfsfcst.sh, both dev and the tag give me a seg fault (they don't even start):

 159: [hercules-01-36:826937:0:826937] Caught signal 8 (Floating point exception: floating-point invalid operation)
6082: ==== backtrace (tid: 630246) ====
6082:  0 0x000000000005f14c ucs_callbackq_cleanup()  ???:0
6082:  1 0x000000000005f40a ucs_callbackq_cleanup()  ???:0
6082:  2 0x0000000000054d90 __GI___sigaction()  :0
6082:  3 0x0000000000048f52 ucp_proto_perf_envelope_make()  ???:0
6082:  4 0x0000000000054bbc ucp_proto_select_elem_trace()  ???:0
6082:  5 0x0000000000056261 ucp_proto_select_lookup_slow()  ???:0
6082:  6 0x0000000000056725 ucp_proto_select_short_init()  ???:0
6082:  7 0x000000000004bc1c ucp_worker_add_rkey_config()  ???:0
6082:  8 0x00000000000648ff ucp_proto_rndv_ctrl_init()  ???:0
6082:  9 0x0000000000064aff ucp_proto_rndv_rts_init()  ???:0
6082: 10 0x0000000000054a42 ucp_proto_select_elem_trace()  ???:0
6082: 11 0x0000000000056261 ucp_proto_select_lookup_slow()  ???:0
6082: 12 0x0000000000056725 ucp_proto_select_short_init()  ???:0
6082: 13 0x000000000004b789 ucp_worker_get_ep_config()  ???:0
6082: 14 0x00000000000a159c ucp_wireup_init_lanes()  ???:0
6082: 15 0x00000000000339ce ucp_ep_create_to_worker_addr()  ???:0
6082: 16 0x0000000000034b33 ucp_ep_create()  ???:0
6082: 17 0x00000000000078bb mlx_av_insert()  mlx_av.c:0
6082: 18 0x00000000006595fb fi_av_insert()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_domain.h:414
6082: 19 0x00000000006595fb insert_addr_table_roots_only()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:448
6082: 20 0x00000000006595fb MPIDI_OFI_mpi_init_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:1604
6082: 21 0x00000000002296f4 MPID_Init()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1544
6082: 22 0x00000000004ce935 MPIR_Init_thread()  /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:175
6082: 23 0x00000000004ce935 PMPI_Init_thread()  /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:318
6082: 24 0x000000000117376d ESMCI::VMK::init()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:423
6082: 25 0x00000000012f9e3f ESMCI::VM::initialize()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:3200
6082: 26 0x00000000009da3c5 c_esmc_vminitialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/interface/ESMCI_VM_F.C:1186
6082: 27 0x0000000000cc6810 esmf_vmmod_mp_esmf_vminitialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/interface/ESMF_VM.F90:9321
6082: 28 0x0000000000b1bc47 esmf_initmod_mp_esmf_frameworkinternalinit_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/ESMFMod/src/ESMF_Init.F90:711
6082: 29 0x0000000000b2140e esmf_initmod_mp_esmf_initialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/ESMFMod/src/ESMF_Init.F90:401
6082: 30 0x0000000000431e9c MAIN__()  /work/noaa/nems/dworthen/ufs-weather-model/driver/UFS.F90:97
6082: 31 0x0000000000431abd main()  ???:0
6082: 32 0x000000000003feb0 __libc_start_call_main()  ???:0
6082: 33 0x000000000003ff60 __libc_start_main_alias_2()  :0
6082: 34 0x00000000004319d5 _start()  ???:0
6082: =================================

Run directory (hercules): /work2/noaa/stmp/dworthen/c768s2sw.2

GeorgeVandenberghe-NOAA · 2024-11-20T15:25:35Z

For hercules my current snapshot is on/work2/noaa/noaatest/gwv/herc/hr4j/the rundir is ./dc the source dir is ./sorc and
to build, I cd to ./sorc/ufs_model.fd, load the compilers and set
export PREFIX=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
export NETP=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
export CMAKE_PREFIX_PATH=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
export ESMFMKFILE=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240/ESMF_8_5_0/lib/esmf.mk
With this done, the following script
rm -rf build
mkdir build
cd build
export CMAKE_PREFIX_PATH=$NETP/fms.2024.01:$NETP
cmake .. -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1,FV3_GFS_v17_coupled_p8_ugwpv1,FV3_global_nest_v1 -DPDLIB=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON
make -j 8 VERBOSE=1
builds it. I am sick and tired of broken stacks and just gave up and built my own :-( However I do think this would work with the current Hercules spack-stack.. haven't tried it.

DeniseWorthen · 2024-11-20T21:15:27Z

I checked again that my configuration was a copy of Rahul's c768 run directory. I used the compile in debug mode and it fails immediately w/ the error I posted above. That run directory is /work2/noaa/stmp/dworthen/c768s2sw

I then used Dusan's instructions posted earlier for using traditional threading. He did it by removing all other components except ATM, so I made a similar adjustment w/ all components included. Using the same executable, it ran for 25 minutes of calendar time. That run directory is /work2/noaa/stmp/dworthen/c768s2sw.2. I used here job_card so check the out and err files.

I haven't yet tried the 2nd case w/ a non-debug compile. I did confirm that the first case hangs w/ a release compile.

Also, I made the WW3 points list only 240 long in both cases. (See ww3_shel.nml, which is being used in my tests since it is easy then to point to a different point list.)

GeorgeVandenberghe-NOAA · 2024-11-21T18:50:41Z

Okay on Hercules
24x32 ATM decomposition, two threads per task ESMF RESOURCE CONTROL FALSE , 4 I/O groups, 160 MPI ranks per group. 240 OCN tasks, 120 ICE tasks 1400 WAVE tasks. 32 tasks per node

On Orion 24x24 ATM decomposition 2 groups of 240 I/O tasks 240 OCN 120 ICE 998 WAVE tasks. 2 threads per task. 16 tasks per node

GeorgeVandenberghe-NOAA · 2024-11-21T18:51:13Z

The hangs I reported earlier seem to happen at higher decompositions and resource usages. Running that down.

GeorgeVandenberghe-NOAA · 2024-11-21T18:51:46Z

The problem what we can't quickly find WHERE in the various component(s) we are getting stuck remains an issue.

DeniseWorthen · 2024-11-22T18:38:21Z

Based on my testing, the issue seems to be fundamentally one w/ using ESMF managed threading. I've been doing all my testing in /work2/noaa/stmp/dworthen/hangtests, with sub-dirs there for ESMF-managed threading (ESMFT) and traditional threading (TRADT).

I can run the test case w/ traditional threading w/ the G-W executable (from Rahu's sandbox), with my own compile and with my own debug compile.

I cannot run with ESMF managed threading either w/ the G-W executable, my own compile or my own debug compile. I've tried w/ and w/o waves. In all cases, I either get a hang or, with debug, I get the error I posted about regarding floating point exception.

@JacobCarley-NOAA I think at this point it is not a WAV issue, assuming you reduce the points list to something small. I think others are better suited to debugging it. That will allow me to return my focus to the grid-imprint issue (#2466), which I know is also very high priority.

BrianCurtis-NOAA · 2024-11-22T18:43:39Z

I wonder if there was a build option missed that is causing managed threading to not work correctly?

BrianCurtis-NOAA · 2024-11-22T18:44:27Z

What I mean is with the ESMF library built in those stacks.

JessicaMeixner-NOAA · 2024-11-22T18:47:33Z

@JacobCarley-NOAA - as a near-term work around I plan to request a feature in the global-workflow to add traditional threading to enable orion/hercules in the near term, unless you'd prefer a different path forward?

GeorgeVandenberghe-NOAA · 2024-11-22T19:09:12Z

I get it at high rank counts (8000 or so) without managed threading on orion.

…

On Fri, Nov 22, 2024 at 6:44 PM Brian Curtis ***@***.***> wrote: I wonder if there was a build option missed that is causing managed threading to not work correctly? — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FWZ5AD4TRRHV5NFW3L2B53PDAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJUGUYTOOJRGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

JacobCarley-NOAA · 2024-11-22T20:10:18Z

@DeniseWorthen Thanks so much for your efforts. Please proceed to return to the grid imprint issue (#2466).

@JessicaMeixner-NOAA I think the ability to run with traditional threading (no managed threading) was added to GW earlier this year (see GW Issue 2277). However, I'm not sure if it's working. If it's not, I'd recommend proceeding with opening a new issue for this feature. Since something might already exist, hopefully it's not too much of a lift to get it going. This will hopefully get you working in the short-ish term.

Now, there's still something going on that we need understand. @GeorgeVandenberghe-NOAA Would you be able to continue digging into this issue?

JessicaMeixner-NOAA · 2024-11-22T20:12:35Z

@JacobCarley-NOAA a comment from @aerorahul earlier in this thread:

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

I'll open a g-w issue (update: g-w issue: NOAA-EMC/global-workflow#3122)

GeorgeVandenberghe-NOAA · 2024-11-22T20:27:16Z

I intend to but if I encounter hangs I need people who know the component codes to figure out where and why the hangs are occurring. Debugging is very slow on Orion where I have encountered a hang with 7008 mpi ranks, 1400 wave ranks and 24x32 atm decomposition WITHOUT esmf managed threading. It looks like an issue with large numbers of ranks which we get first with ESMF managed threading but eventually at higher resolution, without this setting too. This is DIFFERENT from the ESMF bug where we still can't spawn more than 21K ranks without a segfault in the ESMF code somewhere.

…

On Fri, Nov 22, 2024 at 8:10 PM JacobCarley-NOAA ***@***.***> wrote: @DeniseWorthen <https://github.com/DeniseWorthen> Thanks so much for your efforts. Please proceed to return to the grid imprint issue (#2466 <#2466>). @JessicaMeixner-NOAA <https://github.com/JessicaMeixner-NOAA> I *think* the ability to run with traditional threading (no managed threading) was added to GW earlier this year (see GW Issue 2277 <NOAA-EMC/global-workflow#2277>). However, I'm not sure if it's working. If it's not, I'd recommend proceeding with opening a new issue for this feature. Since something might already exist, hopefully it's not too much of a lift to get it going. This will hopefully get you working in the short-ish term. Now, there's still something going on that we need understand. @GeorgeVandenberghe-NOAA <https://github.com/GeorgeVandenberghe-NOAA> Would you be able to continue digging into this issue? — Reply to this email directly, view it on GitHub <#2486 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANDS4FTJAR5H5W2G2BW2YPT2B6FUFAVCNFSM6AAAAABQ3GJUOKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJUG4YTKMBWGE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- George W Vandenberghe *Lynker Technologies at * NOAA/NWS/NCEP/EMC 5830 University Research Ct., Rm. 2141 College Park, MD 20740 ***@***.*** 301-683-3769(work) 3017751547(cell)

JacobCarley-NOAA · 2024-11-22T20:28:30Z

Thanks @GeorgeVandenberghe-NOAA! Just send me a quick note offline (email is fine) when you need a component expert to jump in and I'll be happy to coordinate accordingly.

GeorgeVandenberghe-NOAA · 2024-11-25T18:39:58Z

It looks like the hangs are related to the total number of WAVE tasks but are also related to total resource usage.

I have verified that a 16x16 decomposition (ATM) with traditional threads (two per rank) and 1400 wave ranks does not hang on either Orion or Hercules but a 24x32 decomposition with 1400 wave ranks does. 998 rank runs do get through with a 24x32 decomposition. So it looks like total job resources is a contributing issue. It isn't just a hard barrier that we can't run 1400 wave tasks on orion or hercules.

RuiyuSun added the bug Something isn't working label Oct 30, 2024

JessicaMeixner-NOAA mentioned this issue Nov 22, 2024

Traditional threading as option for forecast model NOAA-EMC/global-workflow#3122

Open

HRv4 hangs on orion and hercules #2486

HRv4 hangs on orion and hercules #2486

Comments

RuiyuSun commented Oct 30, 2024 • edited Loading

To Reproduce: Run an HRv4 experiment on Hercules or Orion

Additional context

Output

GeorgeVandenberghe-NOAA commented Oct 31, 2024

RuiyuSun commented Nov 4, 2024

LarissaReames-NOAA commented Nov 5, 2024

RuiyuSun commented Nov 5, 2024

JessicaMeixner-NOAA commented Nov 7, 2024

JessicaMeixner-NOAA commented Nov 8, 2024

JacobCarley-NOAA commented Nov 8, 2024

JessicaMeixner-NOAA commented Nov 8, 2024

DusanJovic-NOAA commented Nov 8, 2024

jiandewang commented Nov 8, 2024

jiandewang commented Nov 8, 2024

JessicaMeixner-NOAA commented Nov 8, 2024

jiandewang commented Nov 8, 2024

JessicaMeixner-NOAA commented Nov 8, 2024

jiandewang commented Nov 8, 2024

jiandewang commented Nov 8, 2024

aerorahul commented Nov 8, 2024

DusanJovic-NOAA commented Nov 8, 2024 • edited Loading

aerorahul commented Nov 8, 2024

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

DusanJovic-NOAA commented Nov 8, 2024

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

aerorahul commented Nov 8, 2024

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

aerorahul commented Nov 8, 2024

GeorgeVandenberghe-NOAA commented Nov 8, 2024 via email

JessicaMeixner-NOAA commented Nov 12, 2024

JessicaMeixner-NOAA commented Nov 12, 2024

jiandewang commented Nov 12, 2024

DeniseWorthen commented Nov 16, 2024 • edited Loading

JessicaMeixner-NOAA commented Nov 16, 2024

DeniseWorthen commented Nov 16, 2024 • edited Loading

JessicaMeixner-NOAA commented Nov 16, 2024 • edited Loading

JessicaMeixner-NOAA commented Nov 16, 2024

JessicaMeixner-NOAA commented Nov 18, 2024

GeorgeVandenberghe-NOAA commented Nov 18, 2024 via email

DeniseWorthen commented Nov 18, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Nov 18, 2024 via email

DeniseWorthen commented Nov 18, 2024

GeorgeVandenberghe-NOAA commented Nov 18, 2024 via email

JacobCarley-NOAA commented Nov 19, 2024

DeniseWorthen commented Nov 19, 2024

DeniseWorthen commented Nov 20, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Nov 20, 2024

DeniseWorthen commented Nov 20, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Nov 21, 2024

GeorgeVandenberghe-NOAA commented Nov 21, 2024

GeorgeVandenberghe-NOAA commented Nov 21, 2024

DeniseWorthen commented Nov 22, 2024

BrianCurtis-NOAA commented Nov 22, 2024

BrianCurtis-NOAA commented Nov 22, 2024

JessicaMeixner-NOAA commented Nov 22, 2024

GeorgeVandenberghe-NOAA commented Nov 22, 2024 via email

JacobCarley-NOAA commented Nov 22, 2024

JessicaMeixner-NOAA commented Nov 22, 2024 • edited Loading

GeorgeVandenberghe-NOAA commented Nov 22, 2024 via email

JacobCarley-NOAA commented Nov 22, 2024

GeorgeVandenberghe-NOAA commented Nov 25, 2024

RuiyuSun commented Oct 30, 2024 •

edited

Loading

DusanJovic-NOAA commented Nov 8, 2024 •

edited

Loading

DeniseWorthen commented Nov 16, 2024 •

edited

Loading

DeniseWorthen commented Nov 16, 2024 •

edited

Loading

JessicaMeixner-NOAA commented Nov 16, 2024 •

edited

Loading

DeniseWorthen commented Nov 18, 2024 •

edited

Loading

DeniseWorthen commented Nov 20, 2024 •

edited

Loading

DeniseWorthen commented Nov 20, 2024 •

edited

Loading

JessicaMeixner-NOAA commented Nov 22, 2024 •

edited

Loading