
Inconsistent directory hierarchy #11

Open
jesusff opened this issue Nov 21, 2024 · 12 comments
Labels
problem with data Known problem in Stage-0 data stored at DKRZ

Comments


jesusff commented Nov 21, 2024

The directory hierarchy is inconsistent at the moment, with half of the simulations providing a version directory (e.g. v20241121) between the variable folder and the actual files:

[image: CORDEX_FPSURBRCC_DKRZ_varlist]

And the other half not including it:

[image: CORDEX_FPSURBRCC_DKRZ_varlist_noversion]

The protocol says that this version folder should be present (but there are inconsistent examples). In any case, we need to make a decision and change the file hierarchy and/or the protocol accordingly.
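
A minimal sketch of how such inconsistencies could be spotted, assuming the data root below is a placeholder and that version folders follow the vYYYYMMDD pattern; this is not the actual check used at DKRZ:

```python
# Hypothetical sketch: walk a data root and report directories that hold netCDF files
# directly, without a vYYYYMMDD version folder above them.
import re
from pathlib import Path

VERSION_RE = re.compile(r"^v\d{8}$")  # e.g. v20241121

def dirs_missing_version(root: str):
    """Yield parent directories of .nc files that are not a version folder."""
    seen = set()
    for nc in Path(root).rglob("*.nc"):
        parent = nc.parent
        if parent not in seen and not VERSION_RE.match(parent.name):
            seen.add(parent)
            yield parent

for d in dirs_missing_version("/work/CORDEX-FPS-URB-RCC"):  # placeholder root
    print(d)
```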

Apart from that, there are many other inconsistencies regarding the variable frequencies, version_realization strings, source_ids including the institution, ... (check here for your institution). All this needs to be fixed ASAP so we can go on with the analyses.

jesusff added the problem with data (Known problem in Stage-0 data stored at DKRZ) label on Nov 21, 2024
@SiuSunChun

I agree with Chus. It will be helpful for us to document our outputs in our community paper.


jesusff commented Dec 2, 2024

The files tracking the status of the stored output have just been updated (607be13). I paste here the variables available for those simulations not including the version folder:

[image: variables available for simulations without a version folder]

Those with the version folder appear here and are routinely updated.


jesusff commented Dec 3, 2024

The protocol has been made consistent and all simulation paths should include the version folder.


LluisFB commented Dec 3, 2024

Should all the paths contain the same version, or can it be institution-dependent?


jesusff commented Dec 4, 2024

This is institution-dependent, as it is a representative date for a given simulation output: the day it was finished, the day you started uploading the data to the server, or any other meaningful date. It is important not to mix dates if the data come in the same batch, e.g. do not assign a different version to a given file just because it was uploaded one day after the rest. Decide on a date for a given batch and use it consistently.
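
A small sketch of the "one version per batch" rule, with illustrative placeholder folder and file names rather than the actual DRS:

```python
# Hypothetical sketch: every file from the same batch gets the same vYYYYMMDD tag,
# even if some files are uploaded a day later than the rest.
from pathlib import Path

batch_version = "v20241204"  # one representative date chosen for the whole batch

def versioned_path(variable_dir: str, filename: str) -> Path:
    """Place a file under its variable folder and the single version folder of its batch."""
    return Path(variable_dir) / batch_version / filename

print(versioned_path("tas", "tas_placeholder_day_200001010000-200012312300.nc"))
```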


LluisFB commented Dec 4, 2024

Thanks, much clearer now!
It seems one needs a PhD in CMORization to get it all right ;)


jesusff commented Dec 4, 2024

One needs a PhD and several postdocs on the topic to get it nearly right...

I managed to get at least the paths close to right (in the fixed-drs-paths branch), rebuilding the directory structure and linking the files from the faulty one. With this, I could plot all simulations together:

https://raw.githubusercontent.com/FPS-URB-RCC/STAGE-0_Analysis/refs/heads/fixed-drs-path/docs/CORDEX_FPSURBRCC_DKRZ_varlist.png

The image is devastating, as the matrix should be packed with blue squares. It makes it easy to spot naming problems: if your model contributes to making the matrix sparse, you very likely have problems with your data.
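
A minimal sketch of the kind of re-linking described above, under assumed paths and an assumed default version tag; this is not the actual fixed-drs-paths implementation:

```python
# Build a corrected tree of symlinks next to the faulty one, inserting a version folder
# where it is missing, so the original data are left untouched.
import os
from pathlib import Path

def relink_with_version(src_root: str, dst_root: str, version: str = "v20241203"):
    """Symlink every netCDF file into a parallel tree that always has a version folder."""
    for nc in Path(src_root).rglob("*.nc"):
        parts = list(nc.relative_to(src_root).parts)
        # Insert the version folder if the file's parent is not already one (assumed vYYYYMMDD).
        if len(parts) < 2 or not parts[-2].startswith("v2"):
            parts.insert(-1, version)
        target = Path(dst_root).joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            os.symlink(nc.resolve(), target)

# relink_with_version("/work/stage0", "/work/stage0_fixed")  # placeholder roots
```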


yoselita commented Dec 5, 2024

@jesusff one suggestion would be to create the same table with only the variables requested by fpsurbrcc, following the data-request table, and to compare what we have with what we should have. The groups shared non-requested variables, and with different names, which I guess explains so many gaps.

jesusff added a commit that referenced this issue Dec 5, 2024

LluisFB commented Dec 5, 2024

So, to understand the figure: the more blue, the better the post-processing?
If a cell is not colored, does that mean not that the variable is missing, but that there is an issue with one of the N requirements of the CMORization?


LluisFB commented Dec 5, 2024

Also, maybe this should be another thread, but again: to tell whether a variable is instantaneous or statistically processed, does one need to open the file and retrieve this information from the attributes (cell_methods, etc.)? It cannot be inferred directly from the file name.
For example, in ESGF, monthly values are recognized as:
tasmax_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc
This is not our case.


yoselita commented Dec 5, 2024

All the experiments provide a table with the list of variables required by the experiment (CORDEX, FPSCONV, FPS-URBAN, etc.). In the table for FPS-URBAN, it is explicitly indicated in which form each variable is required (column ag). Sometimes the model provides instantaneous values, but the accumulated form is required. In WRF, an example of this is the surface fluxes: in the case of hfss, the variable is calculated from HFX as [HFX(time1)+HFX(time2)]/2.
The idea is that everybody follows these specifications, and the final user will know by looking at the table (column ag) whether the variable is accumulated or instantaneous. Everybody should do the same, so that variables are comparable.
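
An illustrative sketch only, using plain NumPy and assumed variable names, of how the [HFX(time1)+HFX(time2)]/2 rule above turns instantaneous WRF HFX output into the averaged hfss requested by the table:

```python
# Average consecutive instantaneous values onto the intervals between output times.
import numpy as np

def hfss_from_hfx(hfx_inst: np.ndarray) -> np.ndarray:
    """Time-average instantaneous HFX onto output intervals: [HFX(t1)+HFX(t2)]/2."""
    return 0.5 * (hfx_inst[:-1] + hfx_inst[1:])

hfx_inst = np.array([120.0, 140.0, 100.0])  # instantaneous sensible heat flux (W m-2)
print(hfss_from_hfx(hfx_inst))              # [130. 120.]
```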

I think the example you are showing is for a GCM, where file naming is a bit different. I believe that in Amon, the A refers to the atmospheric part of the GCM and mon is the monthly frequency.


jesusff commented Dec 5, 2024

> @jesusff one suggestion would be to create the same table with only the variables requested by fpsurbrcc, following the data-request table, and to compare what we have with what we should have. The groups shared non-requested variables, and with different names, which I guess explains so many gaps.

Yes, this would lead to a cleaner table. I will do it (#19). However, this will hide not only extra variables/frequencies but also wrong names, which we still need to fix.

> So, to understand the figure: the more blue, the better the post-processing?
> If a cell is not colored, does that mean not that the variable is missing, but that there is an issue with one of the N requirements of the CMORization?

The figure is not that advanced. Colored cells only indicate that a particular variable and frequency (x axis) is available for a particular model and realization (y axis). It only checks the filenames; many other things can go wrong inside the file. Still, if you look at the x axis, you will see many variables that were not requested and also some with wrong spelling. In the parts of the plot where your model is the only one showing blue (i.e. no other model provides that variable), there is likely a problem.
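
A rough sketch of how such an availability matrix can be built from file names alone; the root path and the underscore positions used to split the filename are assumptions, not the actual plotting script:

```python
# Map (model, realization) -> set of "variable (frequency)" strings, using filenames only.
from collections import defaultdict
from pathlib import Path

def availability(root: str):
    """Collect which variable/frequency combinations each model/realization provides."""
    table = defaultdict(set)
    for nc in Path(root).rglob("*.nc"):
        parts = nc.stem.split("_")
        if len(parts) < 7:
            continue  # skip names that do not follow the assumed pattern
        variable, model, realization, frequency = parts[0], parts[3], parts[5], parts[6]
        table[(model, realization)].add(f"{variable} ({frequency})")
    return table

for (model, real), cells in sorted(availability("/work/CORDEX-FPS-URB-RCC").items()):
    print(model, real, f"{len(cells)} variable/frequency combinations")
```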

Some problems do not appear, because I already fixed some mismatches in my script (not in the data!) and reported them as issues here (e.g. #15, #16, #17, #18).
