
Inconsistent directory hierarchy #11

Open
jesusff opened this issue Nov 21, 2024 · 12 comments
Labels
problem with data Known problem in Stage-0 data stored at DKRZ

Comments


jesusff commented Nov 21, 2024

The directory hierarchy is inconsistent at the moment, with half of the simulations providing a version directory (e.g. v20241121) between the variable folder and the actual files:

[image: CORDEX_FPSURBRCC_DKRZ_varlist]

And the other half not including it:

[image: CORDEX_FPSURBRCC_DKRZ_varlist_noversion]

The protocol says that this version folder should be present (but there are inconsistent examples). In any case, we need to make a decision and change the file hierarchy and/or the protocol accordingly.
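
A minimal sketch of how such inconsistencies could be spotted, assuming the data root below is a placeholder and that version folders follow the vYYYYMMDD pattern; this is not the actual check used at DKRZ:

```python
# Hypothetical sketch: walk a data root and report directories that hold netCDF files
# directly, without a vYYYYMMDD version folder above them.
import re
from pathlib import Path

VERSION_RE = re.compile(r"^v\d{8}$")  # e.g. v20241121

def dirs_missing_version(root: str):
    """Yield parent directories of .nc files that are not a version folder."""
    seen = set()
    for nc in Path(root).rglob("*.nc"):
        parent = nc.parent
        if parent not in seen and not VERSION_RE.match(parent.name):
            seen.add(parent)
            yield parent

for d in dirs_missing_version("/work/CORDEX-FPS-URB-RCC"):  # placeholder root
    print(d)
```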

Apart from that, there are many other inconsistencies regarding the variable frequencies, version_realization strings, source_ids including the institution, ... (check here for your institution). All this needs to be fixed ASAP so we can go on with the analyses.

jesusff added the problem with data (Known problem in Stage-0 data stored at DKRZ) label on Nov 21, 2024
@SiuSunChun

I agree with Chus. It will be helpful for us to document our outputs in our community paper.


jesusff commented Dec 2, 2024

The files tracking the status of the stored output have just been updated (607be13). I paste here the variables available for those simulations not including the version folder:

[image: variables available for simulations without a version folder]

Those with the version folder appear here and are routinely updated.


jesusff commented Dec 3, 2024

The protocol has been made consistent and all simulation paths should include the version folder.


LluisFB commented Dec 3, 2024

Should all the paths contain the same version, or can it be institution-dependent?


jesusff commented Dec 4, 2024

This is institution-dependent, as it is a representative date for a given simulation output: the day it was finished, the day you started uploading the data to the server, or any other meaningful date. It is important not to mix dates if the data come in the same batch, e.g. do not assign a different version to a given file just because it was uploaded one day after the rest. Decide on a date for a given batch and use it consistently.
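
A small sketch of the "one version per batch" rule, with illustrative placeholder folder and file names rather than the actual DRS:

```python
# Hypothetical sketch: every file from the same batch gets the same vYYYYMMDD tag,
# even if some files are uploaded a day later than the rest.
from pathlib import Path

batch_version = "v20241204"  # one representative date chosen for the whole batch

def versioned_path(variable_dir: str, filename: str) -> Path:
    """Place a file under its variable folder and the single version folder of its batch."""
    return Path(variable_dir) / batch_version / filename

print(versioned_path("tas", "tas_placeholder_day_200001010000-200012312300.nc"))
```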


LluisFB commented Dec 4, 2024

Thanks, much clearer now!
It seems one needs a PhD in CMORization to get it all right ;)


jesusff commented Dec 4, 2024

One needs a PhD and several postdocs on the topic to get it nearly right...

I managed to get at least the paths close to right (in the fixed-drs-paths branch), rebuilding the directory structure and linking the files from the faulty one. With this, I could plot all simulations together:

https://raw.githubusercontent.com/FPS-URB-RCC/STAGE-0_Analysis/refs/heads/fixed-drs-path/docs/CORDEX_FPSURBRCC_DKRZ_varlist.png

The image is devastating, as the matrix should be packed with blue squares. It makes it easy to spot naming problems: if your model contributes to making the matrix sparse, you very likely have problems with your data.
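
A minimal sketch of the kind of re-linking described above, under assumed paths and an assumed default version tag; this is not the actual fixed-drs-paths implementation:

```python
# Build a corrected tree of symlinks next to the faulty one, inserting a version folder
# where it is missing, so the original data are left untouched.
import os
from pathlib import Path

def relink_with_version(src_root: str, dst_root: str, version: str = "v20241203"):
    """Symlink every netCDF file into a parallel tree that always has a version folder."""
    for nc in Path(src_root).rglob("*.nc"):
        parts = list(nc.relative_to(src_root).parts)
        # Insert the version folder if the file's parent is not already one (assumed vYYYYMMDD).
        if len(parts) < 2 or not parts[-2].startswith("v2"):
            parts.insert(-1, version)
        target = Path(dst_root).joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            os.symlink(nc.resolve(), target)

# relink_with_version("/work/stage0", "/work/stage0_fixed")  # placeholder roots
```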


yoselita commented Dec 5, 2024

@jesusff one suggestion would be to create the same table with only the variables requested by fpsurbrcc, following the data-request table, and to compare what we have with what we should have. The groups shared non-requested variables, and with different names, which I guess explains so many gaps.

jesusff added a commit that referenced this issue Dec 5, 2024

LluisFB commented Dec 5, 2024

So, to understand the figure: the more blue, the better the post-processing?
If a cell is not colored, does that mean not that the variable is missing, but that there is an issue with one of the N requirements of the CMORization?


LluisFB commented Dec 5, 2024

Also, maybe this should be another thread, but again: to tell whether a variable is instantaneous or statistically processed, does one need to open the file and retrieve this information from the attributes (cell_methods, etc.)? It cannot be inferred directly from the file name.
For example, in ESGF, monthly values are recognized as:
tasmax_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc
This is not our case.


yoselita commented Dec 5, 2024

All the experiments provide a table with the list of variables required by the experiment (CORDEX, FPSCONV, FPS-URBAN, etc.). In the table for FPS-URBAN, it is explicitly indicated in which form each variable is required (column ag). Sometimes the model provides instantaneous values, but the accumulated form is required. In WRF, an example of this is the surface fluxes: in the case of hfss, the variable is calculated from HFX as [HFX(time1)+HFX(time2)]/2.
The idea is that everybody follows these specifications, and the final user will know by looking at the table (column ag) whether the variable is accumulated or instantaneous. Everybody should do the same, so that variables are comparable.
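
An illustrative sketch only, using plain NumPy and assumed variable names, of how the [HFX(time1)+HFX(time2)]/2 rule above turns instantaneous WRF HFX output into the averaged hfss requested by the table:

```python
# Average consecutive instantaneous values onto the intervals between output times.
import numpy as np

def hfss_from_hfx(hfx_inst: np.ndarray) -> np.ndarray:
    """Time-average instantaneous HFX onto output intervals: [HFX(t1)+HFX(t2)]/2."""
    return 0.5 * (hfx_inst[:-1] + hfx_inst[1:])

hfx_inst = np.array([120.0, 140.0, 100.0])  # instantaneous sensible heat flux (W m-2)
print(hfss_from_hfx(hfx_inst))              # [130. 120.]
```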

I think the example you are showing is for a GCM, where file naming is a bit different. I believe that in Amon, the A refers to the atmospheric part of the GCM and mon is the monthly frequency.


jesusff commented Dec 5, 2024

> @jesusff one suggestion would be to create the same table with only the variables requested by fpsurbrcc, following the data-request table, and to compare what we have with what we should have. The groups shared non-requested variables, and with different names, which I guess explains so many gaps.

Yes, this would lead to a cleaner table. I will do it (#19). However, this will hide not only extra variables/frequencies but also wrong names, which we still need to fix.

> So, to understand the figure: the more blue, the better the post-processing?
> If a cell is not colored, does that mean not that the variable is missing, but that there is an issue with one of the N requirements of the CMORization?

The figure is not that advanced. Colored cells only indicate that a particular variable and frequency (x axis) is available for a particular model and realization (y axis). It only checks the filenames; many other things can go wrong inside the file. Still, if you look at the x axis, you will see many variables that were not requested and also some with wrong spelling. In the parts of the plot where your model is the only one showing blue (i.e. no other model provides that variable), there is likely a problem.
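
A rough sketch of how such an availability matrix can be built from file names alone; the root path and the underscore positions used to split the filename are assumptions, not the actual plotting script:

```python
# Map (model, realization) -> set of "variable (frequency)" strings, using filenames only.
from collections import defaultdict
from pathlib import Path

def availability(root: str):
    """Collect which variable/frequency combinations each model/realization provides."""
    table = defaultdict(set)
    for nc in Path(root).rglob("*.nc"):
        parts = nc.stem.split("_")
        if len(parts) < 7:
            continue  # skip names that do not follow the assumed pattern
        variable, model, realization, frequency = parts[0], parts[3], parts[5], parts[6]
        table[(model, realization)].add(f"{variable} ({frequency})")
    return table

for (model, real), cells in sorted(availability("/work/CORDEX-FPS-URB-RCC").items()):
    print(model, real, f"{len(cells)} variable/frequency combinations")
```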

Some problems do not appear, because I already fixed some mismatches in my script (not in the data!) and reported them as issues here (e.g. #15, #16, #17, #18).
