CFizer is a tool to make NetCDF output files from MONC (Met Office NERC cloud model) CF-1.10 compliant (Climate and Forecast metadata convention). It is also able to merge and split datasets according to the number of spatial dimensions, and apply compression to a user-specified level. Please note that although CFizer generates CF-1.10 compliant MONC output, the metadata is labelled as CF-1.8 so that - if required - it can be run through CEDA's cf-checker which, as of July 2024, checks up to CF-1.8 only.
CEDA provides an overview of the CF metadata convention.
Full details of the convention are available here.
Clone this repository and cd into it.
Create an empty Anaconda environment in which to install CFizer, and activate it:
conda create -n cfizer
conda activate cfizer
Alternatively, a pip venv may be used:
python -m venv /path/to/cfizer_env
source /path/to/cfizer_env/bin/activate
If using on JASMIN, load the jaspy environment: `module load jaspy/3.10/r20230718`. This contains all the required dependencies.
If not using on JASMIN, load the following packages in your cfizer conda environment (this has been tested on both ARC4 and ARCHER2):
- netcdf4 = 1.5.7
- xarray = 2022.3.0
- pyyaml = 6.0
- cfunits = 3.3.4
- numpy = 1.26.3
- dask = 2023.11.0
- openpyxl = 3.0.10
Important: Before installing CFizer, update the configuration file, `cfizer/config.yml` and, if necessary, the vocabulary (`cfizer/vocabulary.yml`) - see Setup below. Installation creates a copy of these in the virtual environment, so any subsequent changes either need to be made to the version in that environment, or require re-installation.
Install CFizer:
From the top level of the repository, run `pip install .`
The most important setup for users is to check/complete the `cfizer/vocabulary.yml` and `cfizer/config.yml` files. They define, respectively, how the MONC variables are to be modified for CF compliance, and parameters that should be uniform across a set of files, including the source of the original data.
At the beginning of `config.yml` are the global attributes that CF Conventions recommend be present in all datasets:
- `title`: Set automatically by CFizer, based on file names and any merge/split operations.
- `institution`: This should be provided in `config.yml`. "Specifies where the original data was produced."
- `source`: Details of how the data were generated, e.g. MONC version ... To be provided in `config.yml`.
- `history`: Any modifications to the data after generation. Set automatically by CFizer.
- `references`: To be provided in `config.yml`. "Published or web-based references that describe the data or methods used to produce it."
- `comment`: To be provided in `config.yml`. "Miscellaneous information about the data or methods used to produce it."
- `conventions`: Set automatically by CFizer.
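Before installing, it can be useful to confirm that the user-supplied attributes are filled in. The following is a minimal sketch of such a check, not part of CFizer itself. It assumes the configuration has already been loaded into a plain dictionary with the attribute names as top-level keys; the actual layout of `config.yml` may differ, and `missing_global_attributes` is a hypothetical helper.

```python
# Global attributes that the list above says the user must supply.
# Treating them as top-level keys of a flat dict is an assumption made
# for illustration; check the real layout of cfizer/config.yml.
USER_SUPPLIED = ("institution", "source", "references", "comment")


def missing_global_attributes(config):
    """Return the user-supplied attributes that are absent or blank."""
    missing = []
    for key in USER_SUPPLIED:
        value = config.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Placeholder values, standing in for a loaded configuration.
example_config = {
    "institution": "Example University",
    "source": "MONC",
    "references": "",
    # "comment" is deliberately omitted
}

print(missing_global_attributes(example_config))
```

An empty list means all four attributes have a non-blank value.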
To run the CFizer on a directory of MONC outputs:
cfize [options] <source_directory>
where `<source_directory>` is the path to the directory containing the NC files to be processed.
Option | Required (Y/N)? | Argument | Function | Example |
---|---|---|---|---|
`--target_dir`, `-t` | N | Target directory | Specify directory for processed NetCDF files. Default is to create a sibling directory to the source directory, appending +processed to the same name. | `-t /work/project/diagnostic_outputs/230228` |
`--reference_time`, `-r` | Y | Datetime in ISO format | Date or date-time of origin for time units, in ISO format and as a string (`"yyyy-mm-dd[{T/ }hh:mm[:ss][{+/-}hh:mm]]"`). If not specified, CFizer tries to find it in any input file it finds, and requests user input if it fails. If no input file is present and no reference time is specified or is found in existing time units, the software will exit with an error message. | `-r "2020-01-25 00:00+00:00"` |
`--calendar`, `-c` | N | Calendar | Calendar to use for time units. See CF Conventions 4.4.1 for valid options. Default: proleptic_gregorian | `--calendar proleptic_gregorian` |
`--cpus`, `-p` | N | Integer | Total number of cores to use for parallel processing (including controller process). Default: 1 (serial) | `-p 16` |
`--keep-interim`, `-i` | N | | Set if multiple dimension sets are being merged, but you want to keep the single-dimension-set files as well. By default, these redundant files are deleted. | |
`--verbose`, `-v` | N | | Set to report all progress and function timings to stdout. | |
`--quiet`, `-q` | N | | Set to suppress any warnings normally printed to stdout upon completion. | |
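To illustrate the reference-time format accepted by `-r`, here is a small sketch using only the Python standard library. It is not CFizer's own parsing code: `time_units_from_reference` is a hypothetical helper, and the choice of `seconds` as the unit is an assumption made for the example.

```python
from datetime import datetime


def time_units_from_reference(reference, unit="seconds"):
    """Build a CF-style time units string from an ISO reference time.

    Raises ValueError if the string is not a valid ISO date or date-time.
    """
    parsed = datetime.fromisoformat(reference)
    # Use a space between date and time, as in the README's own example.
    return f"{unit} since {parsed.isoformat(sep=' ')}"


# Date-time with a UTC offset, as in the example column of the table.
print(time_units_from_reference("2020-01-25 00:00+00:00"))

# A date alone is also valid; the time defaults to midnight.
print(time_units_from_reference("2020-01-25"))
```

Checking the string this way before submitting a batch job catches a malformed reference time early, rather than after the job has queued.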
#!/bin/bash
#SBATCH --partition=test
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -N 1
#SBATCH --job-name=cfizer_example
#SBATCH -o <run_directory>/%j.out
#SBATCH -e <run_directory>/%j.err
#SBATCH --time=60
#SBATCH --mem-per-cpu=16384
# module add jaspy/3.10/r20230718 # If not already running
# conda activate cfizer # If not already active
srun cfize -p 4 -v -i -t <output_directory> <directory_to_process>
The `vocab_from_xlsx` tool creates the required `vocabulary.yml` file from a Microsoft Excel spreadsheet, providing it is in the expected format:
- Columns:
  - `updated_name`
  - `units` (CF-compliant)
  - `axis` (CF-compliant)
  - `standard_name` (CF-compliant)
  - `long_name`
  - `dimension_changes` (in form `current:new`)
  - `perturbation_to_absolute` (value either `True` or `False`)
  - `reference_variable` (for `perturbation_to_absolute` only)
Any field can be left blank if not needed (e.g. if field is already CF-compliant or, in the case of variable names, if no change is needed).
Any additional columns will be ignored.
Because the vocabulary is organised by the number of spatial dimensions for a given variable (and output file), the spreadsheet should likewise be divided by dimensions. Before each set of variables should be a row containing only a dimensionality indicator in the first cell, e.g. `0D` for files/variables with only a time dimension and `3D` for files/variables with 3 spatial dimensions as well as time.
xlvocab <path_to_spreadsheet>
The new vocabulary YAML file will be created in the CFizer source code directory.
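The spreadsheet layout described above can be sketched as a conversion from rows to a nested dictionary keyed by dimensionality. This is an illustration only, not the `vocab_from_xlsx` implementation: it assumes the first cell of each variable row holds the variable's current MONC name, that the listed columns follow in the order given, and that rows arrive as plain lists (as a spreadsheet reader might supply them). The function name and the example variable names are hypothetical.

```python
import re

# Column order as listed above; the first spreadsheet column is assumed
# to hold the variable's current name (an assumption, not from the README).
COLUMNS = (
    "updated_name", "units", "axis", "standard_name", "long_name",
    "dimension_changes", "perturbation_to_absolute", "reference_variable",
)
DIM_MARKER = re.compile(r"^\d+[dD]$")  # matches e.g. "0D" or "3D"


def rows_to_vocabulary(rows):
    """Group variable rows under the dimensionality marker preceding them."""
    vocabulary = {}
    current = None
    for row in rows:
        cells = ["" if c is None else str(c).strip() for c in row]
        if not any(cells):
            continue  # skip completely blank rows
        first, rest = cells[0], cells[1:]
        if DIM_MARKER.match(first) and not any(rest):
            current = vocabulary.setdefault(first.upper(), {})
            continue
        if current is None:
            raise ValueError("variable row found before any dimension marker")
        entry = {}
        for name, value in zip(COLUMNS, rest):
            if not value:
                continue  # blank fields are simply left out
            if name == "dimension_changes":
                old, new = value.split(":", 1)  # "current:new"
                entry[name] = {old.strip(): new.strip()}
            elif name == "perturbation_to_absolute":
                entry[name] = value.lower() == "true"
            else:
                entry[name] = value
        current[first] = entry
    return vocabulary


# Hypothetical rows; the variable names are placeholders.
rows = [
    ["0D"],
    ["wmax", "w_max", "m s-1", None, None, "maximum vertical velocity"],
    ["3D", None, None],
    ["th", "theta", "K", None, "air_potential_temperature", None,
     "zn:z", True, "thref"],
]
vocab = rows_to_vocabulary(rows)
print(sorted(vocab))
print(vocab["3D"]["th"]["dimension_changes"])
```

Columns beyond those listed are ignored because `zip` stops at the end of `COLUMNS`, which mirrors the rule that additional columns are ignored.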
1.0.0 Initial release.
1.0.1 Allows backwards compatibility with Python 3.9, by removing `root_dir` option from `iglob` function calls.
1.0.2 Code reformatting using Black, for consistency, and added maintainer to pyproject.toml.
1.0.3 Includes the following fixes: 1) Adds checking for empty DsGroup filepath collections; 2) Adds check for availability of reference variables for converting perturbation values; 3) Corrects handling of Boolean values in the Excel-spreadsheet-to-vocabulary tool; 4) Makes some minor corrections and adds some debugging clarifications. README.md updated to include instructions for installation on HPC systems other than JASMIN. Added maintainer to pyproject.toml.
1.0.4 Although CFizer generates CF-1.10 compliant MONC output, `CF_VERSION` in startup.py has been relabelled as 1.8 instead of 1.10 (this label becomes part of the CFizer'd output metadata) so that output can be run through CEDA's cf-checker, which currently checks up to CF-1.8 - see issue #58 for further details.
1.0.5 Included fix to cfize_ds.py in order to preserve time dimension when splitting datasets.