Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

unreleased

Added

add github PR template to guide development process on github #44, @leifdenby

This release adds support for an optional extra section in the config file (for user-defined extra information that is ignored by mllam-data-prep) and fixes a few minor issues. Note that to use extra section in the config file the schema version in the config file must be increased to v0.5.0.

Added

Add optional section called extra to config file to allow for user-defined extra information that is ignored by mllam-data-prep but can be used by downstream applications. , @leifdenby

Changed

remove f-string from name_format in config examples #35
replace global config for dataclass_wizard on mllam_data_prep.config.Config with config specific to that dataclass (to avoid conflicts with other uses of dataclass_wizard) #36
Schema version bumped to v0.5.0 to match release version that supports optional extra section in config #18

v0.4.0

All changes

This release adds support for defining the output path in the command line interface and addresses bugs around optional dependencies for dask.distributed.

Added

add optional output path argument to parser.

Changed

fix bug by making dependency distributed optional
change config example to call validation split val instead of validation #28
fix typo in install dependency distributed
add missing psutil requirement. #21.

v0.3.0

All changes

Added

add support for parallel processing using dask.distributed with command line flags --dask-distributed-local-core-fraction and --dask-distributed-local-memory-fraction to control the number of cores and memory to use on the local machine.

v0.2.0

All changes

Added

add support for creating dataset splits (e.g. train, validation, test) through output.splitting section in the config file, and support for optionally compute statistics for a given split (with output.splitting.splits.{split_name}.compute_statistics). .
include units and long_name attributes for all stacked variables as {output_variable}_units and {output_variable}_long_name .
include version of mllam-data-prep in output

Changed

split dataset creation and storage to zarr into separate functions mllam_data_prep.create_dataset(...) and mllam_data_prep.create_dataset_zarr(...) respectively
changes to spec from v0.1.0:
- the architecture section has been renamed output to make it clearer that this section defines the properties of the output of mllam-data-prep
- sampling_dim removed from output (previously architecture) section of spec, this is not needed to create the training data
- the variables (and their dimensions) of the output definition has been renamed from architecture.input_variables to output.variables
- coordinate value ranges for the dimensions of the output (i.e. what that the architecture expects as input) has been renamed from architecture.input_ranges to output.coord_ranges to make the use more clear
- selection on variable coordinates values is now set with inputs.{dataset_name}.variables.{variable_name}.values rather than inputs.{dataset_name}.variables.{variable_name}.sel
- when dimension-mapping method stack_variables_by_var_name is used the formatting string for the new variable is now called name_format rather than name
- when dimension-mapping is done by simply renaming a dimension this configuration now needs to be set by providing the named method (rename) explicitly through the method key, i.e. rather than {to_dim}: {from_dim} it is now {to_dim}: {method: rename, dim: {from_dim}} to match the signature of the other dimension-mapping methods.
- attribute inputs.{dataset_name}.name attribute has been removed, with the key dataset_name this is superfluous
relax minimuim python version requirement to >3.8 to simplify downstream usage

v0.1.0

First tagged release of mllam-data-prep which includes functionality to declaratively (in a yaml-config file) describe how the variables and coordinates of a set of zarr-based source datasets are mapped to a new set of variables with new coordinates to single a training dataset and write this resulting single dataset to a new zarr dataset. This explicit mapping gives the flexibility to target different different model architectures (which may require different inputs with different shapes between architectures).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHANGELOG.md

CHANGELOG.md

Changelog

unreleased

Added

v0.5.0

Added

Changed

v0.4.0

Added

Changed

v0.3.0

Added

v0.2.0

Added

Changed

v0.1.0

Files

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

unreleased

Added

v0.5.0

Added

Changed

v0.4.0

Added

Changed

v0.3.0

Added

v0.2.0

Added

Changed

v0.1.0