Load multiple prepared datasets #6

JackKelly · 2022-02-18T08:50:22Z

Detailed Description

A large part of my hope for the ML research we're doing in 2022 is to train across multiple "types" of prepared dataset. For example:

Similar to what we used in December: satellite + PV + GSP + NWP (e.g. over Britain) (dataset v17)
Just satellite (e.g. over the ocean) (dataset v18)
Satellite + PV + global NWP over the UK, Italy and Malta (dataset v19)
(And, if the model overfits, then maybe try training on a few other video prediction datasets like "moving MNIST" or synthetic image sequences of clouds moving.)

Context

To train our models to predict future satellite imagery, we probably want to use the entire geographical extent of the satellite imagery.

But we also want to predict PV in the UK, Italy and Malta.

So we might want each batch to contain a mix of examples: some from the UK (as is the case now), and some from anywhere in the geo extent of the satellite imagery (including over oceans) without any PV.

At the moment, nowcasting_dataset can't do this "mixture".

The simplest way to do this might actually be to leave nowcasting_dataset mostly alone, and produce multiple different sets of batches (one set over the UK; the other set without PV data, and from the entire geo extent of the imagery). Then power_perceiver will load multiple batches at once. This has the advantage that we can quickly experiment with dynamically changing the ratio of "UK" to "non-UK" imagery as training progresses.

But this simpler approach still requires that we update nowcasting_dataset a bit (e.g. to randomly sample locations from the entire geo extent of the satellite imagery.)
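As a rough illustration of changing the ratio of "UK" to "non-UK" examples as training progresses, here is a minimal sketch. The function name, the linear ramp and the default fractions are all assumptions made up for this example; nothing like this exists in nowcasting_dataset or power_perceiver yet.

```python
def uk_example_count(step, total_steps, batch_size=32, start_frac=0.25, end_frac=0.75):
    """Hypothetical schedule: number of "UK" examples to put in a batch.

    Linearly ramps the UK fraction from start_frac to end_frac over training.
    The rest of the batch would be filled with "non-UK" (no PV) examples.
    """
    # Clamp progress to [0, 1] so steps beyond total_steps stay at end_frac.
    progress = min(max(step / total_steps, 0.0), 1.0)
    frac = start_frac + (end_frac - start_frac) * progress
    return round(batch_size * frac)


n_uk_start = uk_example_count(step=0, total_steps=1000)
n_uk_mid = uk_example_count(step=500, total_steps=1000)
n_uk_end = uk_example_count(step=1000, total_steps=1000)
```

Because the two sets of batches are prepared separately, a schedule like this only affects how many examples are taken from each set at load time, so trying a different ramp would not need the batches to be regenerated.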

Possible Implementation

Maybe implement a thin adaptor which holds multiple power_perceiver.NowcastingDataset instances, and itself inherits from torch.utils.data.Dataset. This thin adaptor would sample randomly from the upstream power_perceiver.NowcastingDataset instances and stack the Tensors. So, for example, if we're combining "just satellite" data and "satellite + PV + GSP + NWP" then, say, the first 16 examples in each batch would be "just satellite", and the first 16 examples for PV, GSP, and NWP would be zeros (and would be masked out before the batch goes into the Perceiver).
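A minimal sketch of what such an adaptor might look like is below. The class name MultiSourceBatches, the "_mask" key suffix and the toy data are hypothetical. It also assumes that each upstream dataset returns a dict of modality name to Tensor with a leading "example" dimension, and that each upstream batch holds at least as many examples as are requested from it; the real power_perceiver.NowcastingDataset interface may well differ.

```python
import random

import torch
from torch.utils.data import Dataset


class MultiSourceBatches(Dataset):
    """Hypothetical adaptor that mixes batches from several upstream datasets."""

    def __init__(self, sources, n_examples, seed=0):
        # sources: upstream map-style datasets (assumed interface: each item is
        # a dict of modality name -> Tensor with a leading "example" dimension).
        # n_examples: how many examples to take from each source per batch.
        assert len(sources) == len(n_examples)
        self.sources = sources
        self.n_examples = n_examples
        self.rng = random.Random(seed)

    def __len__(self):
        # Bound the epoch length by the shortest upstream dataset.
        return min(len(source) for source in self.sources)

    def __getitem__(self, idx):
        # idx is ignored: pick a random upstream batch from each source and
        # keep its first n examples.
        parts = []
        for source, n in zip(self.sources, self.n_examples):
            batch = source[self.rng.randrange(len(source))]
            parts.append({name: tensor[:n] for name, tensor in batch.items()})

        # The first source that has a modality provides its shape and dtype.
        templates = {}
        for part in parts:
            for name, tensor in part.items():
                templates.setdefault(name, tensor)

        combined = {}
        for name, template in templates.items():
            tensors, masks = [], []
            for part, n in zip(parts, self.n_examples):
                present = name in part
                if present:
                    tensors.append(part[name])
                else:
                    # Missing modality: fill with zeros, to be masked out later.
                    tensors.append(template.new_zeros((n, *template.shape[1:])))
                masks.append(torch.full((n,), present, dtype=torch.bool))
            combined[name] = torch.cat(tensors)
            combined[name + "_mask"] = torch.cat(masks)
        return combined


# Toy stand-ins for two sets of prepared batches (shapes are made up).
satellite_only = [{"satellite": torch.rand(32, 4, 8, 8)} for _ in range(3)]
uk = [{"satellite": torch.rand(32, 4, 8, 8), "pv": torch.rand(32, 4, 16)} for _ in range(5)]

mixed = MultiSourceBatches([satellite_only, uk], n_examples=[16, 16])
batch = mixed[0]
```

In this toy run the first 16 examples of batch["pv"] are zeros and batch["pv_mask"] is False for them, matching the "just satellite" half of the batch. Because each item is already a full batch, a DataLoader wrapping this would probably want batch_size=None; and with multiple workers, each worker would need its own random seed.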

peterdudfield · 2022-02-18T15:20:42Z

Sounds really great, and good to get it planned out.

I would love to get openclimatefix/nowcasting_dataset#562 into the new datasets, but unfortunately I've run out of time before my holiday.

JackKelly added the enhancement New feature or request label Feb 18, 2022

JackKelly self-assigned this Feb 18, 2022

JackKelly added this to Nowcasting Feb 18, 2022

JackKelly moved this to Todo in Nowcasting Feb 18, 2022

JackKelly mentioned this issue Feb 18, 2022

Enable batches where some examples are from outside the UK (without PV data), and other examples are from the UK (with PV) openclimatefix/nowcasting_dataset#564

Closed
