This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Load multiple prepared datasets #6

Open
JackKelly opened this issue Feb 18, 2022 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@JackKelly
Member

JackKelly commented Feb 18, 2022

Detailed Description

A large part of my hope for the ML research we're doing in 2022 is to train across multiple "types" of prepared dataset. For example:

  • Similar to what we used in December: satellite + PV + GSP + NWP (e.g. over Britain) (dataset v17);
  • just satellite (e.g. over the ocean) (dataset v18);
  • satellite + PV + global NWP over the UK, Italy and Malta (dataset v19);
  • (and, if the model overfits, maybe try training on a few other video prediction datasets like "moving MNIST" or synthetic image sequences of clouds moving).

Context

To train our models to predict future satellite imagery, we probably want to use the entire geographical extent of the satellite imagery.

But we also want to predict PV in the UK, Italy and Malta.

So we might want each batch to contain a mix of examples: some from the UK (as is the case now), and some from anywhere in the geo extent of the satellite imagery (including over oceans), without any PV.

At the moment, nowcasting_dataset can't do this "mixture".

The simplest way to do this might actually be to leave nowcasting_dataset mostly alone, and produce multiple different sets of batches (one set over the UK; the other set without PV data, and from the entire geo extent of the imagery). Then power_perceiver will load multiple batches at once. This has the advantage that we can quickly experiment with dynamically changing the ratio of "UK" to "non-UK" imagery as training progresses.
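As a rough illustration of changing the "UK" to "non-UK" ratio during training, here is a minimal sketch of a linear schedule. The function names, the start and end fractions (0.2 and 0.8), and the "uk" / "non_uk" labels are all hypothetical; nothing like this exists in nowcasting_dataset or power_perceiver yet.

```python
import random


def uk_fraction(step: int, total_steps: int, start: float = 0.2, end: float = 0.8) -> float:
    """Linearly interpolate the fraction of UK batches from `start` to `end`.

    `start` and `end` are illustrative defaults, not tuned values.
    """
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay on the schedule.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def pick_batch_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Return "uk" or "non_uk" for this training step, following the schedule."""
    return "uk" if rng.random() < uk_fraction(step, total_steps) else "non_uk"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 999):
        print(step, round(uk_fraction(step, 1000), 3), pick_batch_source(step, 1000, rng))
```

Because the two sets of batches are prepared separately, a schedule like this could be swapped for any other (constant, stepwise, and so on) without regenerating data.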

But this simpler approach still requires that we update nowcasting_dataset a bit (e.g. to randomly sample locations from the entire geo extent of the satellite imagery).

Possible Implementation

Maybe implement a thin adaptor which holds multiple power_perceiver.NowcastingDataset instances, and itself inherits from torch.utils.data.Dataset. This thin adaptor would sample randomly from the upstream power_perceiver.NowcastingDataset instances and stack the Tensors. So, for example, if we're combining "just satellite" data and "satellite + PV + GSP + NWP" then, say, the first 16 examples in each batch would be "just satellite", and the first 16 examples for PV, GSP, and NWP would be zeros (and would be masked out before they go into the Perceiver).
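A minimal sketch of such an adaptor might look like the following. It assumes each upstream dataset returns one prepared batch as a dict mapping a modality name to a Tensor whose first dimension is the example dimension. That return format, the class name `MixedBatchDataset`, and the `<name>_mask` keys are assumptions for illustration, not the actual power_perceiver API. For simplicity it reads the same index from every upstream dataset and leaves shuffling to the DataLoader.

```python
import torch
from torch.utils.data import Dataset


class MixedBatchDataset(Dataset):
    """Hypothetical adaptor: concatenates one batch from each upstream dataset.

    Modalities missing from an upstream batch are filled with zeros, and a
    boolean mask per modality records which examples hold real data.
    """

    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __len__(self):
        # Only indices available in every upstream dataset can be served.
        return min(len(ds) for ds in self.datasets)

    def __getitem__(self, idx):
        batches = [ds[idx] for ds in self.datasets]

        # Use the first batch that has a modality as the template for the
        # shape and dtype of that modality.
        templates = {}
        for batch in batches:
            for key, tensor in batch.items():
                templates.setdefault(key, tensor)

        out = {}
        for key, template in templates.items():
            parts, masks = [], []
            for batch in batches:
                n_examples = next(iter(batch.values())).shape[0]
                if key in batch:
                    parts.append(batch[key])
                    masks.append(torch.ones(n_examples, dtype=torch.bool))
                else:
                    # Zero placeholder, to be masked out before the model.
                    parts.append(template.new_zeros((n_examples, *template.shape[1:])))
                    masks.append(torch.zeros(n_examples, dtype=torch.bool))
            out[key] = torch.cat(parts)
            out[f"{key}_mask"] = torch.cat(masks)
        return out
```

If the "just satellite" dataset is listed first and each upstream batch has 16 examples, the first 16 examples of PV, GSP, and NWP in the combined batch are zeros with a mask of `False`, matching the layout described above.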

@peterdudfield
Contributor

Sounds really great, and good to get it planned out.

I would love to get openclimatefix/nowcasting_dataset#562 into the new datasets, but unfortunately I've run out of time before my holiday.

Status: Todo

2 participants