
Add training torchdataset #99

Open · peterdudfield opened this issue Dec 20, 2024 · 6 comments

peterdudfield (Contributor) commented Dec 20, 2024
We need a torch dataset that loads premade batches. This could live in PVNet, but I think it makes sense for it to live in this repo.

The idea is to create a torch dataset that loads netCDF samples and converts them to torch tensors, ready to be used by PVNet.

Credit to @Sukh-P for this suggestion

from glob import glob

import xarray as xr
from torch.utils.data import Dataset

# NumpyBatch and uncombine_from_single_dataset come from ocf_datapipes;
# dict_of_arrays_to_numpy_batch would need to be added to ocf-data-sampler.


class NetCDFPremadeSamplesDataset(Dataset):
    """Dataset to load NumpyBatch samples"""

    def __init__(self, sample_dir):
        """Dataset to load NumpyBatch samples

        Args:
            sample_dir: Path to the directory of pre-saved samples.
        """
        self.sample_paths = glob(f"{sample_dir}/*.nc")

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        ds = xr.open_dataset(self.sample_paths[idx])
        return self.convert_to_numpy_batch(ds)

    @staticmethod
    def convert_to_numpy_batch(ds: xr.Dataset) -> NumpyBatch:
        # This function is from ocf_datapipes
        da_dict = uncombine_from_single_dataset(ds)
        # This function would need to be added in ocf-data-sampler; it would look
        # similar to the process-and-combine step, except it would not need the
        # normalisation steps
        numpy_batch = dict_of_arrays_to_numpy_batch(da_dict)
        return numpy_batch
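For reference, a rough sketch of what the proposed dict_of_arrays_to_numpy_batch helper could look like (hypothetical; the real implementation would mirror the process-and-combine step in ocf-data-sampler, just without the normalisation):

import numpy as np
import xarray as xr


def dict_of_arrays_to_numpy_batch(da_dict: dict[str, xr.DataArray]) -> dict:
    """Hypothetical sketch: convert a dict of DataArrays to a NumpyBatch-style dict.

    Assumes a NumpyBatch is essentially a dict of numpy arrays keyed by data
    source, and that the pre-saved samples are already normalised.
    """
    return {key: da.values.astype(np.float32) for key, da in da_dict.items()}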

And then in PVNet we would have to update it to something like this:

def _get_premade_samples_dataset(self, subdir) -> Dataset:
    split_dir = f"{self.sample_dir}/{subdir}"
    return NetCDFPremadeSamplesDataset(split_dir)
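As a usage sketch (illustrative, not from the issue), assuming the PVNet datamodule wraps the dataset in a standard DataLoader; batch_size=None because each pre-saved file is already a full batch:

from torch.utils.data import DataLoader

# Illustrative only: how the datamodule might consume the dataset.
# batch_size=None disables automatic collation, since each file on disk
# is already a pre-made batch.
train_dataset = NetCDFPremadeSamplesDataset("/path/to/samples/train")
train_loader = DataLoader(train_dataset, batch_size=None, num_workers=4)

for numpy_batch in train_loader:
    ...  # feed the batch to the PVNet model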
peterdudfield (Contributor, Author) commented Dec 20, 2024

It would be good to also add datetime features here. Or should these be added to the batch?

dfulu (Member) commented Dec 23, 2024

I'm not sure we need this, or that it makes sense to go here. We proposed making batch (sample) classes in #71. These classes should implement .save(), .load() and .to_numpy() methods.

Then in PVNet all we need is something like

from glob import glob

from torch.utils.data import Dataset


class PresavedSamplesDataset(Dataset):

    def __init__(self, sample_dir, sample_class: SampleAbstractClass):
        """Dataset to load pre-saved samples via a sample class

        Args:
            sample_dir: Path to the directory of pre-saved samples.
            sample_class: Sample class implementing .load() and .to_numpy().
        """
        self.sample_paths = glob(f"{sample_dir}/*")
        self.sample_class = sample_class

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        sample = self.sample_class.load(self.sample_paths[idx])
        return sample.to_numpy()
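For context, a minimal sketch of the interface such a sample class might expose under the #71 proposal (the names and signatures here are assumptions, not the agreed design):

from abc import ABC, abstractmethod


class SampleAbstractClass(ABC):
    """Hypothetical interface for the sample classes proposed in #71."""

    @abstractmethod
    def save(self, path: str) -> None:
        """Write this sample to disk."""

    @classmethod
    @abstractmethod
    def load(cls, path: str) -> "SampleAbstractClass":
        """Read a sample back from disk."""

    @abstractmethod
    def to_numpy(self) -> dict:
        """Convert the sample to a dict of numpy arrays for the model."""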

peterdudfield (Contributor, Author) commented Dec 23, 2024

I guess the idea was to have it in ocf-data-sampler instead of PVNet. This keeps ML model code in PVNet and the samples stuff here, which means we can test it all here, etc.

I agree #71 will be a nice tidy-up for all of these.

dfulu (Member) commented Dec 23, 2024

Yeah, I see what you mean, but I'm not sure where the clean divide is.

In PVNet we have the save_samples.py script, which defines the output directory structure for the pre-saved samples. We also have the datamodule, which reads from that directory structure. So we could move this Dataset here, but without moving the others (which I don't think we should do) I'm not sure it makes the divide much cleaner.

I have also been thinking about cross-validation and how we can split the samples after we've saved them. Currently in our backtest we have a model which was trained on 2019-2022 and validated on 2023. We make predictions on 2019-2023 with it for the backtest, so the backtest results are overfit. If we want to run cross-validation for the backtest, we would need to filter the pre-saved training samples by time (see the sketch below). That means we'd need to string together the save_samples.py script, the Datamodule and the Dataset in some way to do this. So I actually do think it's less complex to have the Dataset in PVNet. Plus the ability to split data and do cross-validation is conceptually an ML code thing rather than a samples thing.
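A rough sketch (not from the issue) of what that time-based filtering could look like, assuming each pre-saved sample's filename encodes its init time; the filename format here is hypothetical:

from datetime import datetime
from glob import glob


def filter_sample_paths_by_time(sample_dir: str, start: datetime, end: datetime) -> list[str]:
    """Hypothetical: keep only samples whose filename timestamp falls in [start, end).

    Assumes filenames like '20210101T120000.nc'; real pre-saved samples may
    encode time differently (or not at all), in which case each sample would
    need to be opened to read its time coordinate.
    """
    kept = []
    for path in glob(f"{sample_dir}/*.nc"):
        stem = path.rsplit("/", 1)[-1].removesuffix(".nc")
        t = datetime.strptime(stem, "%Y%m%dT%H%M%S")
        if start <= t < end:
            kept.append(path)
    return kept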

peterdudfield (Contributor, Author) commented

I agree something should be done with cross-validation.

I think we should move the Dataset for loading pre-saved samples either all here, or all into PVNet. I think the current one for PVNet UK regional is in PVNet, and it is much simpler, as all it does is load the .pt file (I think).
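(For reference, a minimal sketch of that simpler .pt-loading pattern; illustrative only, the actual PVNet code may differ:)

from glob import glob

import torch
from torch.utils.data import Dataset


class PTPremadeSamplesDataset(Dataset):
    """Hypothetical sketch of the simpler .pt-based dataset pattern."""

    def __init__(self, sample_dir):
        self.sample_paths = glob(f"{sample_dir}/*.pt")

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        # torch.load returns the saved tensors exactly as they were written
        return torch.load(self.sample_paths[idx])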

If we decide to keep it in PVNet, then we should at least put an example in here, just to show how we can use it. Maybe something we can chat about in one of the first ML meetings. Good to think about positives and negatives, etc.

dfulu (Member) commented Dec 23, 2024

I have a feeling that once we have created the sample class with a .load() function, it'll be pretty trivial and won't need an example here. But yeah, let's talk about it in the new year.
