
Add training torchdataset #99

Open · peterdudfield opened this issue Dec 20, 2024 · 6 comments

peterdudfield (Contributor) commented Dec 20, 2024
We need a torch dataset that loads premade batches. This could live in PVNet, but I think it makes sense for it to live in this repo.

The idea is to create a torch dataset that loads netCDF samples and converts them to torch tensors, ready to be used by PVNet.

Credit to @Sukh-P for this suggestion

from glob import glob

import xarray as xr
from torch.utils.data import Dataset

# NumpyBatch and uncombine_from_single_dataset come from ocf_datapipes;
# dict_of_arrays_to_numpy_batch would need to be added to ocf-data-sampler.


class NetCDFPremadeSamplesDataset(Dataset):
    """Dataset to load NumpyBatch samples"""

    def __init__(self, sample_dir):
        """Dataset to load NumpyBatch samples

        Args:
            sample_dir: Path to the directory of pre-saved samples.
        """
        self.sample_paths = glob(f"{sample_dir}/*.nc")

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        ds = xr.open_dataset(self.sample_paths[idx])
        return self.convert_to_numpy_batch(ds)

    @staticmethod
    def convert_to_numpy_batch(ds: xr.Dataset) -> NumpyBatch:
        # This function is from ocf_datapipes
        da_dict = uncombine_from_single_dataset(ds)
        # This function would need to be added in ocf-data-sampler; it would look
        # similar to the process-and-combine step, except it would not need the
        # normalisation steps
        numpy_batch = dict_of_arrays_to_numpy_batch(da_dict)
        return numpy_batch
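For reference, a rough sketch of what the proposed dict_of_arrays_to_numpy_batch helper could look like (hypothetical; the real implementation would mirror the process-and-combine step in ocf-data-sampler, just without the normalisation):

import numpy as np
import xarray as xr


def dict_of_arrays_to_numpy_batch(da_dict: dict[str, xr.DataArray]) -> dict:
    """Hypothetical sketch: convert a dict of DataArrays to a NumpyBatch-style dict.

    Assumes a NumpyBatch is essentially a dict of numpy arrays keyed by data
    source, and that the pre-saved samples are already normalised.
    """
    return {key: da.values.astype(np.float32) for key, da in da_dict.items()}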

And then in PVNet we would have to update it to something like this:

def _get_premade_samples_dataset(self, subdir) -> Dataset:
    split_dir = f"{self.sample_dir}/{subdir}"
    return NetCDFPremadeSamplesDataset(split_dir)
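As a usage sketch (illustrative, not from the issue), assuming the PVNet datamodule wraps the dataset in a standard DataLoader; batch_size=None because each pre-saved file is already a full batch:

from torch.utils.data import DataLoader

# Illustrative only: how the datamodule might consume the dataset.
# batch_size=None disables automatic collation, since each file on disk
# is already a pre-made batch.
train_dataset = NetCDFPremadeSamplesDataset("/path/to/samples/train")
train_loader = DataLoader(train_dataset, batch_size=None, num_workers=4)

for numpy_batch in train_loader:
    ...  # feed the batch to the PVNet model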
peterdudfield (Contributor, Author) commented Dec 20, 2024

It would be good to also add datetime features here. Or should these be added to the batch?

dfulu (Member) commented Dec 23, 2024

I'm not sure we need this, or that it makes sense to go here. We proposed making batch (sample) classes in #71. These classes should implement .save(), .load() and .to_numpy() methods.

Then in PVNet all we need is something like

from glob import glob

from torch.utils.data import Dataset


class PresavedSamplesDataset(Dataset):

    def __init__(self, sample_dir, sample_class: SampleAbstractClass):
        """Dataset to load pre-saved samples via a sample class

        Args:
            sample_dir: Path to the directory of pre-saved samples.
            sample_class: Sample class implementing .load() and .to_numpy().
        """
        self.sample_paths = glob(f"{sample_dir}/*")
        self.sample_class = sample_class

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        sample = self.sample_class.load(self.sample_paths[idx])
        return sample.to_numpy()
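For context, a minimal sketch of the interface such a sample class might expose under the #71 proposal (the names and signatures here are assumptions, not the agreed design):

from abc import ABC, abstractmethod


class SampleAbstractClass(ABC):
    """Hypothetical interface for the sample classes proposed in #71."""

    @abstractmethod
    def save(self, path: str) -> None:
        """Write this sample to disk."""

    @classmethod
    @abstractmethod
    def load(cls, path: str) -> "SampleAbstractClass":
        """Read a sample back from disk."""

    @abstractmethod
    def to_numpy(self) -> dict:
        """Convert the sample to a dict of numpy arrays for the model."""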

peterdudfield (Contributor, Author) commented Dec 23, 2024

I guess the idea was to have it in ocf-data-sampler instead of PVNet. This keeps ML model code in PVNet and the samples stuff here, which means we can test it all here, etc.

I agree #71 will be a nice tidy-up for all of these.

dfulu (Member) commented Dec 23, 2024

Yeah, I see what you mean, but I'm not sure where the clean divide is.

In PVNet we have the save_samples.py script, which defines the output directory structure for the pre-saved samples. We also have the datamodule, which reads from that directory structure. So we could move this Dataset here, but without moving the others (which I don't think we should do) I'm not sure it makes the divide much cleaner.

I have also been thinking about cross-validation and how we can split the samples after we've saved them. Currently in our backtest we have a model which was trained on 2019-2022 and validated on 2023. We make predictions on 2019-2023 with it for the backtest, so the backtest results are overfit. If we want to run cross-validation for the backtest, we would need to filter the pre-saved training samples by time (see the sketch below). That means we'd need to string together the save_samples.py script, the Datamodule and the Dataset in some way to do this. So I actually do think it's less complex to have the Dataset in PVNet. Plus the ability to split data and do cross-validation is conceptually an ML code thing rather than a samples thing.
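A rough sketch (not from the issue) of what that time-based filtering could look like, assuming each pre-saved sample's filename encodes its init time; the filename format here is hypothetical:

from datetime import datetime
from glob import glob


def filter_sample_paths_by_time(sample_dir: str, start: datetime, end: datetime) -> list[str]:
    """Hypothetical: keep only samples whose filename timestamp falls in [start, end).

    Assumes filenames like '20210101T120000.nc'; real pre-saved samples may
    encode time differently (or not at all), in which case each sample would
    need to be opened to read its time coordinate.
    """
    kept = []
    for path in glob(f"{sample_dir}/*.nc"):
        stem = path.rsplit("/", 1)[-1].removesuffix(".nc")
        t = datetime.strptime(stem, "%Y%m%dT%H%M%S")
        if start <= t < end:
            kept.append(path)
    return kept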

peterdudfield (Contributor, Author) commented

I agree something should be done with cross-validation.

I think we should move the Dataset for loading pre-saved samples either all here, or all into PVNet. I think the current one for PVNet UK regional is in PVNet, and it is much simpler, as all it does is load the .pt file (I think).
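(For reference, a minimal sketch of that simpler .pt-loading pattern; illustrative only, the actual PVNet code may differ:)

from glob import glob

import torch
from torch.utils.data import Dataset


class PTPremadeSamplesDataset(Dataset):
    """Hypothetical sketch of the simpler .pt-based dataset pattern."""

    def __init__(self, sample_dir):
        self.sample_paths = glob(f"{sample_dir}/*.pt")

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        # torch.load returns the saved tensors exactly as they were written
        return torch.load(self.sample_paths[idx])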

If we decide to keep it in PVNet, then we should at least put an example in here, just to show how we can use it. Maybe something we can chat about in one of the first ML meetings. Good to think about positives and negatives, etc.

dfulu (Member) commented Dec 23, 2024

I have a feeling that once we have created the sample class with a .load() function, it'll be pretty trivial and won't need an example here. But yeah, let's talk about it in the new year.
