Add schema for Generation data #46

Open
peterdudfield opened this issue Sep 6, 2024 · 11 comments
Assignees
Labels
discussion enhancement New feature or request

Comments

@peterdudfield
Contributor

Detailed Description

It would be great to add a schema for the generation data

Context

  • good to know what the data should look like

Possible Implementation

  • descriptions
  • screenshot of xarray dataset
  • pydantic model? for metadata?
@Sukh-P Sukh-P changed the title from "Add scheme for Generation data" to "Add schema for Generation data" Sep 6, 2024
@Sukh-P
Member

Sukh-P commented Sep 30, 2024

Going to put some thoughts here on this. First off, the bits for using generation data (PV or wind) are not in this repo yet. There is GSP data loading, but I think that is different enough to be outside the scope of this (we can always revisit that). So this schema can start to be used when we add the PV Site Dataset in, and the info for it can be added then as well.

Reason for having site/generation data schemas:

  • Easier and clearer for someone to understand what format the data needs to be in for ocf-data-sampler to “just work” with it
  • Less custom code needed in ocf-data-sampler to support slightly different formats (e.g. different power units); instead, input files are put into the schema format before being used with these libraries
  • With it, it would be easier to have one load function for either PV or wind (it feels like there is a lot of duplicated code between the two currently)

Proposed Schemas:

  • Metadata (currently csv format) columns, each row corresponds to a single site:

    • client_site_id: str - name of site in original site data shared
    • site_id: int (zero indexed values)
    • capacity_kw: float
    • latitude: float
    • longitude: float
    • asset_type: e.g. solar or wind
    • ml_id: optional int. This is currently added in if it doesn’t exist and set to -1.
  • Generation power data (currently netcdf file):

    • client_site_id: str - name of site in original site data shared
    • site_id: int, should correspond to the id in metadata
    • timestamp (utc): datetime64[ns] format, should have a granularity which is divisible by 5 mins
    • power_kw: float
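
To make the metadata columns above concrete, here is a minimal sketch of a row-level check using only the standard library (a pydantic model would look much the same). The class name `SiteMetadata`, the function `load_site_metadata`, the example rows, the allowed asset types, and the contiguity check on `site_id` are all assumptions for illustration, not something ocf-data-sampler provides. `ml_id` is left out here because the discussion below leans towards dropping it.

```python
import csv
import io
from dataclasses import dataclass

# Assumption: only the two asset types named as examples in the proposal.
ALLOWED_ASSET_TYPES = {"solar", "wind"}


@dataclass(frozen=True)
class SiteMetadata:
    """One row of the proposed metadata CSV (hypothetical class name)."""

    client_site_id: str
    site_id: int
    capacity_kw: float
    latitude: float
    longitude: float
    asset_type: str

    def __post_init__(self):
        # Basic range checks; the exact rules are assumptions, not agreed policy.
        if self.site_id < 0:
            raise ValueError(f"site_id must be >= 0, got {self.site_id}")
        if self.capacity_kw <= 0:
            raise ValueError(f"capacity_kw must be positive, got {self.capacity_kw}")
        if not -90 <= self.latitude <= 90:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180 <= self.longitude <= 180:
            raise ValueError(f"longitude out of range: {self.longitude}")
        if self.asset_type not in ALLOWED_ASSET_TYPES:
            raise ValueError(f"unknown asset_type: {self.asset_type!r}")


def load_site_metadata(csv_text: str) -> list[SiteMetadata]:
    """Parse CSV text into validated rows, one per site."""
    sites = [
        SiteMetadata(
            client_site_id=row["client_site_id"],
            site_id=int(row["site_id"]),
            capacity_kw=float(row["capacity_kw"]),
            latitude=float(row["latitude"]),
            longitude=float(row["longitude"]),
            asset_type=row["asset_type"],
        )
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    # Assumption: "zero indexed" means the ids are exactly 0..n-1.
    if sorted(s.site_id for s in sites) != list(range(len(sites))):
        raise ValueError("site_id values must be 0..n-1 with no gaps or repeats")
    return sites


# Made-up example data, purely for illustration.
EXAMPLE_CSV = """client_site_id,site_id,capacity_kw,latitude,longitude,asset_type
farm-a,0,5000.0,52.1,-1.3,solar
farm-b,1,12000.0,55.9,-3.2,wind
"""

sites = load_site_metadata(EXAMPLE_CSV)
```

A check like this could sit alongside a README description of the schema rather than replace it.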

I assumed MW would be the best power unit for this, since that is usually the size of the sites we deal with. I also removed tilt and orientation for now because I don't think these are actually used anywhere currently.

In terms of enforcing this schema, I think it could be done with a lighter-touch approach of documentation in a README, with comments in the config pointing to it, rather than automated schema validation (which could be achieved if we used something like this). I think the lighter-touch approach would be adequate for now, would make it clearer what is supported, and would lead to less ad hoc code. Keen to hear people's thoughts on this though. @peterdudfield @dfulu @AUdaltsova, thanks

@AUdaltsova
Contributor

re: ml_id, I am still a bit hazy on how this is different from site id (when we're talking pvnet-site), and I think all the data I used recently didn't have it. What am I missing?

@AUdaltsova
Contributor

AUdaltsova commented Sep 30, 2024

timestamp (utc): datetime64[ns] format, should have a granularity which is divisible by 5 mins

I think we don't require this anymore, but could be wrong

I really like the netcdf organisation! And thanks a lot for doing this :)

@AUdaltsova
Contributor

I also agree that being super-strict on this is not good right now. We will probably want stuff like tilt and orientation stored in the same files anyway, and there might be unforeseen differences we can't reconcile depending on whose data we're using.

@dfulu
Member

dfulu commented Sep 30, 2024

re: ml_id, I am still a bit hazy on how this is different from site id (when we're talking pvnet-site), and I think all the data I used recently didn't have it. What am I missing?

ml_id is not in general related to site-id at all. Previously we wanted to switch to using ml_id because the IDs had become a real mess in ocf_datapipes. Back then we wanted to use site-level PV history from a few hundred sites as an input to PVNet UK. We had a whole lot of IDs floating round at the time and couldn't match between a given PV site in the production database and in the training data - i.e. we couldn't tell which PV site was which. ml_id was used to allow us to do that, but it is a misnomer as it has nothing to do with ML really.

I think that currently ml_id would still be needed to map between the training and production PV sites in the UK but I haven't looked at this in a year. We also aren't using PV site data as an input to PVNet UK. So we may want to revisit whether we need ml_id or if we can use a single id for systems. I think having a schema could help with the transition to a single ID

@peterdudfield
Contributor Author

So I think it's good for ml_id to start very small, so our embedding dim can be small.

Sometimes it's useful to have a site id which maps to the system_id that the client has given us, so we know which site is which.

I think that was the motivation

I'd be tempted to

  • get rid of ml_id
  • use system id for what we embed on
  • add client_site_name:str which we can use as a mapping from client name/id to ml_id
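
As a rough sketch of the mapping idea, small zero-indexed ids could be derived from the client's own names like this. The function name `build_site_id_map` and the choice to sort names so the assignment is reproducible are assumptions for illustration, not an agreed approach.

```python
def build_site_id_map(client_site_names):
    """Map each distinct client site name to a small zero-indexed integer id.

    Sorting makes the assignment deterministic for a given set of names
    (an assumption; any stable ordering would do).
    """
    unique_names = sorted(set(client_site_names))
    return {name: idx for idx, name in enumerate(unique_names)}


# Made-up client names, purely for illustration.
site_id_map = build_site_id_map(["farm-b", "farm-a", "farm-b"])
```

Keeping the ids in the range 0..n-1 is what keeps the embedding dimension small, while the dictionary itself records which client site each id refers to.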

@Sukh-P
Member

Sukh-P commented Oct 1, 2024

Thanks for the comments all, that is really helpful!

I agree with those points and suggestions and have updated the proposed schemas above. The next step will be to add this info in and use these schemas for renewable generation data files when we add those datasets for PV and Wind here. Hopefully it will be useful in the long term!

@peterdudfield peterdudfield mentioned this issue Oct 31, 2024
10 tasks
@AUdaltsova
Contributor

By the way, do we store individual sites in separate netcdf files, or are all sites from, say, one project in one netcdf file?

@Sukh-P
Member

Sukh-P commented Nov 4, 2024

I would lean towards one netcdf for all sites; I think it's easiest to get the input datasets together that way
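
For illustration, a single in-memory xarray dataset holding every site might be laid out roughly as below. Only the field names come from the proposed schema. The dimension layout, the site names, and the power values are made up, and the 5-minute check reflects the original proposal, which may no longer be a requirement.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up example values; timestamps are assumed to be UTC.
timestamps = pd.date_range("2024-01-01 00:00", periods=4, freq="5min")
site_ids = np.array([0, 1])
power_kw = np.array(
    [
        [0.0, 1.5],
        [0.2, 1.7],
        [0.4, 1.6],
        [0.1, 1.2],
    ]
)

# One dataset for all sites: power_kw indexed by (timestamp, site_id),
# with client_site_id carried along as a coordinate on site_id.
ds = xr.Dataset(
    data_vars={"power_kw": (("timestamp", "site_id"), power_kw)},
    coords={
        "timestamp": timestamps,
        "site_id": site_ids,
        "client_site_id": ("site_id", np.array(["farm-a", "farm-b"])),
    },
)

# Check that every step between timestamps is a multiple of 5 minutes.
step_seconds = np.diff(ds["timestamp"].values).astype("timedelta64[s]").astype(int)
steps_ok = bool((step_seconds % 300 == 0).all())

# Writing to disk would then be a single call, e.g. ds.to_netcdf(<path>),
# which needs a netCDF backend installed.
```

Selecting one site is then `ds.sel(site_id=1)`, and the metadata CSV can be joined on `site_id`.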

Also, on the units, perhaps MW should be kW, as I think this is the most used power unit across our code base. Finally, I think we should stick to either "site" or "system". I have gone with "site" since I think that is what we use most now, and have updated the schema outline above to reflect this.

@AUdaltsova
Contributor

Ok thanks for clarifying!

@peterdudfield
Contributor Author

yea i like

  • 1 nc file
  • site
  • kW probably is easier at the moment, and the units should be in the variable name!
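
A small sketch of what putting the units in the variable name could look like when converting input data into the schema format. The helper names `to_power_kw` and `normalise_record`, and the `power_<unit>` naming for incoming fields, are assumptions for illustration.

```python
# Watts per unit, kept as integers so the conversion arithmetic stays exact.
WATTS_PER_UNIT = {"w": 1, "kw": 1_000, "mw": 1_000_000}


def to_power_kw(value, unit):
    """Convert a power value in W, kW or MW to kW."""
    try:
        watts_per_unit = WATTS_PER_UNIT[unit.lower()]
    except KeyError:
        raise ValueError(f"unsupported power unit: {unit!r}") from None
    return value * watts_per_unit / 1_000


def normalise_record(record):
    """Rename any hypothetical `power_<unit>` key to `power_kw`, converting its value."""
    out = {}
    for key, value in record.items():
        if key.startswith("power_"):
            unit = key[len("power_"):]
            out["power_kw"] = to_power_kw(value, unit)
        else:
            out[key] = value
    return out


# Made-up record, purely for illustration.
converted = normalise_record({"site_id": 0, "power_mw": 2.0})
```

Doing this once, before the data reaches the loader, is what lets the loader assume `power_kw` everywhere.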

@Sukh-P Sukh-P added the enhancement New feature or request label Nov 5, 2024

4 participants