Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PV system IDs #220

Closed
dfulu opened this issue Aug 9, 2023 · 5 comments
Closed

Fix PV system IDs #220

dfulu opened this issue Aug 9, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@dfulu
Copy link
Member

dfulu commented Aug 9, 2023

Describe the bug

When using the Passiv PV for training and loading it usingocf_datapipes.load.pv.pv.load_everything_into_ram(), the function takes the ss_id ID and renames it as pv_system_id in the xarray.Dataset returned. In the input metadata, there is also a column named system_id, which is not carried through to the xarray.Dataset. Both of the ss_id and system_id initially come from the sheffield solar API and they are different.

Currently, only the system_id ID is available in our database, although it is under the pv_system_id column there. Also, when we load from the database we encode this ID like n -> int(f"{n}1").

All this makes it quite difficult to match between the same systems in training and production. Being able to match will be important for adding Passiv inputs to PVNet as per openclimatefix/PVNet#68

Some potential fixes

  1. Stick to the ss_id
    • This would mean making ss_id available in the database
    • This could avoid us needing to encode the ID when loading from the database
    • I believe the ss_id is unique for each system across Enphase, Passiv, and PVOutput. This means that we would need to change the library if we wanted to use PV systems outside of these sources.
    • However, we already have the ss_id (sheffield solar ID?) already hard coded into load_everything_into_ram(), so we would likely need to change that anyway.
  2. Stick to system_id.
    • The database wouldn't need to be changed, but I think some work would still need to be done to get_pv_power_from_database() and maybe think about the encoding.
    • We would also need to pull system_id through load_everything_into_ram(), and that may involve some renaming of the IDs variables.
  3. Create our own ID strategy, perhaps like the encode_label() and apply this everywhere consistently.
  4. Commit to using multiple IDs.
    • Add system_id to the dataset created in load_everything_into_ram() and select systems based on this.
    • Don't change get_pv_power_from_database() but just select the IDs like system_id -> int(f"{system_id}1")
    • This could be the easiest option to implement, but feels like it adds technical debt
@dfulu dfulu added the bug Something isn't working label Aug 9, 2023
@dfulu
Copy link
Member Author

dfulu commented Aug 9, 2023

@jacobbieker and @peterdudfield, thoughts?

@jacobbieker
Copy link
Member

I would be inclined to our own encode_label() option, as we use more PV systems, it could keep the code a bit easier than multiple ID systems. Keeping them consistent might be a bit of work, but probably good long term too

@peterdudfield
Copy link
Contributor

peterdudfield commented Aug 10, 2023

Thanks for putting this all together, its really nice having it all laid out. Ive put some thought into here - openclimatefix/nowcasting_dataset#691, but never finished it.

Few comments

  • something seems wrong that load_everything_into_ram renames something into what it isnt. That probably made some of the confusion.
  • At some point there was a pv_system_row_number which I guess is like our encoding. If we want to encode the PV system into the NN model, we probably should use as low numbers as possible. If so, then a CSV with the columns, id, system_id, client_id, provider would probably be needed. You could just use the id's in the database already
  1. The problem with 1, is we will soon have a few systems that are not coming through Sheffield Solar, so that might muck this up.

  2. changing the database is annoying.

  3. is what I've linked too, and perhaps that is the good way to do it. Enocoding like this would work, i'd been tempted to do n -> int(f"{n}{xx}") where xx i the number of provider. We might quickly go over 10 so 2 digits gives us a bit more room

  4. if you have time not add tech debt, then we should

@dfulu
Copy link
Member Author

dfulu commented Aug 10, 2023

I agree that load_everything_into_ram() should be modified.

pv_system_row_number is computed based on the input dataset and I don't know if it would be a reliable ID to use to identify the system between datasets. Like between training and production. When loading training data, that number comes from the order of the PV systems in the training dataset, like the PV/Passive/ocf_formatted/v0/passive.netcdf file on GCP rather than being from the metadata csv. The order of the the PV systems in the passive.netcdf (and therefore the pv_system_row_number) seems to be quite arbitrary.

Also, unless I'm mistaken, for loading the PV data from the database we set the pv_system_row_number to 1 for all systems.

Something like 3 might be the best bet

@dfulu
Copy link
Member Author

dfulu commented Oct 27, 2023

I'm going to close this now. This was fixed by #225 and #226 and by adding ml_id to the training data in order to match the database data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants