
Which Capacity should we use? #692

Open

peterdudfield opened this issue Jun 28, 2022 · 8 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Jun 28, 2022

Which capacity of the PV system should we use?
It is useful to normalise the PV data ready for ML, and using the capacity makes sense.
Importantly, training and prediction must use the same value. Hence #691.

The metadata for pvoutput.org and Passiv provides capacity values.
Note that PV capacity can degrade over time, so a static value might not be ideal.

1. Use the maximum values of the training set

Pros:

  • the system will be between 0 and 1 inclusive
  • ML model can be PV system agnostic, as data will be between 0 and 1.

Cons:

  • the prediction data might not be between 0 and 1 if similarly high values were not seen in training (unlikely)
  • I did this with the first CNN model and had to increase the capacities of ~10 systems, and reduce the capacities of ~10 others

2. Use metadata

Pros:

  • This number is constant; it is given to us

Cons:

  • It could be way off the actual power produced.

3. Hybrid 1

  • use option 1
  • adjust capacity if prediction data > 100%
  • adjust capacity if prediction data < 50% (over a history of collecting it live)

4. Hybrid 2

  • use option 2
  • If training data < 50%, then use option 1
  • If training data > 100%, then use option 1
  • If prediction data > 100%, then use option 1
  • If, over a week of data (which includes good sunny days), data < 50%, then use option 1
    (These lower and upper bounds could change; a sketch of this rule follows below)
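
For illustration, here is a minimal sketch of the Hybrid 2 rule in Python. The function name, inputs, and exact thresholds are assumptions for the sketch, not existing code:

```python
import pandas as pd

# Thresholds from the options above; "these lower and upper bounds could change".
LOWER_BOUND = 0.5   # never reaching 50% of metadata capacity looks suspicious
UPPER_BOUND = 1.0   # exceeding 100% of metadata capacity is impossible


def choose_capacity(power: pd.Series, metadata_capacity: float) -> float:
    """Pick the capacity to normalise by, following the Hybrid 2 rule.

    `power` is a time-indexed series of generation readings for one PV
    system; `metadata_capacity` is the value reported by the provider.
    """
    observed_max = power.max()  # option 1: maximum of the data we have

    # Data exceeds the metadata capacity, so the metadata must be too low.
    if observed_max > UPPER_BOUND * metadata_capacity:
        return observed_max

    # Over the most recent week (which should include some good sunny days)
    # we never get near the metadata capacity: metadata is probably too high.
    last_week = power[power.index >= power.index.max() - pd.Timedelta(days=7)]
    if last_week.max() < LOWER_BOUND * metadata_capacity:
        return observed_max

    # Otherwise trust the metadata (option 2).
    return metadata_capacity
```
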
@peterdudfield
Contributor Author

@JackKelly and @jacobbieker, I would be interested to hear your thoughts

@jacobbieker
Member

I kind of like Hybrid 2 as the option, so that we try to trust the labelled values, but if the data is wrong, we are correcting for it.

@JackKelly
Member

If I remember correctly, @dantravers has thought about this issue! I'd be keen to hear his thoughts!

I don't trust the metadata very much 🙂. Not least because we don't know if the metadata "PV capacity" is the DC capacity; or the AC capacity; or the planned capacity; or the capacity of the grid connection 🙂.

So I'd lean towards using a "robust" statistical method for inferring the max from the entire timeseries (not just the training set, but the entire dataset), e.g. take the 99th percentile (i.e. ignore outliers) as the "max", and then clip the power data at that "max" (to guarantee that the normalised value never exceeds 1).
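
A minimal sketch of that robust normalisation, assuming the readings sit in a pandas Series (the function name and default quantile are illustrative):

```python
import pandas as pd


def normalise_robustly(power: pd.Series, quantile: float = 0.99) -> pd.Series:
    """Normalise by a robust 'max': the 99th percentile of the whole
    timeseries (so one-off outliers are ignored), then clip so the
    normalised value can never exceed 1."""
    robust_max = power.quantile(quantile)
    return (power / robust_max).clip(lower=0.0, upper=1.0)
```
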

On the topic of degradation over time: one way to address this is to give the ML model the age of the PV system (if we have it), rather than trying to manually correct for the degradation by re-computing the PV capacity every, say, year. (One edge case where we might actually want to re-compute the "max" is large PV farms, where they might install, say, 50% of the farm one month and the other 50% another month.)

As a separate issue, I think we also need to come up with a set of algorithms for identifying "dud" PV data (e.g. generating at night. Or generating very little on a sunny day). I'm pretty sure that "bad" PV data is hurting us in multiple ways!
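
One simple check of that kind, sketched below, flags readings that report power while the sun is below the horizon. It assumes pvlib is available and the series has a timezone-aware DatetimeIndex; the function name is illustrative:

```python
import pandas as pd
import pvlib


def flag_night_generation(
    power: pd.Series, latitude: float, longitude: float
) -> pd.Series:
    """Return a boolean mask marking readings that report generation
    while the sun is below the horizon ("dud" data candidates)."""
    solar_position = pvlib.solarposition.get_solarposition(
        power.index, latitude, longitude
    )
    return (power > 0) & (solar_position["apparent_elevation"] < 0)
```
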

@peterdudfield
Contributor Author

Happy to come up with a statistical method. Just need to make sure we make it robust for when we are collecting more data for prediction. We don't want one collected value to force us to retrain the entire model. But perhaps we can clean up the re-training so it's more like an iterative process, so maybe we will be ok.

I like the 99th percentile; that gets rid of any blips in the data.

@dantravers

dantravers commented Jun 29, 2022

You're right, @JackKelly - I've thought about this a decent amount!
I was using capacity for slightly different purposes - more on the post-analysis of results - so I was more interested in preserving the actual installed capacity (to get true yield values). In the ML forecast model, if you don't mind what the normalisation is, then the true capacity matters less (as long as you keep the range [0, 1]).

But roughly:
I found there were a number of systems which had crazy high individual half hours, with values way above what was possible. To remediate this I looked at the right-side tail for outliers by taking a high percentile (Y% - I used about 99.9%) and then checking whether the max outturn value was more than x% (I used 20%, but up for discussion) higher than that value.
Case 1: If this wasn't the case, then I was happy to believe the right-side tail was valid. In this case we could normalize by the max outturn.
Case 2: If this was the case, then I threw away the erroneous values. In our situation, we could cap at the Y%.
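
A sketch of that Y% / x% check, with both thresholds as illustrative defaults (names are assumptions, not existing code):

```python
import pandas as pd


def cap_right_tail(
    power: pd.Series, y_quantile: float = 0.999, x_margin: float = 0.20
) -> pd.Series:
    """Apply the right-tail test described above.

    Case 1: the max outturn is within x% of the Y% quantile, so the
    right-side tail looks valid and the data is left alone.
    Case 2: the max is more than x% above the quantile, so the tail is
    treated as erroneous and the series is capped at the quantile.
    """
    y_value = power.quantile(y_quantile)
    if power.max() > (1 + x_margin) * y_value:
        return power.clip(upper=y_value)  # Case 2: cap the erroneous values
    return power  # Case 1: believe the right-side tail
```
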

One option would be to do Case 2 all the time and cap at Y%, which I think is OK; but if you look at the distribution of outturn values, very often there is interesting stuff happening at the top of the range, which I would be loath to lose unless we had to.

Note: I'm not sure if this will have any bearing on the ML, but the actual installed capacity values will be quite different from what we are calculating here, and it will depend on the orientation of the systems. I.e. north-facing systems will have higher installed capacity than south-facing ones for the same observed generation figures.

@peterdudfield
Contributor Author

Perhaps a good way to do this would be to have

  • installed_capacity (from provider)
  • installed_capacity_calculated

Then we store both and use installed_capacity_calculated for ML tasks.
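
For concreteness, a minimal sketch of those two columns as a SQLAlchemy model; the table and column layout here is illustrative only, not the actual nowcasting schema:

```python
from sqlalchemy import Column, Float, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class PVSystem(Base):
    """Illustrative table holding both capacity values per PV system."""

    __tablename__ = "pv_system"

    id = Column(Integer, primary_key=True)
    # Nameplate value as reported by the provider (pvoutput.org / Passiv).
    installed_capacity = Column(Float)
    # Robust value derived from the data itself; the one used for ML tasks.
    installed_capacity_calculated = Column(Float)
```
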

@dantravers

dantravers commented Jun 29, 2022 via email

@peterdudfield
Contributor Author

peterdudfield commented Jun 29, 2022

I would be tempted to try to keep the database clean, so we store the nameplate capacity and our ML capacity. The other things we can get using analysis of the data.

I realise it was not obvious from above that I was talking about the database - sorry about that.
