-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added PV Fleets QA pipeline examples #202
Conversation
@kperrynrel I think you'll need a README.rst file in order for the section of the Examples to be created. I would call this group of examples PVFleets QA pipeline. |
The test failures can be ignored here; they'll be fixed in #203. |
@kandersolar these are ready for review. Do we have a limit on example data size? The files are bigger for these examples and I could host them on the Duramat datahub if need be. Let me know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kperrynrel many of the style-related comments in the irradiance example also apply to the power example.
These are lengthy scripts, thanks for preparing them.
Are there steps that you think could be packaged into new pvanalytics functions? Something to consider for future work.
# Filter the time series, taking out all of the issues | ||
time_series = time_series[~stale_data_mask] | ||
time_series = time_series[~negative_mask] | ||
time_series = time_series[erroneous_mask] | ||
time_series = time_series[~out_of_bounds_mask] | ||
time_series = time_series[~zscore_outlier_mask] | ||
time_series = time_series.asfreq(data_freq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sequential filtering makes me uncomfortable. Although the plot shows that time_series
maintains its original length, the statements make me think that points are being dropped. I may be reading these lines as arrays
and these are OK because pandas
handles the indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The asfreq() call at the end of this returns the data to its original frequency, with NaN's for the data that's been filtered out. To make this less sequential, I have updated this to a single call:
# Filter the time series, taking out all of the issues
issue_mask = ((~stale_data_mask) & (~negative_mask) &
(~erroneous_mask) & (~zscore_outlier_mask))
time_series = time_series[issue_mask]
time_series = time_series.asfreq(data_freq)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's much easier to read, thanks
# post-filtering. | ||
|
||
# Filter the time series, taking out all of the issues | ||
time_series = time_series[~stale_data_mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as in the irradiance example: do these filters change the length of the Series
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did the same here as above to make the mask a single call. asfreq() returns the time series to original data frequency with NaN's:
# Filter the time series, taking out all of the issues
issue_mask = ((~stale_data_mask) & (temperature_limit_mask) &
(~zscore_outlier_mask))
time_series = time_series[issue_mask]
time_series = time_series.asfreq(data_freq)
Co-authored-by: Cliff Hansen <[email protected]>
Co-authored-by: Cliff Hansen <[email protected]>
Co-authored-by: Cliff Hansen <[email protected]>
Co-authored-by: Cliff Hansen <[email protected]>
Looks like adding these files will increase the size of the I explored a few alternative formats for
For a dataset with ~1.4 million values (plus as many timestamps), I doubt there's much more opportunity for easy reduction via encoding tweaks. For reference, here is the code to produce the 3.2 MB file: df = pd.read_csv("system_50_ac_power_2_full_DST.csv", index_col=0, parse_dates=True)
df.astype(np.float32).clip(lower=0).reset_index().to_parquet("system_50_encoded.parquet",
use_dictionary=False,
column_encoding={"measured_on": "DELTA_BINARY_PACKED"}) The next question is: do we need the full data file? It's currently ~2.5 years. Cutting it down to one year gets the size down to 1.2 MB. 6 months is only 600 kB. Would shorter data periods still achieve the goals of these examples? Or resample the file to 5-minute averages? |
@kandersolar I upsampled all the data so it's 15 minute averaged, and then converted to parquet format. We are now getting an error on parquet support:
You ok with me adding either fastpaquet or pyarrow as a dependency? Also, do you have a preference on one? |
Yep! I think we can get away with keeping it as a docs requirement rather than an installation requirement. I'd say mild preference for pyarrow for its maturity. |
@kandersolar @cwhanse let me know if there's anything else I can update here before the 2.0 release! I updated some of the matplotlib rendering yesterday and am feeling ok with how it is looking. Thanks! |
|
||
|
||
# %% | ||
# Display the final irradiance time series, post-QA filtering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This plot isn't rendering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That plot still isn't rendering, idk why not. @kperrynrel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kperrynrel
Co-authored-by: Cliff Hansen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kperrynrel! One minor comment, otherwise good to merge
# 3) "Abnormal" data periods, which are defined as less than 10% of the | ||
# daily time series mean OR greater than 1300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "less than 10% of daily mean" rule is used in the power analysis code, but not here. Should this text say something about the daily minimum being above 50 W/m2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch at @kandersolar, updated docstring to the following in commit 0cadb6e:
- "Abnormal" data periods, which are defined as less than the daily minimum of 50 OR any data greater than 1300
Looks like the docs rendering in the checks for some reason isn't updating, even though the underlying code has been updated. Not sure what's going on with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be "greater than" instead of "less than"?
I often have to manually refresh the RTD page to see updates after a rebuild. Something to do with browser caching. I'm guessing that's what you're experiencing too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, updated in 5a01c3a
in
docs/whatsnew
for all changes. Includes link to the GitHub Issue with
:issue:`num`
or this Pull Request with
:pull:`num`
. Includes contributor nameand/or GitHub username (link with
:ghuser:`user`
).This PR includes QA examples of the PV fleets pipeline, adapted from the following repo: https://github.com/kperrynrel/fleets_qa_examples