
Added PV Fleets QA pipeline examples #202

Merged — 38 commits merged into pvlib:main on Feb 12, 2024

Conversation

Member

@kperrynrel kperrynrel commented Jan 11, 2024

  • Closes #201: Power, irradiance, and temperature examples from PV Fleets pipeline
  • Non-API functions clearly documented with docstrings or comments as necessary.
  • Adds description and name entries in the appropriate "what's new" file in
    docs/whatsnew for all changes. Includes a link to the GitHub Issue with
    :issue:`num` or this Pull Request with :pull:`num`, and the contributor
    name and/or GitHub username (link with :ghuser:`user`).
  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

This PR includes QA examples of the PV fleets pipeline, adapted from the following repo: https://github.com/kperrynrel/fleets_qa_examples

@cwhanse
Member

cwhanse commented Jan 11, 2024

@kperrynrel I think you'll need a README.rst file in order for the Examples section to be created. I would call this group of examples "PVFleets QA pipeline".

@kperrynrel kperrynrel self-assigned this Jan 11, 2024
@kperrynrel kperrynrel added the documentation Improvements or additions to documentation label Jan 11, 2024
@kperrynrel kperrynrel added this to the v0.2.0 milestone Jan 11, 2024
@kandersolar
Member

The test failures can be ignored here; they'll be fixed in #203.

@kperrynrel
Member Author

@kandersolar these are ready for review. Do we have a limit on example data size? The files are bigger for these examples and I could host them on the Duramat datahub if need be. Let me know!

@cwhanse cwhanse left a comment (Member)

@kperrynrel many of the style-related comments in the irradiance example also apply to the power example.

These are lengthy scripts, thanks for preparing them.

Are there steps that you think could be packaged into new pvanalytics functions? Something to consider for future work.

docs/examples/pvfleets-qa-pipeline/README.rst
Comment on lines 130 to 136
# Filter the time series, taking out all of the issues
time_series = time_series[~stale_data_mask]
time_series = time_series[~negative_mask]
time_series = time_series[~erroneous_mask]
time_series = time_series[~out_of_bounds_mask]
time_series = time_series[~zscore_outlier_mask]
time_series = time_series.asfreq(data_freq)
Member

This sequential filtering makes me uncomfortable. Although the plot shows that time_series maintains its original length, the statements make me think that points are being dropped. I may be reading these lines as arrays, and they may be OK because pandas handles the indexing.

Member Author (@kperrynrel), Jan 29, 2024

The asfreq() call at the end of this returns the data to its original frequency, with NaNs for the data that's been filtered out. To make this less sequential, I have updated this to a single call:

# Filter the time series, taking out all of the issues
issue_mask = ((~stale_data_mask) & (~negative_mask) &
              (~erroneous_mask) & (~zscore_outlier_mask))
time_series = time_series[issue_mask]
time_series = time_series.asfreq(data_freq)
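To see why asfreq() restores the original length, here is a minimal toy series (hypothetical data, not from the example files): boolean-mask filtering drops the flagged rows, and asfreq() then rebuilds the regular index, reinserting NaN at every gap.

```python
import pandas as pd

# Hypothetical 1-minute series with one bad (negative) point
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="1min")
ts = pd.Series([1.0, 2.0, -5.0, 3.0, 4.0, 5.0], index=idx)

mask = ts < 0                        # flag the bad point
filtered = ts[~mask]                 # 5 rows: the flagged row is dropped
restored = filtered.asfreq("1min")   # 6 rows again, NaN at the gap
```

So the filtered series really is shorter in the middle steps, but the final asfreq() call brings it back to a complete, regular index.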

Member

That's much easier to read, thanks

# post-filtering.

# Filter the time series, taking out all of the issues
time_series = time_series[~stale_data_mask]
Member

Same comment as in the irradiance example: do these filters change the length of the Series?

Member Author (@kperrynrel)

Did the same here as above to make the mask a single call. asfreq() returns the time series to its original data frequency with NaNs:

# Filter the time series, taking out all of the issues
issue_mask = ((~stale_data_mask) & (temperature_limit_mask) &
              (~zscore_outlier_mask))
time_series = time_series[issue_mask]
time_series = time_series.asfreq(data_freq)

docs/examples/pvfleets-qa-pipeline/pvfleets-power-qa.py
@kandersolar
Member

Do we have a limit on example data size? The files are bigger for these examples and I could host them on the Duramat datahub if need be.

Looks like adding these files will increase the size of the data directory from ~2 MB to ~80 MB. It would be nice to keep both the repository and the PyPI distribution files lightweight if we can, and that big of an increase for data files that many/most installations won't ever touch does seem a bit wasteful. Keeping the datasets on the Data Hub instead of this repository comes with the downside of having to make network requests during the docs build, which isn't desirable either.

I explored a few alternative formats for system_50_ac_power_2_full_DST.csv. Of course storing as a binary format like parquet helps, as does cutting off unnecessary digits of precision in the power data, and using an encoding for the timestamp index that exploits the consistent 1-minute interval. Results:

  • CSV (current form): 50.8 MB
  • Zipped CSV: 7.9 MB
  • parquet: 14.8 MB
  • parquet, float32: 14.2 MB
  • parquet, float32, delta-encoded index: 4.6 MB
  • parquet, float32, delta-encoded index, negative values clipped to zero: 3.2 MB

For a dataset with ~1.4 million values (plus as many timestamps), I doubt there's much more opportunity for easy reduction via encoding tweaks. For reference, here is the code to produce the 3.2 MB file:

import numpy as np
import pandas as pd

df = pd.read_csv("system_50_ac_power_2_full_DST.csv", index_col=0, parse_dates=True)
df.astype(np.float32).clip(lower=0).reset_index().to_parquet("system_50_encoded.parquet",
                                                             use_dictionary=False,
                                                             column_encoding={"measured_on": "DELTA_BINARY_PACKED"})

The next question is: do we need the full data file? It's currently ~2.5 years. Cutting it down to one year gets the size down to 1.2 MB. 6 months is only 600 kB. Would shorter data periods still achieve the goals of these examples? Or resample the file to 5-minute averages?
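The trimming and resampling options could be sketched as follows (toy data and hypothetical variable names, just to show the pandas calls involved):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 1-minute power data
idx = pd.date_range("2019-01-01", periods=10, freq="1min")
df = pd.DataFrame({"ac_power": np.arange(10.0)}, index=idx)

# Option 1: keep a shorter window, e.g. a single year (partial-string indexing)
df_year = df.loc["2019"]

# Option 2: downsample to 5-minute averages (10 one-minute rows -> 2 bins)
df_5min = df.resample("5min").mean()
```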

@kperrynrel
Member Author

@kandersolar I resampled all the data to 15-minute averages, and then converted to parquet format. We are now getting an error on parquet support:

  • Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
  • Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

You ok with me adding either fastparquet or pyarrow as a dependency? Also, do you have a preference on one?

@kandersolar
Member

You ok with me adding either fastparquet or pyarrow as a dependency? Also, do you have a preference on one?

Yep! I think we can get away with keeping it as a docs requirement rather than an installation requirement. I'd say mild preference for pyarrow for its maturity.

@kperrynrel
Member Author

@kandersolar @cwhanse let me know if there's anything else I can update here before the 2.0 release! I updated some of the matplotlib rendering yesterday and am feeling ok with how it is looking. Thanks!



# %%
# Display the final irradiance time series, post-QA filtering.
Member

This plot isn't rendering

Member (@cwhanse), Feb 9, 2024

That plot still isn't rendering, idk why not. @kperrynrel

Member Author (@kperrynrel)

Looks like it's rendering now, not sure what was going on with it:
[image: screenshot of the rendered plot]

@cwhanse cwhanse left a comment (Member)

Thanks @kperrynrel

@kandersolar kandersolar left a comment (Member)

Thanks @kperrynrel! One minor comment, otherwise good to merge

Comment on lines 61 to 62
# 3) "Abnormal" data periods, which are defined as less than 10% of the
# daily time series mean OR greater than 1300
Member

This "less than 10% of daily mean" rule is used in the power analysis code, but not here. Should this text say something about the daily minimum being above 50 W/m2?

Member Author (@kperrynrel), Feb 12, 2024

Good catch @kandersolar, updated the docstring to the following in commit 0cadb6e:

  1. "Abnormal" data periods, which are defined as less than the daily minimum of 50 OR any data greater than 1300

Looks like the docs rendering in the checks isn't updating for some reason, even though the underlying code has been updated. Not sure what's going on with that.

Member

Should it be "greater than" instead of "less than"?

I often have to manually refresh the RTD page to see updates after a rebuild. Something to do with browser caching. I'm guessing that's what you're experiencing too.

Member Author (@kperrynrel)

oops, updated in 5a01c3a

@kandersolar kandersolar merged commit 5e05983 into pvlib:main Feb 12, 2024
42 checks passed
Labels: documentation (Improvements or additions to documentation)
Linked issue: Power, irradiance, and temperature examples from PV Fleets pipeline? (#201)
3 participants