Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PV Fleets QA pipeline examples #202

Merged
merged 38 commits into from
Feb 12, 2024
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
dd40be1
Added the temperature QA example
kperrynrel Jan 11, 2024
ac03286
added temp QA dictionary
kperrynrel Jan 11, 2024
63ea150
fixing pep8 errors for temp example
kperrynrel Jan 11, 2024
2c94bc2
added readme.rst at @cwhanse's guidance
kperrynrel Jan 11, 2024
f38f39a
updated the irradiance routine, currently working on power
kperrynrel Jan 11, 2024
c65689e
pep8 compliant formatting
kperrynrel Jan 11, 2024
b6de6ab
refactored the power example so it runs end-to-end
kperrynrel Jan 11, 2024
7f000b7
added psm3 data
kperrynrel Jan 17, 2024
6ccc21c
debug psm3/df NaNs
kperrynrel Jan 20, 2024
ed39d0f
update documentation error with graphic
kperrynrel Jan 22, 2024
ead2719
more cleanup of the examples
kperrynrel Jan 24, 2024
cf3cb79
added cutoff line for daily completeness score
kperrynrel Jan 24, 2024
ff8e263
updated the whatsnew file
kperrynrel Jan 24, 2024
3eb3e19
fixing example errors
kperrynrel Jan 24, 2024
4083d7b
fix completeness score graphics
kperrynrel Jan 24, 2024
6578747
Update docs/examples/pvfleets-qa-pipeline/pvfleets-power-qa.py
kperrynrel Jan 24, 2024
793956f
Update docs/examples/pvfleets-qa-pipeline/README.rst
kperrynrel Jan 24, 2024
f16a078
Update docs/examples/pvfleets-qa-pipeline/pvfleets-irradiance-qa.py
kperrynrel Jan 24, 2024
3dbc89b
Update docs/examples/pvfleets-qa-pipeline/pvfleets-irradiance-qa.py
kperrynrel Jan 24, 2024
cea4fbb
fixed redundant data shift call in scripts
kperrynrel Jan 25, 2024
d3c2d56
more cleanup of text
kperrynrel Jan 25, 2024
54d7768
fixed erroneous data issues
kperrynrel Jan 29, 2024
81cf2a2
cleaned up the mask sequencing
kperrynrel Jan 29, 2024
f93f6c5
updated the mask issues, making data issue mask a single mask
kperrynrel Jan 29, 2024
e209d0e
significantly reduced the file sizes by taking smaller snapshots + re…
kperrynrel Jan 30, 2024
ccf7e13
more updates to file sizes
kperrynrel Jan 30, 2024
23c810d
updated the irradiance stream size
kperrynrel Jan 30, 2024
8f66349
added parquet files for gallery testing
kperrynrel Feb 2, 2024
850b650
added pyarrow as doc requirement
kperrynrel Feb 8, 2024
eb8ac00
update the plots so the axes aren't cutoff--fix other issues with T->min
kperrynrel Feb 8, 2024
93a59d5
Merge branch 'main' into pvfleets-examples
kperrynrel Feb 8, 2024
bb800d9
fixed pep8 issues
kperrynrel Feb 8, 2024
66d0d21
updated the routine to get rid of power error
kperrynrel Feb 8, 2024
d12a58f
more standardizing of the output graphs
kperrynrel Feb 9, 2024
70fdaa3
Update docs/examples/pvfleets-qa-pipeline/pvfleets-irradiance-qa.py
kperrynrel Feb 9, 2024
99e8461
addressing doctstring comments that @kandersolar pointed out
kperrynrel Feb 12, 2024
0cadb6e
merging in pushes
kperrynrel Feb 12, 2024
5a01c3a
updates to docstring
kperrynrel Feb 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/examples/pvfleets-qa-pipeline/README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
PVFleets QA Examples
--------------------

This includes examples highlighting the QA process for temperature, power and irradiance data streams that is used in the NREL
PV Fleet Performance Data Initiative (https://www.nrel.gov/pv/fleet-performance-data-initiative.html).
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved
375 changes: 375 additions & 0 deletions docs/examples/pvfleets-qa-pipeline/pvfleets-irradiance-qa.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,375 @@
"""
PV Fleets QA Process: Irradiance
================================

PV Fleets Irradiance QA Pipeline
"""

# %%
# The NREL PV Fleets Data Initiative uses PVAnalytics routines to assess the
# quality of systems' PV data. In this example, the PV Fleets process for
# assessing the data quality of an irradiance data stream is shown. This
# example pipeline illustrates how several PVAnalytics functions can be used
# in sequence to assess the quality of an irradiance data stream.

import pandas as pd
import pathlib
from matplotlib import pyplot as plt
import pvanalytics
import pvlib
from pvanalytics.quality import data_shifts as ds
from pvanalytics.quality import gaps
from pvanalytics.quality.outliers import zscore
from pvanalytics.features.daytime import power_or_irradiance
from pvanalytics.quality.time import shifts_ruptures
from pvanalytics.features import daytime

# %%
# First, we import a POA irradiance data stream from a PV installation
# at NREL. This data set is publicly available via the PVDAQ database in the
# DOE Open Energy Data Initiative (OEDI)
# (https://data.openei.org/submissions/4568), under system ID 15.
# This data is timezone-localized.

pvanalytics_dir = pathlib.Path(pvanalytics.__file__).parent
file = pvanalytics_dir / 'data' / 'system_15_poa_irradiance.csv'
time_series = pd.read_csv(file, index_col=0, parse_dates=True).squeeze()
latitude = 39.7406
longitude = -105.1775
data_freq = '15T'
time_series = time_series.asfreq(data_freq)

# %%
# First, let's visualize the original time series as reference.

time_series.plot(title="Original Time Series")
plt.show()

# %%
# Now, let's run basic data checks to identify stale and abnormal/outlier
# data in the time series. Basic data checks include the following steps:
#
# 1) Flatlined/stale data periods (:py:func:`pvanalytics.quality.gaps.stale_values_round`)
# 2) Negative irradiance data
# 3) "Abnormal" data periods, which are defined as less than 10% of the
# daily time series mean OR greater than 1300
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "less than 10% of daily mean" rule is used in the power analysis code, but not here. Should this text say something about the daily minimum being above 50 W/m2?

Copy link
Member Author

@kperrynrel kperrynrel Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch at @kandersolar, updated docstring to the following in commit 0cadb6e:

  1. "Abnormal" data periods, which are defined as less than the daily minimum of 50 OR any data greater than 1300

Looks like the docs rendering in the checks for some reason isn't updating, even though the underlying code has been updated. Not sure what's going on with that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be "greater than" instead of "less than"?

I often have to manually refresh the RTD page to see updates after a rebuild. Something to do with browser caching. I'm guessing that's what you're experiencing too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, updated in 5a01c3a

# 4) Outliers, which are defined as more than one 4 standard deviations
# away from the mean (:py:func:`pvanalytics.quality.outliers.zscore`)

# 1) REMOVE STALE DATA (that isn't during nighttime periods)
# Day/night mask
daytime_mask = power_or_irradiance(time_series)
# Stale data mask
stale_data_mask = gaps.stale_values_round(time_series,
window=3,
decimals=2)
stale_data_mask.loc[(stale_data_mask) &
(~daytime_mask)] = False
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved

# 2) REMOVE NEGATIVE DATA
negative_mask = (time_series < 0)

# FIND ABNORMAL PERIODS
daily_min = time_series.resample('D').min()
erroneous_mask = (daily_min < 50)
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved
erroneous_mask = erroneous_mask.reindex(index=time_series.index,
method='ffill',
fill_value=False)

# Remove values greater than or equal to 1300
out_of_bounds_mask = (time_series >= 1300)

# FIND OUTLIERS (Z-SCORE FILTER)
zscore_outlier_mask = zscore(time_series,
zmax=4,
nan_policy='omit')

# Get the percentage of data flagged for each issue, so it can later be logged
pct_stale = round((len(time_series[
stale_data_mask].dropna())/len(time_series.dropna())*100), 1)
pct_negative = round((len(time_series[
negative_mask].dropna())/len(time_series.dropna())*100), 1)
pct_erroneous = round((len(time_series[
~erroneous_mask].dropna())/len(time_series.dropna())*100), 1)
pct_outlier = round((len(time_series[
zscore_outlier_mask].dropna())/len(time_series.dropna())*100), 1)

# Visualize all of the time series issues (stale, abnormal, outlier, etc)
time_series.plot()
labels = ["Irradiance"]
if any(stale_data_mask):
time_series.loc[stale_data_mask].plot(ls='', marker='o', color="green")
labels.append("Stale")
if any(negative_mask):
time_series.loc[negative_mask].plot(ls='', marker='o', color="orange")
labels.append("Negative")
if any(~erroneous_mask):
time_series.loc[~erroneous_mask].plot(ls='', marker='o', color="yellow")
labels.append("Abnormal")
if any(out_of_bounds_mask):
time_series.loc[out_of_bounds_mask].plot(ls='', marker='o', color="yellow")
labels.append("Too High")
if any(zscore_outlier_mask):
time_series.loc[zscore_outlier_mask].plot(
ls='', marker='o', color="purple")
labels.append("Outlier")
plt.legend(labels=labels)
plt.title("Time Series Labeled for Basic Issues")
plt.xticks(rotation=20)
plt.xlabel("Date")
plt.ylabel("Irradiance")
plt.tight_layout()
plt.show()


# %%
# Now, let's filter out any of the flagged data from the basic irradiance
# checks (stale or abnormal data). Then we can re-visualize the data
# post-filtering.

# Filter the time series, taking out all of the issues
time_series = time_series[~stale_data_mask]
time_series = time_series[~negative_mask]
time_series = time_series[erroneous_mask]
time_series = time_series[~out_of_bounds_mask]
time_series = time_series[~zscore_outlier_mask]
time_series = time_series.asfreq(data_freq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sequential filtering makes me uncomfortable. Although the plot shows that time_series maintains its original length, the statements make me think that points are being dropped. I may be reading these lines as arrays and these are OK because pandas handles the indexing.

Copy link
Member Author

@kperrynrel kperrynrel Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The asfreq() call at the end of this returns the data to its original frequency, with NaN's for the data that's been filtered out. To make this less sequential, I have updated this to a single call:

# Filter the time series, taking out all of the issues
issue_mask = ((~stale_data_mask) & (~negative_mask) &
          (~erroneous_mask) & (~zscore_outlier_mask))
time_series = time_series[issue_mask]
time_series = time_series.asfreq(data_freq)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's much easier to read, thanks


# Visualize the time series post-filtering
time_series.plot(title="Time Series Post-Basic Data Filtering")
plt.show()

# %%
# We filter the time series based on its daily completeness score. This
# filtering scheme requires at least 25% of data to be present for each day to
# be included. We further require at least 10 consecutive days meeting this
# 25% threshold to be included.

# Visualize daily data completeness
data_completeness_score = gaps.completeness_score(time_series)

# Visualize data completeness score as a time series.
data_completeness_score.plot()
plt.xlabel("Date")
plt.ylabel("Daily Completeness Score (Fractional)")
plt.axhline(y=0.25, color='r', linestyle='-',
label='Daily Completeness Cutoff')
plt.legend()
plt.tight_layout()
plt.show()

# Trim the series based on daily completeness score
trim_series = pvanalytics.quality.gaps.trim_incomplete(
time_series,
minimum_completeness=.25,
freq=data_freq)
first_valid_date, last_valid_date = \
pvanalytics.quality.gaps.start_stop_dates(trim_series)
time_series = time_series[first_valid_date.tz_convert(time_series.index.tz):
last_valid_date.tz_convert(time_series.index.tz)]
time_series = time_series.asfreq(data_freq)

# %%
# Next, we check the time series for any time shifts, which may be caused by
# time drift or by incorrect time zone assignment. To do this, we compared
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved
# the modelled midday time for the particular system location to its
# measured midday time. We use
# :py:func:`pvanalytics.quality.gaps.stale_values_round`) to determine the
# presence of time shifts in the series.

# Get the modeled sunrise and sunset time series based on the system's
# latitude-longitude coordinates
modeled_sunrise_sunset_df = pvlib.solarposition.sun_rise_set_transit_spa(
time_series.index, latitude, longitude)

# Calculate the midday point between sunrise and sunset for each day
# in the modeled irradiance series
modeled_midday_series = modeled_sunrise_sunset_df['sunrise'] + \
(modeled_sunrise_sunset_df['sunset'] -
modeled_sunrise_sunset_df['sunrise']) / 2

# Run day-night mask on the irradiance time series
daytime_mask = power_or_irradiance(time_series,
freq=data_freq,
low_value_threshold=.005)

# Generate the sunrise, sunset, and halfway points for the data stream
sunrise_series = daytime.get_sunrise(daytime_mask)
sunset_series = daytime.get_sunset(daytime_mask)
midday_series = sunrise_series + ((sunset_series - sunrise_series)/2)

# Convert the midday and modeled midday series to daily values
midday_series_daily, modeled_midday_series_daily = (
midday_series.resample('D').mean(),
modeled_midday_series.resample('D').mean())

# Set midday value series as minutes since midnight, from midday datetime
# values
midday_series_daily = (midday_series_daily.dt.hour * 60 +
midday_series_daily.dt.minute +
midday_series_daily.dt.second / 60)
modeled_midday_series_daily = \
(modeled_midday_series_daily.dt.hour * 60 +
modeled_midday_series_daily.dt.minute +
modeled_midday_series_daily.dt.second / 60)

# Estimate the time shifts by comparing the modelled midday point to the
# measured midday point.
is_shifted, time_shift_series = shifts_ruptures(midday_series_daily,
modeled_midday_series_daily,
period_min=15,
shift_min=15,
zscore_cutoff=1.5)

# Create a midday difference series between modeled and measured midday, to
# visualize time shifts. First, resample each time series to daily frequency,
# and compare the data stream's daily halfway point to the modeled halfway
# point
midday_diff_series = (midday_series.resample('D').mean() -
modeled_midday_series.resample('D').mean()
).dt.total_seconds() / 60

# Generate boolean for detected time shifts
if any(time_shift_series != 0):
time_shifts_detected = True
else:
time_shifts_detected = False

# Build a list of dictionaries for time shifts
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved
time_shift_series.index = pd.to_datetime(
time_shift_series.index)
changepoints = (time_shift_series != time_shift_series.shift(1))
changepoints = changepoints[changepoints].index
changepoint_amts = pd.Series(time_shift_series.loc[changepoints])
time_shift_list = list()
for idx in range(len(changepoint_amts)):
if changepoint_amts[idx] == 0:
change_amt = 0
else:
change_amt = -1 * changepoint_amts[idx]
if idx < (len(changepoint_amts) - 1):
time_shift_list.append({"datetime_start":
str(changepoint_amts.index[idx]),
"datetime_end":
str(changepoint_amts.index[idx + 1]),
"time_shift": change_amt})
else:
time_shift_list.append({"datetime_start":
str(changepoint_amts.index[idx]),
"datetime_end":
str(time_shift_series.index.max()),
"time_shift": change_amt})

# Correct any time shifts in the time series
new_index = pd.Series(time_series.index, index=time_series.index)
for i in time_shift_list:
new_index[(time_series.index >= pd.to_datetime(i['datetime_start'])) &
(time_series.index < pd.to_datetime(i['datetime_end']))] = \
time_series.index + pd.Timedelta(minutes=i['time_shift'])
time_series.index = new_index

# Remove duplicated indices and sort the time series (just in case)
time_series = time_series[~time_series.index.duplicated(
keep='first')].sort_index()

# Plot the difference between measured and modeled midday, as well as the
# CPD-estimated time shift series.
midday_diff_series.plot()
time_shift_series.plot()
plt.title("Midday Difference Time Shift Series")
plt.show()

# Plot the heatmap of the irradiance time series
plt.figure()
# Get time of day from the associated datetime column
time_of_day = pd.Series(time_series.index.hour +
time_series.index.minute/60,
index=time_series.index)
# Pivot the dataframe
dataframe = pd.DataFrame(pd.concat([time_series, time_of_day], axis=1))
dataframe.columns = ["values", 'time_of_day']
dataframe = dataframe.dropna()
dataframe_pivoted = dataframe.pivot_table(index='time_of_day',
columns=dataframe.index.date,
values="values")
plt.pcolormesh(dataframe_pivoted.columns,
dataframe_pivoted.index,
dataframe_pivoted,
shading='auto')
plt.ylabel('Time of day [0-24]')
plt.xlabel('Date')
plt.xticks(rotation=60)
plt.title('Post-Correction Heatmap, Time of Day')
plt.colorbar()
plt.tight_layout()
plt.show()

# %%
# Next, we check the time series for any abrupt data shifts. We take the
# longest continuous part of the time series that is free of data shifts.
# We use :py:func:`pvanalytics.quality.data_shifts.detect_data_shifts` to
# detect data shifts in the time series.

# Resample the time series to daily mean
time_series_daily = time_series.resample('D').mean()
data_shift_start_date, data_shift_end_date = \
ds.get_longest_shift_segment_dates(time_series_daily)
data_shift_period_length = (data_shift_end_date - data_shift_start_date).days

# Get the number of shift dates
data_shift_mask = pvanalytics.quality.data_shifts.detect_data_shifts(
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved
time_series_daily)
# Get the shift dates
shift_dates = list(time_series_daily[data_shift_mask].index)
if len(shift_dates) > 0:
shift_found = True
else:
shift_found = False

# Visualize the time shifts for the daily time series
print("Shift Found:", shift_found)
edges = [time_series_daily.index[0]] + \
shift_dates + [time_series_daily.index[-1]]
fig, ax = plt.subplots()
for (st, ed) in zip(edges[:-1], edges[1:]):
ax.plot(time_series_daily.loc[st:ed])
plt.title("Daily Time Series Labeled for Data Shifts")
plt.show()

# %%
# We filter the time series to only include the longest
# shift-free period.

# Filter the time series to only include the longest shift-free period
time_series = time_series[
(time_series.index >= data_shift_start_date.tz_convert(
time_series.index.tz)) &
(time_series.index <= data_shift_end_date.tz_convert(
time_series.index.tz))]

time_series = time_series.asfreq(data_freq)


# %%
# Display the final irradiance time series, post-QA filtering.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This plot isn't rendering

Copy link
Member

@cwhanse cwhanse Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That plot still isn't rendering, idk why not. @kperrynrel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's rendering now, not sure what was going on with it:
image

time_series.plot(title="Final Filtered Time Series")
plt.show()
plt.close()
kperrynrel marked this conversation as resolved.
Show resolved Hide resolved

# %%
# Generate a dictionary output for the QA assessment of this data stream,
# including the percent stale and erroneous data detected, any shift dates,
# and any detected time shifts.

qa_check_dict = {"original_time_zone_offset": time_series.index.tz,
"pct_stale": pct_stale,
"pct_negative": pct_negative,
"pct_erroneous": pct_erroneous,
"pct_outlier": pct_outlier,
"time_shifts_detected": time_shifts_detected,
"time_shift_list": time_shift_list,
"data_shifts": shift_found,
"shift_dates": shift_dates}

print("QA Results:")
print(qa_check_dict)
Loading
Loading