Commit: TimeSeriesCVSplitter: finalize

mdancho84 committed Nov 6, 2024
1 parent dad84b0 commit fe80f10
Showing 11 changed files with 2,846 additions and 1,375 deletions.
1 change: 1 addition & 0 deletions docs/_sidebar.yml
@@ -45,6 +45,7 @@ website:
section: "\U0001F4CE TS Features"
- contents:
- reference/TimeSeriesCV.qmd
- reference/TimeSeriesCVSplitter.qmd
section: "\U0001F4C8 Time Series Cross Validation (TSCV)"
- contents:
- reference/augment_macd.qmd
15 changes: 11 additions & 4 deletions docs/_site/reference/TimeSeriesCV.html

Large diffs are not rendered by default.

1,326 changes: 1,326 additions & 0 deletions docs/_site/reference/TimeSeriesCVSplitter.html

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/_site/reference/index.html
@@ -526,6 +526,10 @@ <h2 class="anchored" data-anchor-id="time-series-cross-validation-tscv">📈 Tim
<td><a href="../reference/TimeSeriesCV.html#pytimetk.TimeSeriesCV">TimeSeriesCV</a></td>
<td><code>TimeSeriesCV</code> is a subclass of <code>TimeBasedSplit</code> with default mode set to ‘backward’</td>
</tr>
<tr class="even">
<td><a href="../reference/TimeSeriesCVSplitter.html#pytimetk.TimeSeriesCVSplitter">TimeSeriesCVSplitter</a></td>
<td>The <code>TimeSeriesCVSplitter</code> is a scikit-learn compatible cross-validator using <code>TimeSeriesCV</code>.</td>
</tr>
</tbody>
</table>
</section>
2,361 changes: 1,205 additions & 1,156 deletions docs/_site/search.json

Large diffs are not rendered by default.

272 changes: 138 additions & 134 deletions docs/_site/sitemap.xml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/objects.json

Large diffs are not rendered by default.

14 changes: 8 additions & 6 deletions docs/reference/TimeSeriesCV.qmd
@@ -21,14 +21,16 @@ and an optional `split_limit` to return the first `n` slices of time series cros
## Raises:

ValueError:
- If `frequency` is not one of "days", "seconds", "microseconds", "milliseconds", "minutes", "hours",
"weeks".
- If `window` is not one of "rolling" or "expanding".
- If `mode` is not one of "forward" or "backward"
- If `train_size`, `forecast_horizon`, `gap` or `stride` are not strictly positive.

- If `frequency` is not one of "days", "seconds", "microseconds", "milliseconds", "minutes", "hours",
"weeks".
- If `window` is not one of "rolling" or "expanding".
- If `mode` is not one of "forward" or "backward"
- If `train_size`, `forecast_horizon`, `gap` or `stride` are not strictly positive.

TypeError:
If `train_size`, `forecast_horizon`, `gap` or `stride` are not of type `int`.

If `train_size`, `forecast_horizon`, `gap` or `stride` are not of type `int`.
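
As an illustrative aside (not part of this diff), a minimal sketch of the validation described above, assuming the `TimeSeriesCV` constructor used elsewhere in this commit:

```python
from pytimetk import TimeSeriesCV

# Not strictly positive -> ValueError (see the Raises section above).
try:
    TimeSeriesCV(frequency="days", train_size=-14, forecast_horizon=7)
except ValueError as e:
    print(e)

# Not an int -> TypeError.
try:
    TimeSeriesCV(frequency="days", train_size=14.0, forecast_horizon=7)
except TypeError as e:
    print(e)
```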

## Examples:

137 changes: 137 additions & 0 deletions docs/reference/TimeSeriesCVSplitter.qmd
@@ -0,0 +1,137 @@
# TimeSeriesCVSplitter { #pytimetk.TimeSeriesCVSplitter }

`TimeSeriesCVSplitter(self, *, frequency, train_size, forecast_horizon, time_series, gap=0, stride=None, window='rolling', mode='backward', start_dt=None, end_dt=None, split_limit=None)`

The `TimeSeriesCVSplitter` is a scikit-learn compatible cross-validator using `TimeSeriesCV`.

This cross-validator generates splits based on time values, making it suitable for time series data.

## Parameters:

frequency: str
The frequency of the time series (e.g., "days", "hours").
train_size: int
Minimum number of time units in the training set.
forecast_horizon: int
Number of time units to forecast in each split.
time_series: pd.Series
A pandas Series or Index representing the time values.
gap: int
Number of time units to skip between training and testing sets.
stride: int
Number of time units to move forward after each split.
window: str
Type of window, either "rolling" or "expanding".
mode: str
Order of split generation, "forward" or "backward".
start_dt: pd.Timestamp
Start date for the time period.
end_dt: pd.Timestamp
End date for the time period.
split_limit: int
Maximum number of splits to generate. If None, all possible splits will be generated.

## Raises:

ValueError:
If the input arrays are incompatible in length with the time series.

## Returns:

A generator of tuples of arrays containing the training and forecast data.

## See Also:

TimeSeriesCV

## Examples

``` {python}
import pandas as pd
import numpy as np
from pytimetk import TimeSeriesCVSplitter
start_dt = pd.Timestamp(2023, 1, 1)
end_dt = pd.Timestamp(2023, 1, 31)
time_series = pd.Series(pd.date_range(start_dt, end_dt, freq="D"))
size = len(time_series)
df = pd.DataFrame(data=np.random.randn(size, 2), columns=["a", "b"])
X, y = df[["a", "b"]], df[["a", "b"]].sum(axis=1)
cv = TimeSeriesCVSplitter(
time_series=time_series,
frequency="days",
train_size=14,
forecast_horizon=7,
gap=0,
stride=1,
window="rolling",
)
cv
```

``` {python}
# Inspect the cross-validation splits
cv.splitter.plot(y, time_series = time_series)
```

``` {python}
# Using the TimeSeriesCVSplitter in a scikit-learn CV model
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# Fit and get best estimator
param_grid = {
"alpha": np.linspace(0.1, 2, 10),
"fit_intercept": [True, False],
"positive": [True, False],
}
random_search_cv = RandomizedSearchCV(
estimator=Ridge(),
param_distributions=param_grid,
cv=cv,
n_jobs=-1,
).fit(X, y)
random_search_cv.best_estimator_
```
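
Because the splitter follows the scikit-learn cross-validator protocol, the same folds can also be consumed manually. The following is a rough sketch (not generated from the package docs), reusing the `cv`, `X`, and `y` objects defined above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Roughly what RandomizedSearchCV does internally for one parameter setting:
# fit on each training window, score on the corresponding forecast window.
fold_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = Ridge().fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    fold_scores.append(mean_squared_error(y.iloc[test_idx], preds))

print(f"{len(fold_scores)} folds, mean MSE: {np.mean(fold_scores):.4f}")
```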

## Methods

| Name | Description |
| --- | --- |
| [get_n_splits](#pytimetk.TimeSeriesCVSplitter.get_n_splits) | Returns the number of splits. |
| [split](#pytimetk.TimeSeriesCVSplitter.split) | Generates train and test indices for cross-validation. |

### get_n_splits { #pytimetk.TimeSeriesCVSplitter.get_n_splits }

`TimeSeriesCVSplitter.get_n_splits(X=None, y=None, groups=None)`

Returns the number of splits.
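
A quick illustrative check, assuming the `cv` object constructed in the examples above:

```python
# The count is derived from the time series and the train_size, forecast_horizon,
# gap, and stride settings passed to the constructor; X, y, and groups are ignored.
print(cv.get_n_splits())
```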

### split { #pytimetk.TimeSeriesCVSplitter.split }

`TimeSeriesCVSplitter.split(X=None, y=None, groups=None)`

Generates train and test indices for cross-validation.

#### Parameters:

X:
Optional input features (ignored, for compatibility with scikit-learn).
y:
Optional target variable (ignored, for compatibility with scikit-learn).
groups:
Optional group labels (ignored, for compatibility with scikit-learn).

#### Yields:

Tuple[np.ndarray, np.ndarray]:
Tuples of train and test indices.
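
For illustration (again reusing the `cv`, `X`, and `y` objects from the examples above), each yielded pair is a tuple of integer index arrays that can slice the original data:

```python
# Peek at the first fold: both elements are NumPy integer arrays indexing rows of X/y.
train_idx, test_idx = next(cv.split(X, y))
print(len(train_idx), "training rows;", len(test_idx), "forecast rows")
```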
1 change: 1 addition & 0 deletions docs/reference/index.qmd
@@ -78,6 +78,7 @@ Time series cross validation.
| | |
| --- | --- |
| [TimeSeriesCV](TimeSeriesCV.qmd#pytimetk.TimeSeriesCV) | `TimeSeriesCV` is a subclass of `TimeBasedSplit` with default mode set to 'backward' |
| [TimeSeriesCVSplitter](TimeSeriesCVSplitter.qmd#pytimetk.TimeSeriesCVSplitter) | The `TimeSeriesCVSplitter` is a scikit-learn compatible cross-validator using `TimeSeriesCV`. |

## 💹 Finance Module (Momentum Indicators)

88 changes: 14 additions & 74 deletions src/pytimetk/crossvalidation/time_series_cv.py
@@ -475,6 +475,8 @@ class TimeSeriesCVSplitter(BaseCrossValidator):
Start date for the time period.
end_dt: pd.Timestamp
End date for the time period.
split_limit: int
Maximum number of splits to generate. If None, all possible splits will be generated.
Raises:
-------
@@ -510,8 +512,8 @@ class TimeSeriesCVSplitter(BaseCrossValidator):
cv = TimeSeriesCVSplitter(
time_series=time_series,
frequency="days",
train_size=7,
forecast_horizon=11,
train_size=14,
forecast_horizon=7,
gap=0,
stride=1,
window="rolling",
@@ -520,7 +522,12 @@ class TimeSeriesCVSplitter(BaseCrossValidator):
cv
```
``` python
``` {python}
# Inspect the cross-validation splits
cv.splitter.plot(y, time_series = time_series)
```
``` {python}
# Using the TimeSeriesCVSplitter in a scikit-learn CV model
from sklearn.linear_model import Ridge
@@ -557,6 +564,7 @@ def __init__(
mode: str = "backward",
start_dt: pd.Timestamp = None,
end_dt: pd.Timestamp = None,
split_limit: int = None,
):
self.splitter = TimeSeriesCV(
frequency=frequency,
@@ -566,6 +574,7 @@ def __init__(
stride=stride,
window=window,
mode=mode,
split_limit=split_limit
)
self.time_series_ = time_series
self.start_dt_ = start_dt
@@ -591,6 +600,8 @@ def split(
Optional group labels (ignored, for compatibility with scikit-learn).
Yields:
-------
Tuple[np.ndarray, np.ndarray]:
Tuples of train and test indices.
"""
self._validate_split_args(self.size_, X, y, groups)
@@ -639,76 +650,5 @@ def _validate_split_args(









# class TimeSeriesCV:
# """Generates tuples of train_idx, test_idx pairs
# Assumes the MultiIndex contains levels 'symbol' and 'date'
# purges overlapping outcomes. Includes a shift for each test set."""

# def __init__(
# self,
# n_splits=3,
# train_period_length=126,
# test_period_length=21,
# lookahead=None,
# shift_length=0, # New parameter to specify the shift length
# date_idx='date',
# shuffle=False,
# seed=None,
# ):
# self.n_splits = n_splits
# self.lookahead = lookahead
# self.test_length = test_period_length
# self.train_length = train_period_length
# self.shift_length = shift_length # Store the shift length
# self.shuffle = shuffle
# self.seed = seed
# self.date_idx = date_idx

# def split(self, X, y=None, groups=None):
# unique_dates = X.index.get_level_values(self.date_idx).unique()
# days = sorted(unique_dates, reverse=True)

# splits = []
# for i in range(self.n_splits):
# # Adjust the end index for the test set to include the shift for subsequent splits
# test_end_idx = i * self.test_length + i * self.shift_length
# test_start_idx = test_end_idx + self.test_length
# train_end_idx = test_start_idx + self.lookahead - 1
# train_start_idx = train_end_idx + self.train_length + self.lookahead - 1

# if train_start_idx >= len(days):
# break # Break if the start index goes beyond the available data

# dates = X.reset_index()[[self.date_idx]]
# train_idx = dates[(dates[self.date_idx] > days[min(train_start_idx, len(days)-1)])
# & (dates[self.date_idx] <= days[min(train_end_idx, len(days)-1)])].index
# test_idx = dates[(dates[self.date_idx] > days[min(test_start_idx, len(days)-1)])
# & (dates[self.date_idx] <= days[min(test_end_idx, len(days)-1)])].index

# if self.shuffle:
# if self.seed is not None:
# np.random.seed(self.seed)

# train_idx_list = list(train_idx)
# np.random.shuffle(train_idx_list)
# train_idx = np.array(train_idx_list)
# else:
# train_idx = train_idx.to_numpy()

# test_idx = test_idx.to_numpy()

# splits.append((train_idx, test_idx))

# return splits

# def get_n_splits(self, X=None, y=None, groups=None):
# """Adjusts the number of splits if there's not enough data for the desired configuration."""
# return self.n_splits

