Overhaul augment_rolling to streamline window function handling #89

alexriggio · 2023-10-05T06:29:27Z

Proposing an overhaul of the augment functions so they can handle every type of window function in one call.

Introduced window_func and window_func_with_iv to cleanly separate Series-based and DataFrame-based rolling functions.
Removed use_independent_variables argument, enhancing function flexibility.
Allows diverse window function types in one call, eliminating the need for multiple augment calls and DataFrame concatenations to get all desired augmented features.

Example of the type of call that would be possible:

rolled_df = (
    df.augment_rolling(
        date_column='date',
        value_column=['value1', 'value2', 'value3'],
        window=[2, 4],
        window_func=[
            'mean',
            'std',
            ('sample_std', lambda x: sample_std(x)),
            ('pop_std', lambda x: population_std(x)),
        ],
        window_func_with_iv=[
            ('corr', lambda x: x['value1'].corr(x['value2'])),
            ('regression', regression),
        ],
        min_periods=1,
        center=True,
    )
)
rolled_df

The problem addressed:

Functions relying on a series (a value column) cycle through each value column in a list.

Functions relying on a DataFrame (independent variables) get called repeatedly, and their output is duplicated in the DataFrame, if more than one value column is listed.

It also gets a little messy with lambda functions, depending on whether they operate on a Series or a DataFrame:
lambda x: func(x) versus lambda x: func(x[specific_col])
Passing multiple lambda functions of different forms can cause problems.

Users wanting a mix of window function types would need to run multiple augment calls and concatenate the resulting DataFrames to get one DataFrame with all desired features.
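To make the duplication concrete, here is a minimal sketch in plain pandas. It does not use the library's own code: `roll_over_frames`, `window_corr`, and the output column names are hypothetical, chosen only to mimic a loop that applies every function once per value column.

```python
import pandas as pd

df = pd.DataFrame({
    "value1": [1.0, 2.0, 4.0, 7.0, 11.0],
    "value2": [2.0, 1.0, 5.0, 6.0, 12.0],
})
value_columns = ["value1", "value2"]
window = 3


def window_corr(window_df):
    """DataFrame-based function: it picks its own columns and ignores the loop variable."""
    return window_df["value1"].corr(window_df["value2"])


def roll_over_frames(frame, func, window):
    """Hypothetical helper: call func on each full trailing window of rows (NaN until full)."""
    values = [
        func(frame.iloc[end - window:end]) if end >= window else float("nan")
        for end in range(1, len(frame) + 1)
    ]
    return pd.Series(values, index=frame.index)


features = {}
for col in value_columns:
    # Series-based: one distinct feature per value column.
    features[f"{col}_mean"] = df[col].rolling(window).mean()
    # DataFrame-based: the identical result is recomputed for every value column.
    features[f"{col}_corr"] = roll_over_frames(df, window_corr, window)

features = pd.DataFrame(features)
print(features)
```

The two `_corr` columns come out identical, which is the redundant work and duplicated output described above.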


mdancho84 · 2023-10-05T11:37:50Z

This is interesting. I wonder if it can be simplified even further by having all calculations run through the rolling_apply() function.

This way there would be no need for a second argument and we can just deprecate the use_independent_variables parameter.

mdancho84 · 2023-10-05T11:49:00Z

To add to this, there is a performance boost if we don't use apply()-style for simple functions like mean() and std().

So one thing I am thinking is running certain calls that are not Tuples through the rolling() function.

Tuples will contain the custom functions. They can be run through roll_apply().

What do you think? Let's discuss in Slack.
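A rough sketch of that dispatch idea in plain pandas, for discussion only. `rolling_features` and `BUILTIN_FUNCS` are hypothetical names, not the library's API: plain strings go through the built-in rolling methods, and (name, function) tuples go through `Rolling.apply`.

```python
import pandas as pd

# Assumption: only these built-in pandas Rolling methods are accepted as plain strings.
BUILTIN_FUNCS = {"mean", "std", "sum", "min", "max", "median", "var"}


def rolling_features(series, window, window_func, min_periods=None):
    """Hypothetical helper: fast path for strings, apply() path for (name, func) tuples."""
    roller = series.rolling(window=window, min_periods=min_periods)
    out = {}
    for func in window_func:
        if isinstance(func, tuple):
            name, fn = func
            # Custom function: slower apply() path; raw=True passes NumPy arrays.
            out[name] = roller.apply(fn, raw=True)
        elif func in BUILTIN_FUNCS:
            # Built-in name: call the optimized Rolling method directly.
            out[func] = getattr(roller, func)()
        else:
            raise ValueError(f"Unsupported window function: {func!r}")
    return pd.DataFrame(out)


s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
features = rolling_features(
    s,
    window=2,
    window_func=["mean", ("range", lambda x: x.max() - x.min())],
)
print(features)
```

Keeping the string names on the built-in path avoids the per-window Python call that `apply()` incurs, which is where the speedup for functions like mean and std would come from.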

mdancho84 · 2023-10-06T11:56:59Z

Sorry I'm taking a long time with this one.

I'm researching what packages like polars and pandas do to help solve this problem.

It looks like polars works by removing the value column entirely and just having expressions that use alias() to rename their output.

I'm considering adopting a similar framework to eliminate the need for 2 window functions and specification of a value column.

import polars as pl

# Sample data
data = {
    'Category': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'A': [1, 2, 3, 4, 5, 6],
    'B': [6, 5, 4, 3, 2, 1]
}
df = pl.DataFrame(data)

# Define a custom aggregation expression that computes the correlation of two columns
def correlation_agg(a: str, b: str) -> pl.Expr:
    return pl.corr(pl.col(a).cast(pl.Float64), pl.col(b).cast(pl.Float64))

# Calculate correlation within each group using group_by and agg
# (group_by is spelled groupby in polars releases before 0.19)
correlations = df.group_by("Category").agg(correlation_agg("A", "B").alias("correlation"))
print(correlations)

mdancho84 · 2023-10-06T12:00:10Z

Basically we would:

Use tuples to define the alias and the custom function.
The custom function would be passed the group as a data frame.
The custom function would determine which columns to work on.
The return would be a pandas Series (or list-like object).
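A minimal pandas sketch of how those four points might fit together. `augment_with_group_funcs` is a hypothetical name used only for illustration, and the sketch assumes the input has a unique index so the original row order can be restored.

```python
import pandas as pd


def augment_with_group_funcs(data, group_col, funcs):
    """Hypothetical helper: funcs is a list of (alias, function) tuples.

    Each function receives the whole group as a DataFrame, chooses its own
    columns, and returns a Series (or list-like) with one value per row.
    """
    pieces = []
    for _, group in data.groupby(group_col, sort=False):
        augmented = group.copy()
        for alias, fn in funcs:
            # A returned Series aligns on the index; a list is assigned by position.
            augmented[alias] = fn(group)
        pieces.append(augmented)
    # Assumes a unique index, so rows can be put back in their original order.
    return pd.concat(pieces).loc[data.index]


df = pd.DataFrame({
    "Category": ["X", "X", "Y", "Y", "Z", "Z"],
    "A": [1, 2, 3, 4, 5, 6],
    "B": [6, 5, 4, 3, 2, 1],
})

result = augment_with_group_funcs(
    df,
    group_col="Category",
    funcs=[
        ("a_minus_b", lambda g: g["A"] - g["B"]),
        ("a_rolling_mean", lambda g: g["A"].rolling(2, min_periods=1).mean()),
    ],
)
print(result)
```

With this shape there is no value column argument at all: the alias names the output, and each function decides which inputs it reads.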

mdancho84 · 2023-10-06T12:03:18Z

Alternatively we can split into 2 functions:

value column (simple series calculation like mean, std)
groups exposing all variables in the group - used for more complex rolling regressions

alexriggio · 2023-10-06T18:06:17Z

As per @mdancho84's suggestions in our Slack conversation:

augment_rolling now handles series-based functions.
augment_rolling_apply introduced for dataframe-based functions.
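For readers unfamiliar with the DataFrame-based side of the split, here is a minimal sketch of the idea in plain pandas. It is not the merged implementation: `rolling_apply_frame` and its parameters are hypothetical, and the only point is that each function receives the window as a DataFrame rather than a single Series.

```python
import numpy as np
import pandas as pd


def rolling_apply_frame(data, window, funcs, min_periods=None):
    """Hypothetical helper: apply (alias, func) tuples to trailing windows of rows.

    Each func receives the window as a DataFrame and returns a scalar.
    """
    min_periods = window if min_periods is None else min_periods
    results = {alias: [] for alias, _ in funcs}
    for end in range(1, len(data) + 1):
        window_df = data.iloc[max(0, end - window):end]
        for alias, fn in funcs:
            # Emit NaN until the window holds at least min_periods rows.
            value = fn(window_df) if len(window_df) >= min_periods else np.nan
            results[alias].append(value)
    new_cols = {alias: pd.Series(vals, index=data.index) for alias, vals in results.items()}
    return data.assign(**new_cols)


df = pd.DataFrame({
    "value1": [1.0, 2.0, 4.0, 7.0, 11.0],
    "value2": [2.0, 1.0, 5.0, 6.0, 12.0],
})

rolled = rolling_apply_frame(
    df,
    window=3,
    funcs=[("corr", lambda x: x["value1"].corr(x["value2"]))],
)

# Cross-check against pandas' built-in pairwise rolling correlation.
builtin = df["value1"].rolling(3).corr(df["value2"])
print(rolled)
```

Because the function is called once per window rather than once per value column, the duplicated output described in the original proposal does not arise.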

Overhaul augment_rolling in rolling.py to streamline window function handling

d696944

Refactor: Split into augment_rolling and augment_rolling_apply

652aad3

mdancho84 merged commit abf7db8 into business-science:master Oct 6, 2023
5 checks passed

alexriggio deleted the window_functions branch October 6, 2023 19:14
