Overhaul augment_rolling to streamline window function handling #89

Merged
merged 2 commits into from
Oct 6, 2023


alexriggio
Collaborator

Proposing an overhaul of the augment functions so that a single call can handle every type of window function.

  • Introduced window_func and window_func_with_iv to cleanly separate Series-based and DataFrame-based rolling functions.
  • Removed use_independent_variables argument, enhancing function flexibility.
  • Allows diverse window function types in one call, eliminating the need for multiple augment calls and dataframe concatenations to get all the desired augmented features.

Example of the type of call that would be possible:

# sample_std, population_std, and regression are assumed to be
# user-defined functions (not shown here).
rolled_df = (
    df.augment_rolling(
        date_column='date',
        value_column=['value1', 'value2', 'value3'],
        window=[2, 4],
        window_func=[
            'mean',
            'std',
            ('sample_std', lambda x: sample_std(x)),
            ('pop_std', lambda x: population_std(x)),
        ],
        window_func_with_iv=[
            ('corr', lambda x: x['value1'].corr(x['value2'])),
            ('regression', regression),
        ],
        min_periods=1,
        center=True,
    )
)
rolled_df

The problem addressed:

Functions that operate on a series (a single value column) are applied to each value column in the list in turn.

Functions that operate on a dataframe (independent variables) are called once per value column, so their output is duplicated in the resulting dataframe whenever more than one value column is listed.

Lambda functions also get messy, because their form depends on whether they operate on a series or a dataframe:
lambda x: func(x) versus lambda x: func(x[specific_col])
Passing multiple lambda functions of different forms in the same call can cause problems.

Users who want a mix of window function types have to make multiple augment calls and concatenate the resulting dataframes to get one dataframe with all the desired features.
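
To make the duplication concrete, here is a minimal sketch in plain pandas. It does not use the library's own code: roll_frame and window_corr are hypothetical helpers written only for this illustration, and the column names are borrowed from the example call above.

```python
import pandas as pd

df = pd.DataFrame({
    "value1": [1.0, 2.0, 4.0, 7.0],
    "value2": [2.0, 1.0, 3.0, 8.0],
})
window = 3

# Series-based: the function sees one value column at a time, so looping
# over the value columns yields one distinct feature per column.
series_features = {
    f"{col}_rolling_mean": df[col].rolling(window).mean()
    for col in ["value1", "value2"]
}

# DataFrame-based: the function sees every column in the window and picks
# its own inputs, so it does not depend on the "current" value column.
def window_corr(frame: pd.DataFrame) -> float:
    return frame["value1"].corr(frame["value2"])

def roll_frame(data: pd.DataFrame, func, window: int) -> pd.Series:
    """Hypothetical helper: apply func to each full window of rows."""
    out = [float("nan")] * (window - 1)
    for end in range(window, len(data) + 1):
        out.append(func(data.iloc[end - window:end]))
    return pd.Series(out, index=data.index)

# Looping over the value columns here only repeats the same computation.
per_column = {
    col: roll_frame(df, window_corr, window) for col in ["value1", "value2"]
}
print(per_column["value1"].equals(per_column["value2"]))
```

The two entries in per_column are identical, which is the duplicated output described above.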

@mdancho84
Contributor

This is interesting. I wonder if it can be simplified even further by having all calculations run through the rolling_apply() function.

This way there would be no need for a second argument and we can just deprecate the use_independent_variables parameter.

@mdancho84
Contributor

mdancho84 commented Oct 5, 2023

To add to this, there is a performance boost if we don't use apply()-style for simple functions like mean() and std().

So one idea is to run the calls that are not tuples (plain names like 'mean') through the rolling() function.

Tuples will contain the custom functions, which can be run through roll_apply().

What do you think? Let's discuss in Slack.
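
A rough sketch of that dispatch, using only pandas. The function name rolling_features and the output column naming are assumptions made for this illustration, not the library's implementation: plain strings go to the built-in rolling methods, and tuples go through apply().

```python
import pandas as pd

def rolling_features(series: pd.Series, window: int, window_func: list) -> pd.DataFrame:
    """Hypothetical dispatcher: strings use built-ins, tuples use apply()."""
    roller = series.rolling(window)
    out = {}
    for func in window_func:
        if isinstance(func, tuple):
            # (alias, callable): custom function, run through apply().
            # raw=True passes each window as a NumPy array.
            alias, fn = func
            out[f"{series.name}_{alias}"] = roller.apply(fn, raw=True)
        else:
            # Plain name such as 'mean' or 'std': use the built-in
            # rolling method and skip the per-window Python call.
            out[f"{series.name}_{func}"] = getattr(roller, func)()
    return pd.DataFrame(out)

s = pd.Series([1.0, 3.0, 2.0, 6.0], name="value")
feats = rolling_features(
    s,
    window=2,
    window_func=["mean", ("range", lambda x: x.max() - x.min())],
)
print(feats)
```

Keeping the built-in path for simple names preserves the speed advantage mentioned above, while tuples remain the escape hatch for arbitrary functions.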

@mdancho84
Contributor

Sorry I'm taking a long time with this one.

I'm researching what packages like polars and pandas do to help solve this problem.

It looks like polars drops the value column argument entirely and relies on expressions, with alias() used to name the output.

I'm considering adopting a similar framework to eliminate the need for 2 window functions and specification of a value column.

import polars as pl

# Sample data
data = {
    'Category': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'A': [1, 2, 3, 4, 5, 6],
    'B': [6, 5, 4, 3, 2, 1]
}
df = pl.DataFrame(data)

# Build an expression for the correlation between two columns
def correlation_agg(a: str, b: str) -> pl.Expr:
    return pl.corr(pl.col(a).cast(pl.Float64), pl.col(b).cast(pl.Float64))

# Calculate the correlation within each group using group_by and agg;
# alias() names the output column.
# (group_by is the current spelling; older polars releases used groupby.)
correlations = df.group_by("Category").agg(correlation_agg("A", "B").alias("correlation"))
print(correlations)

@mdancho84
Contributor

Basically we would:

  • Use tuples to define the alias and the custom function.
  • Pass each group to the custom function as a data frame.
  • Let the custom function determine which columns to work on.
  • Return a pandas Series (or list-like object).
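
The steps above could look roughly like this in plain pandas. This is a sketch under assumptions: rolling_apply_frame, the regression helper, and the alias_key column naming are all hypothetical and only illustrate the (alias, function) idea.

```python
import numpy as np
import pandas as pd

def regression(frame: pd.DataFrame) -> pd.Series:
    # The custom function decides which columns of the window to use.
    slope, intercept = np.polyfit(frame["x"], frame["y"], deg=1)
    return pd.Series({"slope": slope, "intercept": intercept})

def rolling_apply_frame(data: pd.DataFrame, window: int, window_func: list) -> pd.DataFrame:
    """Hypothetical helper: pass each full window to every (alias, fn) tuple."""
    rows = []
    for end in range(window, len(data) + 1):
        frame = data.iloc[end - window:end]
        row = {}
        for alias, fn in window_func:
            # fn returns a Series; each entry becomes a column
            # prefixed with the alias.
            for key, value in fn(frame).items():
                row[f"{alias}_{key}"] = value
        rows.append(row)
    result = pd.DataFrame(rows, index=data.index[window - 1:])
    # Rows without a full window are left as NaN.
    return result.reindex(data.index)

data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
data["y"] = 2.0 * data["x"] + 1.0

features = rolling_apply_frame(data, window=3, window_func=[("regression", regression)])
print(features)
```

Because y is an exact linear function of x here, every full window should recover a slope near 2 and an intercept near 1.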

@mdancho84
Contributor

Alternatively we can split into 2 functions:

  • value column (simple series calculation like mean, std)
  • groups exposing all variables in the group - used for more complex rolling regressions

@mdancho84 mdancho84 merged commit abf7db8 into business-science:master Oct 6, 2023
5 checks passed
@alexriggio
Collaborator Author

As per @mdancho84's suggestions in our Slack conversation:

  • augment_rolling now handles series-based functions.
  • augment_rolling_apply introduced for dataframe-based functions.

@alexriggio alexriggio deleted the window_functions branch October 6, 2023 19:14