dependency of interpolate function on resampling #222

Open
smvazirizade opened this issue Jun 25, 2022 · 4 comments

smvazirizade commented Jun 25, 2022

This issue concerns the interpolate function. It seems that the arguments freq and func must always be provided, either as arguments to the interpolate function or through the resampling function.

Suppose I have time series data with unequal time steps. For some reason, some of the samples are None, and I want to replace them by linear interpolation without adding or removing any rows (please see the figure below). However, the dependency of the interpolate function on resampling makes this virtually impossible.
Please let me know if there is a solution for this. If not, it would be great if this feature could be implemented.

image

It also causes other problems. Suppose there is a string column that is used for neither partitioning nor interpolation. The interpolate function drops such columns, and because resampling is enforced, I have no way to merge them back in afterwards.

PS: I believe string columns should be allowed with ffill and bfill.

tnixon (Contributor) commented Jul 5, 2022

I think you make a good point, @smvazirizade. We should probably have a cleaner separation between interpolation for filling in missing values and interpolation for the purposes of upsampling.

@guanjieshen - thoughts?

tnixon added the bug and enhancement labels on Jul 5, 2022

guanjieshen (Contributor) commented Jul 13, 2022

Thanks for bringing this up, @smvazirizade. In your example, would you expect the null value to be filled in with 6 or with 6.142857143?
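To make the two candidate answers concrete, here is a small plain-Python sketch. The sample timestamps and values are hypothetical (the figure from the issue is not reproduced in the text); they were chosen only so that the two interpretations give 6 and 6.142857143 respectively.

```python
# Hypothetical samples: (timestamp in minutes, value). None marks the missing reading.
samples = [(0, 5.0), (4, None), (7, 7.0)]

(t_prev, v_prev), (t_missing, _), (t_next, v_next) = samples

# Interpretation 1: plain average of the two neighbors, ignoring the timestamps.
neighbor_mean = (v_prev + v_next) / 2

# Interpretation 2: linear interpolation weighted by the elapsed time.
time_weighted = v_prev + (v_next - v_prev) * (t_missing - t_prev) / (t_next - t_prev)

print(neighbor_mean)            # 6.0
print(round(time_weighted, 9))  # 6.142857143
```

The two results coincide only when the missing sample sits exactly halfway between its neighbors in time, which is why the distinction matters for series with unequal time steps.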

If it's the latter (6.142857143), then as a temporary workaround you can use the current interpolate function along the following lines. Note that this assumes freq is set to the smallest interval size present in the series:

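from pyspark.sql import functions as F

# Resample at the smallest interval in the series and interpolate linearly.
interpolated_df = input_tsdf.interpolate(
    partition_cols=["signal"],
    ts_col="timestamp",
    freq="1 minute",
    func="mean",
    method="linear",
    target_cols=["value"],
    show_interpolated=True,
).df

# is_ts_interpolated flags rows whose timestamp was generated by the
# interpolation; drop those so that only the original timestamps remain.
df = interpolated_df.filter(~F.col("is_ts_interpolated"))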

This uses the supplemental boolean columns to exclude any new timestamps that were generated by the interpolation.

If the intention is for it to be 6, then the timestamp does not factor into the calculation of the new value. In that case, it can be calculated using a PySpark window function together with the lag and lead operations.

@tnixon Good callout on the distinction between the upsampling and the missing-value use cases. I think this is something we can incorporate into the API for an upcoming release.

The ability to perform ffill or bfill on STRING columns is something we can also add support for fairly easily.

smvazirizade (Author) commented

Thank you for the response and attention to this issue.
I am looking for 6.142857143.
However, please keep in mind that this was just an example to explain the issue. In reality, my time intervals are not uniform. Furthermore, this workaround is not optimal, since it computes extra values that I then discard at the end.
Thank you

smvazirizade (Author) commented

Hi @tnixon and @guanjieshen ,
I was wondering if there are any updates on this.
Thank you
