Is your feature request related to a problem? Please describe.
It would be nice to be able to express the desire for the new data to "look like" the old data (in terms of distribution).
Describe the solution you'd like
Since spark expectations already collects summary stats, a good start could be adding validation rules that allow tolerances on the difference between today's summary and the previous one.
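As a rough sketch of what such a tolerance rule might check, here is a plain-Python comparison of one run's summary stats against the previous run's. The stat names and the 10% threshold are illustrative assumptions, not anything spark-expectations defines today:

```python
# Illustrative sketch only: flag any summary stat whose relative change
# versus the previous run exceeds a tolerance. Stat names and the 10%
# default threshold are hypothetical examples.

def drifted_stats(current: dict, previous: dict, rel_tol: float = 0.10) -> list:
    """Return names of stats whose relative change exceeds rel_tol."""
    drifted = []
    for name, prev_value in previous.items():
        curr_value = current.get(name)
        if curr_value is None:
            drifted.append(name)  # stat disappeared entirely
        elif prev_value == 0:
            if curr_value != 0:
                drifted.append(name)
        elif abs(curr_value - prev_value) / abs(prev_value) > rel_tol:
            drifted.append(name)
    return drifted

# Example: row count dropped ~30% while the null rate held steady,
# so only the row count trips the 10% tolerance.
previous = {"input_count": 1_000_000, "null_pct": 0.02}
current = {"input_count": 700_000, "null_pct": 0.021}
print(drifted_stats(current, previous))  # → ['input_count']
```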
Describe alternatives you've considered
I suppose we could write a query rule where folks just manually write the SQL query.
@holdenk This is a good idea. We could add another rule_type stats_dq which would complement our existing rule types row_dq, agg_dq and query_dq.
By default, we can offer a view derived from the stats_df we generate. Currently, users specify a stats_table. We can propose a new stats_table_view constructed from the job's stats_df. Additionally, we can read from the current stats_table to establish a stats_table_existing_view.
Leveraging these two views, users can craft queries tailored to their validation needs. For added convenience, we'll include standard query examples in our documentation.
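For example, a user-crafted query against the two proposed views might look like the following. The view names come from the comment above, but the column names (`input_count`, `meta_dq_run_date`) are illustrative assumptions about the stats_df schema:

```python
# Hypothetical query against the proposed stats_table_view (current run)
# and stats_table_existing_view (prior runs). Column names are assumed
# for illustration; the real stats_df schema may differ.
tolerance_query = """
SELECT cur.input_count, prev.input_count AS previous_input_count
FROM stats_table_view cur
CROSS JOIN (
    SELECT input_count
    FROM stats_table_existing_view
    ORDER BY meta_dq_run_date DESC
    LIMIT 1
) prev
WHERE ABS(cur.input_count - prev.input_count) / prev.input_count > 0.10
"""
# A rule built on this query could fail the run whenever it returns rows,
# i.e. whenever today's row count drifts more than 10% from the last run's.
```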
@holdenk @asingamaneni Great feature.
I think at some point we need to provide an interface for users to implement custom rule types that can be integrated with spark-expectations.
Additional context
TFDV goes above and beyond with its historic views -- https://www.tensorflow.org/tfx/data_validation/get_started