Add ability to perform accuracy checks / repair sink data #843
@Ian2012 @saraburns1 @pomegranited I'd love your thoughts and feedback on this and the accompanying aspects-dbt ticket. I think it's the next major chunk of work we can do while waiting on feedback from v1.
Sounds cool! I think the sink repair and outage reports should be follow-ups, since getting the connections set up and doing all the validations should be the first priority. They could either be a separate epic or just fast-follow tickets. I also wanted to bring up a probably rare case: we should think about what happens if MySQL is missing data that exists in ClickHouse (for example, if a MySQL transaction failed after the data was already sent downstream). Would we delete the ClickHouse data, since MySQL is the source of truth? Could there be mismatched data from a failed update (in which case ClickHouse might be more accurate)?
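A minimal sketch of how an operator might detect that case, assuming ClickHouse's built-in `mysql()` table function is used for the connection; the table and column names here are placeholders, not the actual Aspects schema:

```sql
-- Rows present in ClickHouse but missing from MySQL (placeholder names).
-- The time filter skips rows that may still be in flight.
SELECT ch.id
FROM event_sink.course_enrollment AS ch
LEFT ANTI JOIN mysql(
    'mysql-host:3306', 'edxapp', 'student_courseenrollment',
    'report_user', 'password'
) AS src ON ch.id = src.id
WHERE ch.time_last_dumped < now() - INTERVAL 30 MINUTE
```

Whether those orphaned rows should then be deleted from ClickHouse is exactly the policy question raised above.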
Another point to add is data bootstrapping: suppose I have an existing Open edX instance with lots of users / courses / enrollments, and I add Aspects to the deployment. The sinks would catch any newly triggered events, so they would be partially populated, but they would need to be repaired to receive the old data.
@pomegranited right! It would be able to either run or generate the correct management commands to backfill sinks where needed. One nit we should keep in mind is data that is older than the TTL, which we shouldn't alert or backfill on.
Building on the work in openedx/aspects-dbt#102, I think we should add some checks that an operator can run to see whether Aspects matches the source-of-truth tables for important data. This would require connecting ClickHouse to MySQL and running some potentially large queries, but the checks would be expected to run either manually or during low-traffic times to avoid causing performance issues.
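As a rough illustration of the kind of query involved, assuming ClickHouse's `mysql()` table function provides the connection and using placeholder table names:

```sql
-- Compare row counts between a ClickHouse sink table and its MySQL
-- source-of-truth table (all names are placeholders).
SELECT
    (SELECT count() FROM event_sink.user_profile) AS clickhouse_rows,
    (SELECT count()
     FROM mysql('mysql-host:3306', 'edxapp', 'auth_userprofile',
                'report_user', 'password')) AS mysql_rows
```

Even a simple count comparison like this reads every row on the MySQL side, which is why running it during low-traffic windows matters.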
Some high-level thoughts:

- Operators should be able to run `dbt test` at any point to check the data in ClickHouse, if the tests in the above issue are written in a performant way.
- My experience in the past with OLAP databases and `dbt test` makes me think we do want some kind of "fudge factor" for the fact that data is constantly flowing in: either only checking against data more than X minutes old, or allowing for some amount of slop around the acceptable test conditions (see the sketch after this list).

This ticket is just to start the conversation, but once we decide on a list of features it can be converted into an epic with a number of sub-tasks.
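One way the fudge factor could look as a singular dbt test — a sketch only, assuming a ClickHouse dbt profile, the `mysql()` table function for the source connection, and placeholder source names, columns, and thresholds:

```sql
-- tests/assert_enrollment_counts_match.sql (illustrative file name)
-- Fails if the ClickHouse and MySQL enrollment counts diverge by more
-- than 1%, ignoring rows dumped in the last 30 minutes so that
-- in-flight data doesn't trigger false alarms.
WITH ch AS (
    SELECT count() AS n
    FROM {{ source('event_sink', 'course_enrollment') }}
    WHERE time_last_dumped < now() - INTERVAL 30 MINUTE
),
src AS (
    SELECT count() AS n
    FROM mysql('mysql-host:3306', 'edxapp', 'student_courseenrollment',
               'report_user', 'password')
)
SELECT ch.n AS clickhouse_rows, src.n AS mysql_rows
FROM ch, src
WHERE abs(toInt64(ch.n) - toInt64(src.n)) > greatest(toInt64(src.n), 1) * 0.01
```

A dbt singular test passes when it returns zero rows, so the tolerance lives entirely in the final `WHERE` clause and is easy to tune per table.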