Once a month dag to fully replace signals #221
base: production
A little curious what will happen on the 1st of May but let's see 🤞
dag_id=f"atd_knack_signals_full_replace",
description="Load a full replace of signals (view_197) records from Knack to Postgrest to AGOL and Socrata",
default_args=DEFAULT_ARGS,
schedule_interval="28 2 1 * *" if DEPLOYMENT_ENVIRONMENT == "production" else None,
@chiaberry One thing that crossed my mind when you mentioned figuring out how to pause the other DAG is the max_active_runs setting for a DAG. I'm a little wary of having multiple DAGs for different versions of the same process, and I'm wondering if there is some combination of a schedule (the complexity here might be a stretch for cron expressions, but idk), branching logic or something to toggle the full replace arg, and the max runs setting to make sure that no more than one DAG run occurs at a time, letting the full replace play out before the next run starts.
I know that you've put time into this already, so maybe it would be a time sink, and we can always think about it more while we test this out.
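One way the single-DAG idea could look, as a minimal sketch rather than a tested design: a helper picks the script arguments from the run's logical date, so one scheduled slot per month omits the date filter and every other run stays incremental. The names `is_full_replace_run` and `build_script_args`, the `--date` flag, and the 02:25 slot are all assumptions for illustration. In the DAG itself, `max_active_runs=1` would keep the next 5-minute run from starting until the full replace finishes.

```python
from datetime import datetime
from typing import List, Optional

# Hypothetical slot: the 5-minute run scheduled at 02:25 on the 1st of the month.
FULL_REPLACE_DAY = 1
FULL_REPLACE_HOUR = 2
FULL_REPLACE_MINUTE = 25


def is_full_replace_run(logical_date: datetime) -> bool:
    """Return True only for the one monthly run that should replace everything."""
    return (
        logical_date.day == FULL_REPLACE_DAY
        and logical_date.hour == FULL_REPLACE_HOUR
        and logical_date.minute == FULL_REPLACE_MINUTE
    )


def build_script_args(
    logical_date: datetime, prev_success: Optional[datetime]
) -> List[str]:
    """Build CLI args; omitting the (assumed) --date flag means a full replace."""
    if is_full_replace_run(logical_date) or prev_success is None:
        return []
    # Incremental run: only process records modified since the last success.
    return ["--date", prev_success.isoformat()]
```

With this shape, the existing every-5-minutes schedule would stay as it is, and the monthly full replace becomes one branch of the same DAG instead of a second DAG that has to be coordinated with the first.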
i am also wary of running multiple instances of this ETL at the same time, and agree with Mike that it would be a good idea to handle this with DAG code.
one other wrinkle, iirc, is that the effect of a full replace is that the modified date of every record in the postgres db is going to be updated to the current date, and so every single time the DAG runs for the rest of the day it is going to process all records, because the postgres > socrata ETL queries by date, not timestamp.
i don't know, you might want to altogether abandon running a full replace on this dataset. the current ETL is not optimized to deal with this situation.
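The date-versus-timestamp wrinkle can be illustrated with a small sketch. The record structure, the sample dates, and the two filter functions are hypothetical, not the ETL's actual query; they only show why a filter truncated to the date keeps matching every record touched by a full replace for the rest of that day.

```python
from datetime import datetime
from typing import Dict, List


def modified_since_date(records: List[Dict], last_run: datetime) -> List[Dict]:
    """Date-granularity filter: compares calendar dates only."""
    return [r for r in records if r["modified"].date() >= last_run.date()]


def modified_since_timestamp(records: List[Dict], last_run: datetime) -> List[Dict]:
    """Timestamp-granularity filter: compares full date and time."""
    return [r for r in records if r["modified"] >= last_run]


# A full replace at 02:28 stamps every record with the same modified time.
full_replace_time = datetime(2024, 5, 1, 2, 28)
records = [{"id": i, "modified": full_replace_time} for i in range(1000)]

# One record is genuinely edited later that afternoon.
records[0] = {"id": 0, "modified": datetime(2024, 5, 1, 14, 3)}

# A run whose previous success was at 14:00 on the same day.
last_run = datetime(2024, 5, 1, 14, 0)
by_date = modified_since_date(records, last_run)
by_timestamp = modified_since_timestamp(records, last_run)

print(len(by_date), len(by_timestamp))  # prints "1000 1"
```

Under these assumptions, the date-based filter reprocesses all 1000 records on every run until midnight, while a timestamp-based filter would pick up only the one real change.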
Associated issues
cityofaustin/atd-data-tech#15525
Our Knack datasets do a full replace on the first day of the month, but the Knack signals DAG runs every 5 minutes. So every 5 minutes on the first of the month it would attempt a full replace, which takes more than 5 minutes to finish, causing failures all day.
I turned off the monthly full replace (#205), so this PR is to bring back a full replace once a month.
Associated repo
This is equivalent to the existing knack_signals dag, but it is scheduled to run only once, on the first of the month, with no date flag, so it's always a full replace.
I am not even sure we need to do full replaces once a month. The other signals dag runs every 5 minutes, and I am still trying to see how to pause that one while this one runs.
Ship list