Once a month dag to fully replace signals #221

Open
wants to merge 1 commit into
base: production
from

Conversation

chiaberry
Member

@chiaberry chiaberry commented Apr 18, 2024

Associated issues

cityofaustin/atd-data-tech#15525

Our Knack datasets do a full replace on the first day of the month, but the Knack signals DAG runs every 5 minutes. So every 5 minutes on the first of the month it would try to do a full replace, which takes more than 5 minutes. The result was failures all day.

I turned off the monthly full replace (#205), so this PR is to bring back a full replace once a month.
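
To put a rough number on "failures all day", here is a small sketch that counts the 5-minute slots on the first of the month. It assumes the signals DAG fires on exact 5-minute boundaries (the actual cron expression is not shown in this PR), and `runs_on_first` is a hypothetical helper, not code from the repo.

```python
from datetime import datetime, timedelta


def runs_on_first(year: int, month: int, every: timedelta = timedelta(minutes=5)) -> int:
    """Count scheduled runs that fall on the 1st of the given month."""
    current = datetime(year, month, 1)
    count = 0
    while current.day == 1:
        count += 1
        current += every
    return count


print(runs_on_first(2024, 5))  # 288 runs, each one attempting a full replace
```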

Associated repo

This is equivalent to the existing knack_signals DAG, but it is scheduled to run only once, on the first of the month, with no date flag, so it's always a full replace.

I am not even sure if we need to do full replaces once a month. The other signals DAG runs every 5 minutes, and I'm still trying to see how to pause that one while this one runs.


Ship list

  • Code reviewed
  • Product manager approved
  • Add note to 1PW secrets moved to API vault and check for duplicates

Collaborator

@Charlie-Henry Charlie-Henry left a comment


A little curious what will happen on the 1st of May but let's see 🤞

dag_id=f"atd_knack_signals_full_replace",
description="Load a full replace of signals (view_197) records from Knack to Postgrest to AGOL and Socrata",
default_args=DEFAULT_ARGS,
schedule_interval="28 2 1 * *" if DEPLOYMENT_ENVIRONMENT == "production" else None,
Collaborator

@mddilley mddilley Apr 22, 2024


@chiaberry One thing that crossed my mind when you mentioned figuring out how to pause the other DAG is the max_active_runs setting for a DAG. I'm a little wary of having multiple DAGs for different versions of the same process. I'm wondering if some combination would work: a schedule (the complexity here might be a stretch for cron expressions, but idk), branching logic or something similar to toggle the full replace arg, and the max runs setting to make sure that no more than one DAG run occurs at a time, so the full replace can play out before the next run starts.

I know that you've put time into this already, so maybe it would be a time sink, and we can always think about it more while we test this out.
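
A minimal sketch of what that single-DAG idea might look like, kept free of Airflow imports so it runs on its own. Everything named here is an assumption for illustration: the `--date` flag name, the 5-minute lookback, the `*/5 * * * *` cron, and the choice of the 02:30 slot on the 1st (the 5-minute slot closest to the `28 2 1 * *` schedule in this PR). `max_active_runs`, `schedule_interval`, and `catchup` are real Airflow DAG arguments, but whether this combination behaves well for this ETL would need testing.

```python
from datetime import datetime, timedelta

# Hypothetical flag name; the PR only says that a "date flag" exists.
DATE_FLAG = "--date"

# Assumed slot for the monthly full replace: (day of month, hour, minute).
FULL_REPLACE_SLOT = (1, 2, 30)


def is_full_replace_run(logical_date: datetime) -> bool:
    """True only for the one run per month that should do a full replace."""
    slot = (logical_date.day, logical_date.hour, logical_date.minute)
    return slot == FULL_REPLACE_SLOT


def build_script_args(logical_date: datetime, lookback: timedelta = timedelta(minutes=5)) -> list:
    """Omit the date flag for a full replace; otherwise pass a lookback date."""
    if is_full_replace_run(logical_date):
        return []
    since = logical_date - lookback
    return [DATE_FLAG, since.isoformat()]


# Keyword arguments that could be passed to Airflow's DAG(...).
DAG_KWARGS = {
    "schedule_interval": "*/5 * * * *",  # assumed cron for "every 5 minutes"
    "max_active_runs": 1,  # at most one active run of this DAG at a time
    "catchup": False,
}

print(build_script_args(datetime(2024, 5, 1, 2, 30)))  # []
print(build_script_args(datetime(2024, 5, 1, 2, 35)))  # ['--date', '2024-05-01T02:30:00']
```

The point of the design is that the full replace becomes one branch of the regular DAG rather than a second DAG, and `max_active_runs=1` keeps the next 5-minute run from starting on top of it.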

Member

@johnclary johnclary Apr 29, 2024


i am also wary of running multiple instances of this ETL at the same time, and agree with Mike that it would be a good idea to handle this with DAG code.

one other wrinkle, iirc, is that the effect of a full replace is that the modified date of every record in the postgres db is going to be updated to the current date, and so every single time the DAG runs for the rest of the day it is going to process all records, because the postgres > socrata ETL queries by date, not timestamp.

i don't know, you might want to abandon running a full replace on this dataset altogether. the current ETL is not optimized to deal with this situation.
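
The date-vs-timestamp wrinkle can be illustrated with a toy example. This is only a sketch: the record shape, the count of 1000 records, and both filter functions are made up for illustration and are not the actual postgres > socrata query.

```python
from datetime import datetime


def modified_on_or_after_date(records, since):
    """Date-only filter: ignores the time of day."""
    return [r for r in records if r["modified"].date() >= since.date()]


def modified_since_timestamp(records, since):
    """Timestamp filter: compares the full date and time."""
    return [r for r in records if r["modified"] >= since]


# A full replace at 02:28 stamps every record with that modified time.
full_replace_at = datetime(2024, 5, 1, 2, 28)
records = [{"id": i, "modified": full_replace_at} for i in range(1000)]

# One record is genuinely edited later that afternoon.
records.append({"id": 1000, "modified": datetime(2024, 5, 1, 13, 57)})

last_run = datetime(2024, 5, 1, 13, 55)
print(len(modified_on_or_after_date(records, last_run)))  # 1001
print(len(modified_since_timestamp(records, last_run)))   # 1
```

With the date-only filter, every run for the rest of the day picks up all the records touched by the full replace; with a timestamp filter, only the record that actually changed since the last run comes back.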


4 participants