Query Redshift to find missing and unrecovered data from the past 30 days and query ShopperTrak for it #2
Conversation
I've got a few questions about approach/structure, but the code looks solid -- no feedback on that. As far as I can tell there are no blockers; I just wanted to check a few things before giving formal approval.
lib/pipeline_controller.py
@@ -36,6 +60,14 @@ def __init__(self):

    def run(self):
        """Main method for the class -- runs the pipeline"""
        self.logger.info("Getting regular branch hours from Redshift")
        location_hours_dict = self.get_location_hours_dict()
        self.shoppertrak_api_client = ShopperTrakApiClient(
What is the advantage of moving this out of the constructor? The constructor feels like a more natural fit to me.
I moved it out because the location_hours_dict has to be set first. I've since changed it so that the client is created in the constructor with an empty location_hours_dict, which is then set further down in run().
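A minimal sketch of the pattern described above: the client is built in the constructor with an empty hours dict, which run() replaces once the real hours have been fetched. All class and method names here follow the diff, but the bodies are assumptions for illustration.

```python
class ShopperTrakApiClient:
    """Stand-in for the real API client, which needs location hours."""

    def __init__(self, location_hours_dict):
        self.location_hours_dict = location_hours_dict


class PipelineController:
    def __init__(self):
        # Client exists from construction time, with empty hours for now
        self.shoppertrak_api_client = ShopperTrakApiClient({})

    def get_location_hours_dict(self):
        # Stand-in for the real Redshift query
        return {"site_1": ("09:00", "17:00")}

    def run(self):
        # The real hours are only available here, so they are set late
        hours = self.get_location_hours_dict()
        self.shoppertrak_api_client.location_hours_dict = hours


controller = PipelineController()
controller.run()
print(controller.shoppertrak_api_client.location_hours_dict)
```

This keeps the client's lifetime tied to the controller's while deferring only the data it depends on.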
create_table_query = build_redshift_create_table_query(
    self.redshift_visits_table, start_date, end_date
)
self.redshift_client.connect()
self.redshift_client.execute_transaction([(create_table_query, None)])
recoverable_site_dates = self.redshift_client.execute_query(
    REDSHIFT_RECOVERABLE_QUERY
)
known_data = self.redshift_client.execute_query(
    build_redshift_known_query(self.redshift_visits_table)
)
self.redshift_client.execute_transaction([(REDSHIFT_DROP_QUERY, None)])
What is the advantage of creating the temporary table here? It seems like it's being joined back to the source table, so could this be done with a single query?
Yeah, I don't love it and went back and forth -- I'd definitely be open to suggestions! The problem is that I need to find all sites and dates containing at least one unhealthy orbit, which involves filtering on a few fields and then grouping (... WHERE NOT is_healthy_data AND is_fresh AND increment_start >= '{start_date}' AND increment_start < '{end_date}' GROUP BY shoppertrak_site_id, increment_date;), and I need that result twice: once to get the site/date pairs to query the API for, and again to find the healthy data for those same sites/dates (which is why the temp table gets joined back to the regular table). To avoid creating a temporary table, I have a few options:

- I could simply do the filtering/grouping twice, but that's inefficient.
- I could use the values from the first query in an IN condition, but that would require either concatenating the site id and date into a single string (WHERE CONCAT(shoppertrak_site_id, increment_start::DATE) IN (...)), which feels messy, or doing a multicolumn check (WHERE (shoppertrak_site_id, increment_start::DATE) IN (SELECT site1, date1 UNION SELECT site2, date2 UNION ...)), which has atrocious syntax.
- I could skip the filtering/grouping on the server entirely, pull all of the data at once, and filter/group in Python. That would probably require Pandas, though, which the code hasn't needed so far, and it's of course space inefficient.

Of all these options I chose the temporary table, but I could be persuaded to do the last one. What do you think?
recoverable_site_dates = self.redshift_client.execute_query(
    REDSHIFT_RECOVERABLE_QUERY
)
Perhaps move this down; I had to hunt around for where this variable was set.
Unfortunately it has to go in between the temporary table creation/deletion, so it could only really move down one line.
self.redshift_client.execute_transaction([(update_query, None)])

if results:
    encoded_records = self.avro_encoder.encode_batch(results)
Will Avro validate these records? They look like they're missing fields from the "full" ones.
Sorry, what do you mean by the "full" ones? They should have all the same fields as the all_sites records.
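The check being discussed (whether every record carries the same fields as the all_sites records) can be sketched as follows. This is not the real Avro encoder; the field names and the helper are hypothetical, mimicking the schema validation Avro would perform at encode time.

```python
# Assumed required fields for a visits record (illustrative only)
REQUIRED_FIELDS = {"shoppertrak_site_id", "increment_start", "visits"}


def validate_batch(records):
    """Raise if any record is missing a required field."""
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return True


# A record with all fields passes; a record with a subset would raise
validate_batch(
    [{"shoppertrak_site_id": "s1", "increment_start": "2024-01-01T10:00", "visits": 2}]
)
```

A real Avro encoder rejects records that don't match the schema in the same spirit, so if the fresh_row records really were a subset, encoding would fail loudly rather than silently dropping fields.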
Oh, hmm, maybe I misread this, but I thought the fresh_row objects were only getting a subset of the fields.
👍