
Query Redshift to find missing and unrecovered data from the past 30 days and query ShopperTrak for it #2

Merged
merged 4 commits into from
Apr 24, 2024

Conversation

aaronfriedman6
Contributor

No description provided.


@mwbenowitz mwbenowitz left a comment


I've got a few questions about approach/structure, but the code itself looks solid; no feedback on that front.

As far as I can tell there are no blockers, but I just wanted to check a few things before giving formal approval.

config/production.yaml
@@ -36,6 +60,14 @@ def __init__(self):

def run(self):
"""Main method for the class -- runs the pipeline"""
self.logger.info("Getting regular branch hours from Redshift")
location_hours_dict = self.get_location_hours_dict()
self.shoppertrak_api_client = ShopperTrakApiClient(


What is the advantage of moving this out of the constructor? Keeping it there feels like a more natural fit to me.

Contributor Author


I moved it out because the location_hours_dict has to be set now. I just changed it so that the client is created in the constructor with an empty location_hours_dict which is then set down below.
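In other words, the client is constructed once with an empty dict, and the dict is filled in during `run()`. A minimal sketch of that pattern, using the class and method names from the diff (`ShopperTrakApiClient`, `get_location_hours_dict`) but with the surrounding structure and the shape of the hours dict assumed:

```python
# Minimal sketch of "construct with empty dict, populate later".
# Class/method names come from the diff; everything else is assumed.

class ShopperTrakApiClient:
    def __init__(self, location_hours_dict):
        self.location_hours_dict = location_hours_dict


class PipelineController:
    def __init__(self):
        # Client is created up front with an empty location_hours_dict...
        self.shoppertrak_api_client = ShopperTrakApiClient(location_hours_dict={})

    def get_location_hours_dict(self):
        # Stand-in for the Redshift lookup of regular branch hours
        return {"site-1": ("09:00", "17:00")}

    def run(self):
        # ...and the dict is populated once the hours have been fetched
        self.shoppertrak_api_client.location_hours_dict = (
            self.get_location_hours_dict()
        )


controller = PipelineController()
controller.run()
```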

lib/pipeline_controller.py
Comment on lines +137 to +148
create_table_query = build_redshift_create_table_query(
self.redshift_visits_table, start_date, end_date
)
self.redshift_client.connect()
self.redshift_client.execute_transaction([(create_table_query, None)])
recoverable_site_dates = self.redshift_client.execute_query(
REDSHIFT_RECOVERABLE_QUERY
)
known_data = self.redshift_client.execute_query(
build_redshift_known_query(self.redshift_visits_table)
)
self.redshift_client.execute_transaction([(REDSHIFT_DROP_QUERY, None)])


What is the advantage of creating the temporary table here? It seems like it's being joined back to the source table so could this be done with a single query?

Contributor Author


Yeah I don't love it and went back and forth -- I'd definitely be open to suggestions! The problem is I need to find all sites and dates containing at least one unhealthy orbit, which involves filtering on a few fields and then grouping (... WHERE NOT is_healthy_data AND is_fresh AND increment_start >= '{start_date}' AND increment_start < '{end_date}' GROUP BY shoppertrak_site_id, increment_date;), twice: once to get the site/date pairs that I need to query the API for and then again to find the healthy data for those same sites/dates (which is why the temp table gets joined back to the regular table). To avoid creating a temporary table, I have a few options.

To avoid the temporary table, I have a few options:

- Just do the filtering/grouping twice, but that's inefficient.
- Use the values from the first query in an IN condition. That would require either concatenating the site id and date into a single string (WHERE CONCAT(shoppertrak_site_id, increment_start::DATE) IN (...)), which feels messy, or doing a multicolumn check (WHERE (shoppertrak_site_id, increment_start::DATE) IN (SELECT site1, date1 UNION SELECT site2, date2 UNION ...)), which has atrocious syntax.
- Skip the server-side filtering/grouping entirely: fetch all of the data at once and filter/group in Python. That would probably require me to use Pandas, though, which so far the code hasn't required, and it's of course space inefficient.

Of all these options I chose the temporary table but could be persuaded to do the last one. What do you think?
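For concreteness, here is a rough standard-library sketch of that last option (filter/group in Python). The row shape and field order are assumed for illustration, not taken from the actual Redshift schema:

```python
# Sketch of the "filter/group in Python" option, standard library only.
# Row shape (shoppertrak_site_id, increment_date, is_healthy_data, is_fresh)
# is assumed, not the project's real schema.
from datetime import date

# Pretend result of a single "get everything" Redshift query
rows = [
    ("site1", date(2024, 4, 1), False, True),   # unhealthy, fresh orbit
    ("site1", date(2024, 4, 1), True, True),    # healthy data, same site/date
    ("site2", date(2024, 4, 2), True, True),    # healthy data, other site/date
]

# Site/date pairs containing at least one unhealthy, fresh orbit
# (these are the pairs to re-query the ShopperTrak API for)
recoverable = {
    (site, d) for site, d, healthy, fresh in rows if not healthy and fresh
}

# Healthy data for those same site/date pairs (what the temp table join does)
known_data = [
    row for row in rows
    if row[2] and (row[0], row[1]) in recoverable
]
```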

Comment on lines +142 to +144
recoverable_site_dates = self.redshift_client.execute_query(
REDSHIFT_RECOVERABLE_QUERY
)


Perhaps move this down; I had to hunt around for where this variable was set.

Contributor Author


Unfortunately it has to go in between the temporary table creation/deletion, so it could only really move down one line.

self.redshift_client.execute_transaction([(update_query, None)])

if results:
encoded_records = self.avro_encoder.encode_batch(results)


Will Avro validate these records? They look like they're missing fields from the "full" ones.

Contributor Author


Sorry, what do you mean by the "full" ones? They should have all the same fields as the all_sites records.


oh, hmm, maybe I misread this but I thought the fresh_row objects were only getting a subset of the fields.
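The concern above is reasonable in general: Avro encoders typically do reject records that don't match the schema. As a toy illustration of that failure mode (the pipeline's real avro_encoder and schema aren't shown in this diff, so the field names here are invented):

```python
# Toy illustration only: field names are invented, and this hand-rolled
# check merely mimics how an Avro encoder rejects records that are
# missing required schema fields.
REQUIRED_FIELDS = {"shoppertrak_site_id", "increment_start", "visits"}


def encode_batch(records):
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record missing fields: {sorted(missing)}")
    # Stand-in for the real Avro serialization step
    return [str(r).encode() for r in records]


full_row = {
    "shoppertrak_site_id": "s1",
    "increment_start": "2024-04-01 09:00:00",
    "visits": 5,
}
encoded = encode_batch([full_row])  # all fields present: encodes fine

partial_row = {"shoppertrak_site_id": "s1"}  # subset of the fields
try:
    encode_batch([partial_row])
    raised = False
except ValueError:
    raised = True
```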


@mwbenowitz mwbenowitz left a comment


👍

@aaronfriedman6 aaronfriedman6 merged commit 096563a into main Apr 24, 2024
2 checks passed
@aaronfriedman6 aaronfriedman6 deleted the include-recovery-data branch April 24, 2024 14:32