Query Redshift to find missing and unrecovered data from the past 30 days and query ShopperTrak for it #2
Conversation
I've got a few questions about approach/structure, but the code looks solid -- no feedback on that. As far as I can tell there are no blockers; I just wanted to check a few things before giving formal approval.
lib/pipeline_controller.py
@@ -36,6 +60,14 @@ def __init__(self):

    def run(self):
        """Main method for the class -- runs the pipeline"""
        self.logger.info("Getting regular branch hours from Redshift")
        location_hours_dict = self.get_location_hours_dict()
        self.shoppertrak_api_client = ShopperTrakApiClient(
What is the advantage of moving this out of the constructor? The constructor feels like a more natural fit to me.
I moved it out because the location_hours_dict has to be set first. I've since changed it so that the client is created in the constructor with an empty location_hours_dict, which is then set further down in run().
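A minimal sketch of the pattern described above: the client is built in the constructor with an empty hours dict, which run() replaces once the real hours have been fetched. All class and method names here follow the diff, but the bodies are assumptions for illustration.

```python
class ShopperTrakApiClient:
    """Stand-in for the real API client, which needs location hours."""

    def __init__(self, location_hours_dict):
        self.location_hours_dict = location_hours_dict


class PipelineController:
    def __init__(self):
        # Client exists from construction time, with empty hours for now
        self.shoppertrak_api_client = ShopperTrakApiClient({})

    def get_location_hours_dict(self):
        # Stand-in for the real Redshift query
        return {"site_1": ("09:00", "17:00")}

    def run(self):
        # The real hours are only available here, so they are set late
        hours = self.get_location_hours_dict()
        self.shoppertrak_api_client.location_hours_dict = hours


controller = PipelineController()
controller.run()
print(controller.shoppertrak_api_client.location_hours_dict)
```

This keeps the client's lifetime tied to the controller's while deferring only the data it depends on.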
create_table_query = build_redshift_create_table_query(
    self.redshift_visits_table, start_date, end_date
)
self.redshift_client.connect()
self.redshift_client.execute_transaction([(create_table_query, None)])
recoverable_site_dates = self.redshift_client.execute_query(
    REDSHIFT_RECOVERABLE_QUERY
)
known_data = self.redshift_client.execute_query(
    build_redshift_known_query(self.redshift_visits_table)
)
self.redshift_client.execute_transaction([(REDSHIFT_DROP_QUERY, None)])
What is the advantage of creating the temporary table here? It seems like it's being joined back to the source table, so could this be done with a single query?
Yeah, I don't love it and went back and forth -- I'd definitely be open to suggestions! The problem is that I need to find all sites and dates containing at least one unhealthy orbit, which involves filtering on a few fields and then grouping (... WHERE NOT is_healthy_data AND is_fresh AND increment_start >= '{start_date}' AND increment_start < '{end_date}' GROUP BY shoppertrak_site_id, increment_date;), and I need that result twice: once to get the site/date pairs to query the API for, and again to find the healthy data for those same sites/dates (which is why the temp table gets joined back to the regular table). To avoid creating a temporary table, I have a few options:

- I could simply do the filtering/grouping twice, but that's inefficient.
- I could use the values from the first query in an IN condition, but that would require either concatenating the site id and date into a single string (WHERE CONCAT(shoppertrak_site_id, increment_start::DATE) IN (...)), which feels messy, or doing a multicolumn check (WHERE (shoppertrak_site_id, increment_start::DATE) IN (SELECT site1, date1 UNION SELECT site2, date2 UNION ...)), which has atrocious syntax.
- I could skip the filtering/grouping on the server entirely, pull all of the data at once, and filter/group in Python. That would probably require Pandas, though, which the code hasn't needed so far, and it's of course space inefficient.

Of all these options I chose the temporary table, but I could be persuaded to do the last one. What do you think?
recoverable_site_dates = self.redshift_client.execute_query(
    REDSHIFT_RECOVERABLE_QUERY
)
Perhaps move this down; I had to hunt around for where this variable was set.
Unfortunately it has to go in between the temporary table creation/deletion, so it could only really move down one line.
self.redshift_client.execute_transaction([(update_query, None)])

if results:
    encoded_records = self.avro_encoder.encode_batch(results)
Will Avro validate these records? They look like they're missing fields from the "full" ones.
Sorry, what do you mean by the "full" ones? They should have all the same fields as the all_sites records.
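The check being discussed (whether every record carries the same fields as the all_sites records) can be sketched as follows. This is not the real Avro encoder; the field names and the helper are hypothetical, mimicking the schema validation Avro would perform at encode time.

```python
# Assumed required fields for a visits record (illustrative only)
REQUIRED_FIELDS = {"shoppertrak_site_id", "increment_start", "visits"}


def validate_batch(records):
    """Raise if any record is missing a required field."""
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return True


# A record with all fields passes; a record with a subset would raise
validate_batch(
    [{"shoppertrak_site_id": "s1", "increment_start": "2024-01-01T10:00", "visits": 2}]
)
```

A real Avro encoder rejects records that don't match the schema in the same spirit, so if the fresh_row records really were a subset, encoding would fail loudly rather than silently dropping fields.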
Oh, hmm, maybe I misread this, but I thought the fresh_row objects were only getting a subset of the fields.
👍