MC-1608 Add job for New Tab engagement by corpus_item_id #6743

Open: mmiermans wants to merge 7 commits into main from MC-1608-newtab-merino-corpus-items

Collaborator

@mmiermans mmiermans commented Jan 2, 2025

Description

This PR adds a column corpus_item_id to the New Tab engagement data consumed by Merino.

The job will continue to aggregate on scheduled_corpus_item_id, which identifies an item that was scheduled for a particular date. In Q1, we will start recommending some items that are not scheduled for a particular date and therefore have no scheduled_corpus_item_id, so we want to aggregate clicks and impressions by both corpus_item_id and scheduled_corpus_item_id. In the backend, every scheduled_corpus_item_id maps to exactly one corpus_item_id. In the Glean data, corpus_item_id is often null because it is only emitted by Firefox Desktop >= 134.
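The backfill described above can be sketched as follows. This is a minimal illustration, not the production query: it runs a simplified version of the lookup-and-COALESCE logic on an in-memory SQLite database through Python's `sqlite3` module, and the `events` table and its sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample events; the real job reads Glean pings from BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "scheduled_corpus_item_id TEXT, corpus_item_id TEXT, event_name TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("sched-1", "item-A", "impression"),  # client that emits corpus_item_id
        ("sched-1", None, "impression"),      # older client, corpus_item_id missing
        ("sched-1", None, "click"),           # older client, corpus_item_id missing
        (None, "item-B", "impression"),       # item that was never scheduled
    ],
)

BACKFILL_SQL = """
WITH lookup AS (
  -- MAX ignores NULLs, so it returns the single non-null corpus_item_id
  -- for each scheduled_corpus_item_id, if any event carried one.
  SELECT scheduled_corpus_item_id, MAX(corpus_item_id) AS corpus_item_id
  FROM events
  WHERE scheduled_corpus_item_id IS NOT NULL
  GROUP BY scheduled_corpus_item_id
)
SELECT
  e.scheduled_corpus_item_id,
  COALESCE(e.corpus_item_id, l.corpus_item_id) AS corpus_item_id,
  SUM(CASE WHEN e.event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count,
  SUM(CASE WHEN e.event_name = 'click' THEN 1 ELSE 0 END) AS click_count
FROM events e
LEFT JOIN lookup l
  ON e.scheduled_corpus_item_id = l.scheduled_corpus_item_id
GROUP BY
  e.scheduled_corpus_item_id,
  COALESCE(e.corpus_item_id, l.corpus_item_id)
ORDER BY corpus_item_id
"""

rows = conn.execute(BACKFILL_SQL).fetchall()
for row in rows:
    print(row)
```

In this sample, the two events without a corpus_item_id are folded into the `item-A` row, while the unscheduled `item-B` keeps a null scheduled_corpus_item_id.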

QA

Below is an analysis of the results before and after the changes in the PR.

✅ Engagement metrics remain similar

There aren't yet any recommendations on New Tab that lack a scheduled_corpus_item_id, so engagement before and after is very similar, as expected. Small differences in clicks and impressions are expected because both runs query the live table at different times.

| Metric | Before | After |
| --- | --- | --- |
| Row Count | 4940 | 4940 |
| Distinct scheduled_corpus_item_id | 589 | 589 |
| Distinct region | 8 | 8 |
| Total Impression Count | 375,254,524 | 375,282,348 |
| Total Click Count | 1,969,843 | 1,970,013 |
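To put the before/after totals in proportion, here is a small sketch that computes the relative change from the reported figures. The helper `relative_change` is a hypothetical name introduced only for this illustration.

```python
# Totals copied from the QA comparison above.
before = {"impressions": 375_254_524, "clicks": 1_969_843}
after = {"impressions": 375_282_348, "clicks": 1_970_013}


def relative_change(old: int, new: int) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


changes = {name: relative_change(before[name], after[name]) for name in before}
for name, change in changes.items():
    print(f"{name}: {change:.4%}")
```

Both totals move by less than 0.01%, which is consistent with the two runs reading the live table a short time apart.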

✅ There is good coverage of corpus_item_id

The vast majority of rows, and nearly all impressions on New Tab, have a corpus_item_id.

| Metric | Value |
| --- | --- |
| Distinct corpus_item_id | 511 |
| % of impressions with a corpus_item_id | 99.92% |
| % of rows with a corpus_item_id | 95.99% |
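As a rough sketch of how the two coverage percentages differ, the snippet below computes both from a list of output rows. The function name `coverage` and the sample rows are hypothetical; the actual analysis query is not shown in this PR.

```python
def coverage(rows):
    """Return the % of rows and % of impressions that have a corpus_item_id."""
    with_id = [r for r in rows if r["corpus_item_id"] is not None]
    total_impressions = sum(r["impression_count"] for r in rows)
    impressions_with_id = sum(r["impression_count"] for r in with_id)
    return {
        "pct_rows": 100 * len(with_id) / len(rows),
        "pct_impressions": 100 * impressions_with_id / total_impressions,
    }


# Hypothetical rows: the one row without a corpus_item_id has very few impressions.
sample_rows = [
    {"corpus_item_id": "item-A", "impression_count": 900},
    {"corpus_item_id": "item-B", "impression_count": 90},
    {"corpus_item_id": "item-C", "impression_count": 9},
    {"corpus_item_id": None, "impression_count": 1},
]

result = coverage(sample_rows)
print(result)
```

Impression coverage exceeds row coverage whenever the rows without a corpus_item_id carry a below-average share of impressions, which matches the pattern in the reported numbers.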

Related Tickets & Documents

Reviewer, please follow this checklist

@mmiermans mmiermans force-pushed the MC-1608-newtab-merino-corpus-items branch 2 times, most recently from 8ac9083 to 79804ce Compare January 2, 2025 14:22

This job is similar to bqetl_merino_newtab_extract_to_gcs, but it aggregates
on corpus_item_id instead of scheduled_corpus_item_id. The latter (as the name implies)
identifies a scheduled item. We want to start aggregating on the item itself
because we will stop scheduling them for specific dates as part of an experiment
in Q1.
@mmiermans mmiermans force-pushed the MC-1608-newtab-merino-corpus-items branch from f271ea5 to f26af98 Compare January 2, 2025 14:58

@mmiermans mmiermans requested a review from chelseybeck January 6, 2025 17:17
@mmiermans mmiermans marked this pull request as ready for review January 6, 2025 17:18
@dataops-ci-bot

Integration report for "Merge branch 'main' into MC-1608-newtab-merino-corpus-items"

sql.diff

diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/apple_ads_external/ios_app_campaign_stats_v1/bigconfig.yml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/apple_ads_external/ios_app_campaign_stats_v1/bigconfig.yml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/apple_ads_external/ios_app_campaign_stats_v1/bigconfig.yml	2025-01-06 17:19:51.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/apple_ads_external/ios_app_campaign_stats_v1/bigconfig.yml	2025-01-06 17:23:28.000000000 +0000
@@ -1,7 +1,6 @@
 type: BIGCONFIG_FILE
-
 tag_deployments:
-  - collection:
+- collection:
       name: Operational Checks
       notification_channels:
         - slack: '#de-bigeye-triage'
@@ -24,10 +23,10 @@
         metrics:
           - saved_metric_id: is_not_null
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
       - column_selectors:
@@ -35,10 +34,10 @@
         metrics:
           - saved_metric_id: is_2_char_len
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
       - column_selectors:
@@ -46,17 +45,17 @@
         metrics:
           - saved_metric_id: volume
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
           - saved_metric_id: freshness
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/google_ads_derived/android_app_campaign_stats_v1/bigconfig.yml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/google_ads_derived/android_app_campaign_stats_v1/bigconfig.yml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/google_ads_derived/android_app_campaign_stats_v1/bigconfig.yml	2025-01-06 17:19:51.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/google_ads_derived/android_app_campaign_stats_v1/bigconfig.yml	2025-01-06 17:23:29.000000000 +0000
@@ -1,7 +1,6 @@
 type: BIGCONFIG_FILE
-
 tag_deployments:
-  - collection:
+- collection:
       name: Operational Checks
       notification_channels:
         - slack: '#de-bigeye-triage'
@@ -23,10 +22,10 @@
         metrics:
           - saved_metric_id: is_not_null
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
       - column_selectors:
@@ -34,10 +33,10 @@
         metrics:
           - saved_metric_id: is_2_char_len
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
       - column_selectors:
@@ -45,17 +44,17 @@
         metrics:
           - saved_metric_id: volume
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
           - saved_metric_id: freshness
             lookback:
-              lookback_type: DATA_TIME
               lookback_window:
                 interval_type: DAYS
                 interval_value: 28
+        lookback_type: DATA_TIME
             rct_overrides:
               - date
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql	2025-01-06 17:19:52.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/checks.sql	2025-01-06 17:19:52.000000000 +0000
@@ -1,9 +1,6 @@
 -- macro checks
 
 #fail
-{{ not_null(["scheduled_corpus_item_id"]) }}
-
-#fail
 {{ not_null(["impression_count"]) }}
 
 #fail
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql	2025-01-06 17:19:52.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/query.sql	2025-01-06 17:19:52.000000000 +0000
@@ -3,7 +3,7 @@
     submission_timestamp,
     document_id,
     events,
-    normalized_country_code,
+    normalized_country_code
   FROM
     `moz-fx-data-shared-prod.firefox_desktop_live.newtab_v1`
   WHERE
@@ -27,48 +27,82 @@
       unnested_events.extra,
       'scheduled_corpus_item_id'
     ) AS scheduled_corpus_item_id,
+    mozfun.map.get_key(unnested_events.extra, 'corpus_item_id') AS corpus_item_id,
     TIMESTAMP_MILLIS(
       SAFE_CAST(mozfun.map.get_key(unnested_events.extra, 'recommended_at') AS INT64)
     ) AS recommended_at
   FROM
-    deduplicated_pings,
-    UNNEST(events) AS unnested_events
-    --filter to Pocket events
+    deduplicated_pings dp
+  CROSS JOIN
+    UNNEST(dp.events) AS unnested_events
   WHERE
+    -- Filter to Pocket events only
     unnested_events.category = 'pocket'
     AND unnested_events.name IN ('impression', 'click')
-    --keep only data with a non-null scheduled corpus item ID
-    AND (mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL)
-    AND SAFE_CAST(mozfun.map.get_key(unnested_events.extra, 'recommended_at') AS INT64) IS NOT NULL
+    -- Keep only data with a non-null scheduled_corpus_item_id or corpus_item_id
+    AND (
+      mozfun.map.get_key(unnested_events.extra, 'scheduled_corpus_item_id') IS NOT NULL
+      OR mozfun.map.get_key(unnested_events.extra, 'corpus_item_id') IS NOT NULL
+    )
+    -- Only keep the last day's data.
+    AND TIMESTAMP_MILLIS(
+      SAFE_CAST(mozfun.map.get_key(unnested_events.extra, 'recommended_at') AS INT64)
+    ) > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
 ),
-aggregated_events AS (
+/* Map scheduled_corpus_item_id to corpus_item_id.
+ * In the backend all scheduled_corpus_item_id with the same value map to the same corpus_item_id.
+ * In these events some corpus_item_id are null because it is only emitted by Firefox >= 134.
+ */
+aggregator_scheduled_lookup AS (
   SELECT
     scheduled_corpus_item_id,
-    normalized_country_code,
-    SUM(CASE WHEN event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count,
-    SUM(CASE WHEN event_name = 'click' THEN 1 ELSE 0 END) AS click_count
+    -- Use MAX to get a non-null corpus_item_id if one exists. All non-null values are the same.
+    MAX(corpus_item_id) AS corpus_item_id
   FROM
     flattened_newtab_events
   WHERE
-    recommended_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
+    scheduled_corpus_item_id IS NOT NULL
   GROUP BY
-    scheduled_corpus_item_id,
-    normalized_country_code
+    scheduled_corpus_item_id
 ),
+/* Aggregate clicks and impressions by scheduled_corpus_item_id, corpus_item_id, normalized_country_code. */
+aggregated_events AS (
+  SELECT
+    fe.scheduled_corpus_item_id,
+    -- fe.corpus_item_id can be NULL because it's only emitted by Firefox >= 134.
+    COALESCE(fe.corpus_item_id, asl.corpus_item_id) AS corpus_item_id,
+    fe.normalized_country_code,
+    SUM(CASE WHEN fe.event_name = 'impression' THEN 1 ELSE 0 END) AS impression_count,
+    SUM(CASE WHEN fe.event_name = 'click' THEN 1 ELSE 0 END) AS click_count
+  FROM
+    flattened_newtab_events fe
+  LEFT JOIN
+    aggregator_scheduled_lookup asl
+    ON fe.scheduled_corpus_item_id = asl.scheduled_corpus_item_id
+  GROUP BY
+    1,
+    2,
+    3
+),
+/* Aggregate clicks and impressions across all countries. */
 global_aggregates AS (
   SELECT
     scheduled_corpus_item_id,
+    corpus_item_id,
     CAST(NULL AS STRING) AS region,
     SUM(impression_count) AS impression_count,
     SUM(click_count) AS click_count
   FROM
     aggregated_events
   GROUP BY
-    scheduled_corpus_item_id
+    scheduled_corpus_item_id,
+    corpus_item_id
 ),
+/* Aggregate clicks and impressions for country-specific ranking in Merino. */
 country_aggregates AS (
   SELECT
     scheduled_corpus_item_id,
+    corpus_item_id,
     normalized_country_code AS region,
     impression_count,
     click_count
@@ -79,6 +113,7 @@
     -- https://mozilla-hub.atlassian.net/wiki/x/JY3LB
     normalized_country_code IN ('US', 'CA', 'DE', 'CH', 'AT', 'BE', 'GB', 'IE')
 )
+/* Combine the "global" (no region) with the "regional" breakdown. */
 SELECT
   *
 FROM
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml	2025-01-06 17:19:52.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/telemetry_derived/newtab_merino_extract_v1/schema.yaml	2025-01-06 17:19:52.000000000 +0000
@@ -3,6 +3,9 @@
   name: scheduled_corpus_item_id
   type: STRING
 - mode: NULLABLE
+  name: corpus_item_id
+  type: STRING
+- mode: NULLABLE
   name: impression_count
   type: INTEGER
 - mode: NULLABLE
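The global and country-level roll-ups in the query.sql diff above can be sketched in plain Python. This is an illustration under assumptions, not the job itself: the function name `build_extract`, the tuple layout, and the sample rows are hypothetical, and the final combining step is only partially visible in the diff, so concatenating the two sets of rows here is an assumption based on the "Combine" comment.

```python
from collections import defaultdict

# Country list copied from the query's normalized_country_code filter.
MERINO_COUNTRIES = {"US", "CA", "DE", "CH", "AT", "BE", "GB", "IE"}


def build_extract(aggregated_events):
    """Combine global rows (region None) with per-country rows.

    Each input row is a tuple:
    (scheduled_corpus_item_id, corpus_item_id, country, impressions, clicks).
    """
    totals = defaultdict(lambda: [0, 0])
    for scheduled_id, corpus_id, _country, impressions, clicks in aggregated_events:
        totals[(scheduled_id, corpus_id)][0] += impressions
        totals[(scheduled_id, corpus_id)][1] += clicks

    # Global rows sum over every country, including ones outside the list.
    global_rows = [
        (scheduled_id, corpus_id, None, impressions, clicks)
        for (scheduled_id, corpus_id), (impressions, clicks) in totals.items()
    ]
    # Country rows are kept only for the listed countries.
    country_rows = [row for row in aggregated_events if row[2] in MERINO_COUNTRIES]
    return global_rows + country_rows


sample = [
    ("sched-1", "item-A", "US", 10, 1),
    ("sched-1", "item-A", "FR", 5, 0),  # counted globally, no regional row
    (None, "item-B", "DE", 3, 1),       # unscheduled item
]
extract = build_extract(sample)
for row in extract:
    print(row)
```

Grouping on the pair (scheduled_corpus_item_id, corpus_item_id) keeps unscheduled items as their own rows even though their scheduled_corpus_item_id is null.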
