feat(metrics): Add method namespacing to indexer cache #58178

john-z-yang · 2023-10-16T17:51:55Z

This PR

Add a namespace parameters to the make_cache_key function, and depending on the feature flag (set to false right now) will create cache keys from the new namespaces that are separate for the read path and write path.
Functionally does not change anything unless the flag is set to True

Next change in this stack PR
#58414

Motivation

Click to expand

The indexer is making the following assumptions with Memcached:

We are only adding key/value pairs to Memcached during bulk_record (the indexer’s “write path”)
Key/value pairs are gone from Memcached after 2 hours

It’s very easy to break 1 or 2 inadvertently. For example, we are inserting key/value pairs back into the cache during a resolve cache miss. Doing so is justified as some strings could be more read heavy than write heavy, such as users with automated scripts that are scanning for a specific key to show up.

Background

Click to expand

How accessing the cache works today in the Indexer

Method	Key exists?	Behaviour
bulk_record	✅	- Use the value from cache - Emit `CACHE_READ` in header
bulk_record	❌	- Fetch value from Postgres - Emit DB_READ in header - Write {string} → {id} to cache
resolve	✅	- Use value from cache
resolve	❌	- Fetch value from Postgres - Write {string} → {id} to cache

where bulk_record = the code path for writing indexed strings

and resolve = the code path for querying strings

Goal

Click to expand

We store a timestamp of when the cache is stored in the value, and also partition the key between the write and read path.

(writer + string) → (integer + timestamp)

How this works

Creates a distinction between, and handles appropriately, both the read and write path of replenishing the cache
Log a metric in this scenario
Simultaneously makes us aware of, and mitigates, unexpected cache behaviors
Actually fixes the issue where we were replenishing the cache on the read path

Method	Key exists?	Timestamp valid?	Behaviour
bulk_record	✅	✅	- Use value from cache - Emit CACHE_READ in header
bulk_record	✅	❌	- Fetch value from Postgres - Emit DB_READ in header - Write br:{string} → {id}:{timestamp} to cache - Increment metric
bulk_record	❌	-	- Fetch value from Postgres - Emit DB_READ in header - Write br:{string} → {id}:{timestamp} to cache
resolve	✅	✅	- Use value from cache
resolve	✅	❌	- Use value from cache - Write res:{string} → {id}:{timestamp} to cache - Increment metric
resolve	❌	-	- Fetch value from Postgres - Write res:{string} → {id}:{timestamp} to cache

Deployment Plan

Click to expand

[THIS PR] Deploy read paths that can toggle between reading from (string) → (integer) and (writer + string) → (integer + timestamp) via a sentry option
Deploy bulk_record changes to write to both types of key/value pairs ((string) → (integer) and (writer + string) → (integer + timestamp)) on each bulk_record lookup
Toggle the option deployed in step 1 to read from (writer + string) → (integer + timestamp)
(observability) Deploy logic to emit metric when timestamp that is greater than 2-3 hours
(mitigation) Deploy logic to reject key/value pairs when timestamp is greater than 2-3 hours
Add back the resolve path cache replenishment

Local Testing

Click to expand

Started sentry with memcache and kcat on snuba-generic-metrics

kcat -b localhost:9092 -t snuba-generic-metrics -C \
  -f '\nKey (%K bytes): %k
  Value (%S bytes): %s
  Timestamp: %T
  Partition: %p
  Offset: %o
  Headers: %h\n'

Send first batch of 3 messages with the send_metrics script

python bin/send_metrics.py --use-cases custom

Confirm that all 3 values have fetch type of FIRST_SEEN ("f")

Key (-1 bytes):
  Value (495 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"init","67563":"metric_e2e_custom_counter_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"f":{"67563":"metric_e2e_custom_counter_k_AIXH44IB"},"c":{"65555":"c:custom/custom@none"}},"use_case_id":"custom","metric_id":65555,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"c","value":1,"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 408
  Headers: mapping_sources=cfh,metric_type=c

Key (-1 bytes):
  Value (491 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"errored","67562":"metric_e2e_custom_set_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"f":{"67562":"metric_e2e_custom_set_k_AIXH44IB"},"c":{"65556":"s:custom/error@none"}},"use_case_id":"custom","metric_id":65556,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"s","value":[3],"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 409
  Headers: mapping_sources=cfh,metric_type=s

Key (-1 bytes):
  Value (502 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"healthy","67561":"metric_e2e_custom_dist_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"65558":"d:custom/duration@second"},"f":{"67561":"metric_e2e_custom_dist_k_AIXH44IB"}},"use_case_id":"custom","metric_id":65558,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"d","value":[4,5,6],"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 410
  Headers: mapping_sources=cfh,metric_type=d
% Reached end of topic snuba-generic-metrics [0] at offset 411

Send the same batch of 3 messages with the send_metrics script

python bin/send_metrics.py --use-cases custom --rand-str AIXH44IB

Confirm that the new messages all have fetch type of CACHE_HIT ("c")

Key (-1 bytes):
  Value (485 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"errored","67562":"metric_e2e_custom_set_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"67562":"metric_e2e_custom_set_k_AIXH44IB","65556":"s:custom/error@none"}},"use_case_id":"custom","metric_id":65556,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"s","value":[3],"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 411
  Headers: mapping_sources=ch,metric_type=s

Key (-1 bytes):
  Value (496 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"healthy","67561":"metric_e2e_custom_dist_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"65558":"d:custom/duration@second","67561":"metric_e2e_custom_dist_k_AIXH44IB"}},"use_case_id":"custom","metric_id":65558,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"d","value":[4,5,6],"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 412
  Headers: mapping_sources=ch,metric_type=d

Key (-1 bytes):
  Value (489 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"init","67563":"metric_e2e_custom_counter_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"67563":"metric_e2e_custom_counter_k_AIXH44IB","65555":"c:custom/custom@none"}},"use_case_id":"custom","metric_id":65555,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"c","value":1,"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 413
  Headers: mapping_sources=ch,metric_type=c
% Reached end of topic snuba-generic-metrics [0] at offset 414

codecov · 2023-10-16T20:22:50Z

Codecov Report

Merging #58178 (90c62ae) into master (ca804de) will increase coverage by 0.00%.
Report is 172 commits behind head on master.
The diff coverage is 72.72%.

❗ Current head 90c62ae differs from pull request most recent head 3caa6a1. Consider uploading reports for the commit 3caa6a1 to get more accurate results

@@           Coverage Diff            @@
##           master   #58178    +/-   ##
========================================
  Coverage   80.57%   80.57%            
========================================
  Files        5144     5146     +2     
  Lines      224845   225018   +173     
  Branches    37898    37923    +25     
========================================
+ Hits       181166   181310   +144     
- Misses      38143    38168    +25     
- Partials     5536     5540     +4

Files	Coverage Δ
src/sentry/sentry_metrics/indexer/cache.py	`87.09% <72.72%> (-7.91%)`	⬇️

... and 30 files with indirect coverage changes

volokluev · 2023-10-16T23:21:56Z

I think some more context is needed in the PR description:

What is the higher level goal of this PR?
What is the blast radius of this PR? How is it going to be rolled out?
Can this PR be safely reverted?

evanh

I agree with Volo, it would be smart to document the rollout/rollback plan for this. When you enable the new namespaces you will effectively be deleting all the existing keys. What are the implications of that? Rolling back will have a similar problem: the data will be split between two keys. What happens when the cache switches back and starts using the old keys?

ayirr7 · 2023-10-18T16:35:55Z

(Agree with the points regarding documentation above). Overall this looks good to me. From my understanding, the local cache for testing is a Dummy, but I wonder if there's a way around that for testing this locally. If it is not ultimately possible, I think the documentation points above will be quite helpful to have.

john-z-yang · 2023-10-18T17:50:07Z

@volokluev @evanh I updated the description let me know if this make more sense!
This has also been discussed in last week's architecture meeting, I can refer you to the doc on slack if you would like more context

evanh

This makes more sense, thank you!

) # This PR - Read the sentry option `sentry-metrics.indexer.write-new-cache-namespace` (set to False default), and if it's True, write to the new cache schema along with the old cache schema. **Previous change in this stack PR** #58178

github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 16, 2023

vercel bot deployed to Preview October 16, 2023 17:54 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from a32f323 to b84fc21 Compare October 16, 2023 18:16

john-z-yang changed the title ~~add namespace parameter to indexer cache~~ feat(metrics): Add method namespacing to indexer cache Oct 16, 2023

vercel bot deployed to Preview October 16, 2023 18:44 View deployment

Base automatically changed from john/update-caching-idxer-schema-read-option to master October 16, 2023 19:50

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from b84fc21 to ab89fae Compare October 16, 2023 19:51

vercel bot deployed to Preview October 16, 2023 19:53 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from ab89fae to dbc9960 Compare October 16, 2023 20:20

vercel bot deployed to Preview October 16, 2023 20:23 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from dbc9960 to a368310 Compare October 16, 2023 20:51

vercel bot deployed to Preview October 16, 2023 20:54 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from a368310 to 0dfad70 Compare October 16, 2023 23:12

john-z-yang requested a review from ayirr7 October 16, 2023 23:13

john-z-yang marked this pull request as ready for review October 16, 2023 23:13

john-z-yang requested a review from a team as a code owner October 16, 2023 23:13

vercel bot deployed to Preview October 16, 2023 23:14 View deployment

evanh reviewed Oct 18, 2023

View reviewed changes

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 0dfad70 to c3eab94 Compare October 18, 2023 20:33

vercel bot deployed to Preview October 18, 2023 20:37 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from c3eab94 to 740ca29 Compare October 18, 2023 20:59

vercel bot deployed to Preview October 18, 2023 21:01 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 740ca29 to d73c1b9 Compare October 18, 2023 21:19

vercel bot deployed to Preview October 18, 2023 21:21 View deployment

ayirr7 approved these changes Oct 19, 2023

View reviewed changes

evanh approved these changes Oct 19, 2023

View reviewed changes

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from d73c1b9 to 5e76277 Compare October 20, 2023 16:58

vercel bot deployed to Preview October 20, 2023 17:02 View deployment

john-z-yang mentioned this pull request Oct 20, 2023

feat(metrics): Add ability for indexer cache to write new schema #58414

Merged

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 5e76277 to 57e4ecf Compare October 20, 2023 19:06

vercel bot deployed to Preview October 20, 2023 19:08 View deployment

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 57e4ecf to 90c62ae Compare October 20, 2023 20:11

vercel bot deployed to Preview October 20, 2023 20:14 View deployment

add namespace parameter to indexer cache

3caa6a1

john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 90c62ae to 3caa6a1 Compare October 24, 2023 21:53

vercel bot deployed to Preview October 24, 2023 21:55 View deployment

john-z-yang enabled auto-merge (squash) October 24, 2023 22:22

john-z-yang merged commit fd736cc into master Oct 24, 2023
47 checks passed

john-z-yang deleted the john/update-caching-idxer-schema-read branch October 24, 2023 22:31

github-actions bot locked and limited conversation to collaborators Nov 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(metrics): Add method namespacing to indexer cache #58178

feat(metrics): Add method namespacing to indexer cache #58178

john-z-yang commented Oct 16, 2023 •

edited

Loading

codecov bot commented Oct 16, 2023 •

edited

Loading

volokluev commented Oct 16, 2023

evanh left a comment

ayirr7 commented Oct 18, 2023 •

edited

Loading

john-z-yang commented Oct 18, 2023 •

edited

Loading

evanh left a comment

feat(metrics): Add method namespacing to indexer cache #58178

feat(metrics): Add method namespacing to indexer cache #58178

Conversation

john-z-yang commented Oct 16, 2023 • edited Loading

This PR

Motivation

Background

Goal

How this works

Deployment Plan

Local Testing

codecov bot commented Oct 16, 2023 • edited Loading

Codecov Report

volokluev commented Oct 16, 2023

evanh left a comment

Choose a reason for hiding this comment

ayirr7 commented Oct 18, 2023 • edited Loading

john-z-yang commented Oct 18, 2023 • edited Loading

evanh left a comment

Choose a reason for hiding this comment

john-z-yang commented Oct 16, 2023 •

edited

Loading

codecov bot commented Oct 16, 2023 •

edited

Loading

ayirr7 commented Oct 18, 2023 •

edited

Loading

john-z-yang commented Oct 18, 2023 •

edited

Loading