Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(metrics): Add method namespacing to indexer cache #58178

Merged
merged 1 commit into from
Oct 24, 2023

Conversation

john-z-yang
Copy link
Member

@john-z-yang john-z-yang commented Oct 16, 2023

This PR

  • Add a namespace parameters to the make_cache_key function, and depending on the feature flag (set to false right now) will create cache keys from the new namespaces that are separate for the read path and write path.
  • Functionally does not change anything unless the flag is set to True

Next change in this stack PR
#58414

Motivation

Click to expand

The indexer is making the following assumptions with Memcached:

  1. We are only adding key/value pairs to Memcached during bulk_record (the indexer’s “write path”)
  2. Key/value pairs are gone from Memcached after 2 hours

It’s very easy to break 1 or 2 inadvertently. For example, we are inserting key/value pairs back into the cache during a resolve cache miss. Doing so is justified as some strings could be more read heavy than write heavy, such as users with automated scripts that are scanning for a specific key to show up.

Background

Click to expand

How accessing the cache works today in the Indexer

Method Key exists? Behaviour
bulk_record - Use the value from cache
- Emit CACHE_READ in header
bulk_record - Fetch value from Postgres
- Emit DB_READ in header
- Write {string} → {id} to cache
resolve - Use value from cache
resolve - Fetch value from Postgres
- Write {string} → {id} to cache

where bulk_record = the code path for writing indexed strings

and resolve = the code path for querying strings

Goal

Click to expand We store a timestamp of when the cache is stored in the value, and also partition the key between the write and read path.

(writer + string) → (integer + timestamp)

How this works

  • Creates a distinction between, and handles appropriately, both the read and write path of replenishing the cache
  • Log a metric in this scenario
  • Simultaneously makes us aware of, and mitigates, unexpected cache behaviors
  • Actually fixes the issue where we were replenishing the cache on the read path
Method Key exists? Timestamp valid? Behaviour
bulk_record - Use value from cache
- Emit CACHE_READ in header
bulk_record - Fetch value from Postgres
- Emit DB_READ in header
- Write br:{string} → {id}:{timestamp} to cache
- Increment metric
bulk_record - - Fetch value from Postgres
- Emit DB_READ in header
- Write br:{string} → {id}:{timestamp} to cache
resolve - Use value from cache
resolve - Use value from cache
- Write res:{string} → {id}:{timestamp} to cache
- Increment metric
resolve - - Fetch value from Postgres
- Write res:{string} → {id}:{timestamp} to cache

Deployment Plan

Click to expand
  1. [THIS PR] Deploy read paths that can toggle between reading from (string) → (integer) and (writer + string) → (integer + timestamp) via a sentry option
  2. Deploy bulk_record changes to write to both types of key/value pairs ((string) → (integer) and (writer + string) → (integer + timestamp)) on each bulk_record lookup
  3. Toggle the option deployed in step 1 to read from (writer + string) → (integer + timestamp)
  4. (observability) Deploy logic to emit metric when timestamp that is greater than 2-3 hours
  5. (mitigation) Deploy logic to reject key/value pairs when timestamp is greater than 2-3 hours
  6. Add back the resolve path cache replenishment

Local Testing

Click to expand

Started sentry with memcache and kcat on snuba-generic-metrics

kcat -b localhost:9092 -t snuba-generic-metrics -C \
  -f '\nKey (%K bytes): %k
  Value (%S bytes): %s
  Timestamp: %T
  Partition: %p
  Offset: %o
  Headers: %h\n'

Send first batch of 3 messages with the send_metrics script

python bin/send_metrics.py --use-cases custom

Confirm that all 3 values have fetch type of FIRST_SEEN ("f")

Key (-1 bytes):
  Value (495 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"init","67563":"metric_e2e_custom_counter_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"f":{"67563":"metric_e2e_custom_counter_k_AIXH44IB"},"c":{"65555":"c:custom/custom@none"}},"use_case_id":"custom","metric_id":65555,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"c","value":1,"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 408
  Headers: mapping_sources=cfh,metric_type=c

Key (-1 bytes):
  Value (491 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"errored","67562":"metric_e2e_custom_set_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"f":{"67562":"metric_e2e_custom_set_k_AIXH44IB"},"c":{"65556":"s:custom/error@none"}},"use_case_id":"custom","metric_id":65556,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"s","value":[3],"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 409
  Headers: mapping_sources=cfh,metric_type=s

Key (-1 bytes):
  Value (502 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"healthy","67561":"metric_e2e_custom_dist_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"65558":"d:custom/duration@second"},"f":{"67561":"metric_e2e_custom_dist_k_AIXH44IB"}},"use_case_id":"custom","metric_id":65558,"org_id":1,"timestamp":1697664813,"project_id":3,"type":"d","value":[4,5,6],"sentry_received_timestamp":1697664813.218}
  Timestamp: 1697664834437
  Partition: 0
  Offset: 410
  Headers: mapping_sources=cfh,metric_type=d
% Reached end of topic snuba-generic-metrics [0] at offset 411

Send the same batch of 3 messages with the send_metrics script

python bin/send_metrics.py --use-cases custom --rand-str AIXH44IB

Confirm that the new messages all have fetch type of CACHE_HIT ("c")

Key (-1 bytes):
  Value (485 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"errored","67562":"metric_e2e_custom_set_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"67562":"metric_e2e_custom_set_k_AIXH44IB","65556":"s:custom/error@none"}},"use_case_id":"custom","metric_id":65556,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"s","value":[3],"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 411
  Headers: mapping_sources=ch,metric_type=s

Key (-1 bytes):
  Value (496 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"healthy","67561":"metric_e2e_custom_dist_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"65558":"d:custom/duration@second","67561":"metric_e2e_custom_dist_k_AIXH44IB"}},"use_case_id":"custom","metric_id":65558,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"d","value":[4,5,6],"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 412
  Headers: mapping_sources=ch,metric_type=d

Key (-1 bytes):
  Value (489 bytes): {"tags":{"9223372036854776010":"production","9223372036854776017":"init","67563":"metric_e2e_custom_counter_v_AIXH44IB"},"version":2,"retention_days":90,"mapping_meta":{"h":{"9223372036854776010":"environment","9223372036854776017":"session.status"},"c":{"67563":"metric_e2e_custom_counter_k_AIXH44IB","65555":"c:custom/custom@none"}},"use_case_id":"custom","metric_id":65555,"org_id":1,"timestamp":1697664926,"project_id":3,"type":"c","value":1,"sentry_received_timestamp":1697664926.943}
  Timestamp: 1697664948036
  Partition: 0
  Offset: 413
  Headers: mapping_sources=ch,metric_type=c
% Reached end of topic snuba-generic-metrics [0] at offset 414

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 16, 2023
@john-z-yang john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from a32f323 to b84fc21 Compare October 16, 2023 18:16
@john-z-yang john-z-yang changed the title add namespace parameter to indexer cache feat(metrics): Add method namespacing to indexer cache Oct 16, 2023
Base automatically changed from john/update-caching-idxer-schema-read-option to master October 16, 2023 19:50
@john-z-yang john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from b84fc21 to ab89fae Compare October 16, 2023 19:51
@john-z-yang john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from ab89fae to dbc9960 Compare October 16, 2023 20:20
@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Merging #58178 (90c62ae) into master (ca804de) will increase coverage by 0.00%.
Report is 172 commits behind head on master.
The diff coverage is 72.72%.

❗ Current head 90c62ae differs from pull request most recent head 3caa6a1. Consider uploading reports for the commit 3caa6a1 to get more accurate results

@@           Coverage Diff            @@
##           master   #58178    +/-   ##
========================================
  Coverage   80.57%   80.57%            
========================================
  Files        5144     5146     +2     
  Lines      224845   225018   +173     
  Branches    37898    37923    +25     
========================================
+ Hits       181166   181310   +144     
- Misses      38143    38168    +25     
- Partials     5536     5540     +4     
Files Coverage Δ
src/sentry/sentry_metrics/indexer/cache.py 87.09% <72.72%> (-7.91%) ⬇️

... and 30 files with indirect coverage changes

@volokluev
Copy link
Member

I think some more context is needed in the PR description:

  1. What is the higher level goal of this PR?
  2. What is the blast radius of this PR? How is it going to be rolled out?
  3. Can this PR be safely reverted?

Copy link
Member

@evanh evanh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Volo, it would be smart to document the rollout/rollback plan for this. When you enable the new namespaces you will effectively be deleting all the existing keys. What are the implications of that? Rolling back will have a similar problem: the data will be split between two keys. What happens when the cache switches back and starts using the old keys?

@ayirr7
Copy link
Member

ayirr7 commented Oct 18, 2023

(Agree with the points regarding documentation above). Overall this looks good to me. From my understanding, the local cache for testing is a Dummy, but I wonder if there's a way around that for testing this locally. If it is not ultimately possible, I think the documentation points above will be quite helpful to have.

@john-z-yang
Copy link
Member Author

john-z-yang commented Oct 18, 2023

@volokluev @evanh I updated the description let me know if this make more sense!
This has also been discussed in last week's architecture meeting, I can refer you to the doc on slack if you would like more context

Copy link
Member

@evanh evanh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes more sense, thank you!

@john-z-yang john-z-yang force-pushed the john/update-caching-idxer-schema-read branch from 90c62ae to 3caa6a1 Compare October 24, 2023 21:53
@john-z-yang john-z-yang enabled auto-merge (squash) October 24, 2023 22:22
@john-z-yang john-z-yang merged commit fd736cc into master Oct 24, 2023
47 checks passed
@john-z-yang john-z-yang deleted the john/update-caching-idxer-schema-read branch October 24, 2023 22:31
john-z-yang added a commit that referenced this pull request Oct 26, 2023
)

# This PR

- Read the sentry option
`sentry-metrics.indexer.write-new-cache-namespace` (set to False
default), and if it's True, write to the new cache schema along with the
old cache schema.

**Previous change in this stack PR**
#58178
@github-actions github-actions bot locked and limited conversation to collaborators Nov 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants