Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: maintain source and sink statistics through restarts #25218

Merged
merged 11 commits into from
Feb 20, 2024

Conversation

guswynn
Copy link
Contributor

@guswynn guswynn commented Feb 14, 2024

This pr change source and sink statistics to avoid counters/gauges being reset when mz/replicas/sources/sinks are restarted. The motivation behind this is two-fold:

  • It clarifies the behavior of statistics during edge cases, which increases our confidence that the new source progress statistics we are adding make sense
  • It WILDLY simplifies the handling of statistics in the console/by users (particularly when calculating rates)

At its core, its not a huge change, but in practice, it requires a lot of adjustments and code movement to get it right. This pr has many commits, but most are straightforward (tests/code movement/small changes). The two main commits are:

_per_worker -> _raw

7d5735e (#25218)

The former adjusts the _statistics_per_worker collections to _statistics_raw. Because we now need to maintain counters and gauges through changes in the source, and replicas can change their number of workers, its no longer reasonably feasible to hold into per-worker statistics. We lose history whenever we change this schema, so I want to only do this once, which is why I added the new statistics columns, despite them not yet being hooked up. This is the pr to bikeshed the naming and schema of these columns, so we avoid churn in the future.

Note also that we rely on this schema change to be able to safely unpack statistics on envd restart.

The core changes

baa7da7 (#25218)

This commit contains the core changes. All working in concert:

  • Read and unpack the existing current values for statistics from the collection when bootstrapping envd.
  • Enforce the resetting behavior of different counters and gauges with type-level wrappers in mz_storage_client::statistics.
  • Build on storage: aggregate source/sink statistics on a single worker to avoid 2 quirks #25135 to aggregate the per-worker data in clusterd into a single update-per-source/sink to communicate to the controller (this simplifies the controller code)
  • A large large large amount of mechanical changes in support of the above 3 changes. I did my best to make it comprehensible.

Motivation

  • This PR adds a known-desirable feature.

  • This PR refactors existing code.

Tips for reviewer

see above

Checklist

@guswynn guswynn force-pushed the maintained-statistics branch from 86c4b65 to f424da1 Compare February 14, 2024 01:19
@guswynn guswynn self-assigned this Feb 14, 2024
@guswynn guswynn mentioned this pull request Feb 14, 2024
5 tasks
@guswynn guswynn force-pushed the maintained-statistics branch from f424da1 to 5b291dd Compare February 14, 2024 22:06
@guswynn guswynn marked this pull request as ready for review February 14, 2024 23:59
@guswynn guswynn requested review from a team February 14, 2024 23:59
@guswynn guswynn requested review from a team and morsapaes as code owners February 14, 2024 23:59
@guswynn guswynn requested a review from ParkMyCar February 14, 2024 23:59
Copy link

shepherdlybot bot commented Feb 15, 2024

Risk Score:83 / 100 Bug Hotspots:2 Resilience Coverage:50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability
  • (Required) QA Review 🔍 Detected
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request poses a high risk with a score of 83, influenced by factors such as the average line count and executable lines within files. It's noteworthy that historically, pull requests with these characteristics are 140% more likely to introduce bugs compared to the repository baseline. Additionally, the changes involve 2 files that have recently seen a high number of bug fixes, further contributing to the risk. The repository's observed bug trend is currently increasing, although this is not directly tied to the risk score.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:
What's This?

File Percentile
../session/vars.rs 94
../src/lib.rs 98

@nrainer-materialize
Copy link
Contributor

I triggered a nightly and coverage build; I will report the results when they are ready.

@nrainer-materialize
Copy link
Contributor

I triggered a nightly and coverage build; I will report the results when they are ready.

Nightly looks good. I will annotate coverage results inline.

src/storage-client/src/statistics.rs Outdated Show resolved Hide resolved
src/storage-client/src/statistics.rs Outdated Show resolved Hide resolved
src/storage-client/src/statistics.rs Outdated Show resolved Hide resolved
@guswynn
Copy link
Contributor Author

guswynn commented Feb 15, 2024

@nrainer-materialize some things won't be covered until the new metrics are actually hooked up; they are all just 0/NULL now!

Rest of nightlies look good!!

@guswynn
Copy link
Contributor Author

guswynn commented Feb 15, 2024

@petrosagg this is rebased on the pr that did the code movement; it also has new commits that change all the names to the ones decided on in: https://materializeinc.slack.com/archives/C0637RN7PKQ/p1707975829845539?thread_ts=1707956060.532079&cid=C0637RN7PKQ

It should be reviewable as 1 big pr, after familiarizing yourself with the main changes that are being made

@guswynn guswynn force-pushed the maintained-statistics branch 2 times, most recently from c15a7e7 to 90acd55 Compare February 15, 2024 23:27
src/storage-controller/src/lib.rs Outdated Show resolved Hide resolved
src/storage-controller/src/lib.rs Outdated Show resolved Hide resolved
/// Spawns a task that continually (at an interval) writes statistics from storaged's
/// that are consolidated in shared memory in the controller.
pub(super) fn spawn_statistics_scraper<Stats, T>(
statistics_collection_id: GlobalId,
collection_mgmt: CollectionManager<T>,
shared_stats: Arc<Mutex<BTreeMap<GlobalId, StatsInitState<Stats>>>>,
shared_stats: Arc<Mutex<BTreeMap<GlobalId, Option<Stats>>>>,
previous_values: Vec<Row>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can fail to will reconcile correctly because these previous values are used in a blind append_to_collection below which does not assert that the current upper of the shard is the upper that was observed during the snapshot of the previous values. This means you might append the wrong update because the upper moved from underneath you.

To fix this the scraper needs to be able to obtain a snapshot and request that data is inserted at a specific point in time, for example by using a compare_and_append directly. If you don't want to fix this in this PR we should open an issue to track it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill open an issue!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test/testdrive/catalog.td Outdated Show resolved Hide resolved
test/testdrive/catalog.td Outdated Show resolved Hide resolved
src/storage-client/src/statistics.rs Outdated Show resolved Hide resolved
src/storage-client/src/statistics.rs Outdated Show resolved Hide resolved
src/storage/src/statistics.rs Show resolved Hide resolved
src/storage/src/statistics.rs Outdated Show resolved Hide resolved
src/storage/src/statistics.rs Show resolved Hide resolved
@guswynn guswynn force-pushed the maintained-statistics branch from 90acd55 to 43d02e3 Compare February 16, 2024 22:07
@guswynn guswynn force-pushed the maintained-statistics branch from 43d02e3 to bc34e32 Compare February 17, 2024 00:03
Copy link
Member

@ParkMyCar ParkMyCar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The few adapter bits here LGTM

@guswynn guswynn merged commit e1e595a into MaterializeInc:main Feb 20, 2024
71 checks passed
@guswynn guswynn deleted the maintained-statistics branch February 20, 2024 17:41
@bosconi bosconi added the A-storage Area: storage label Jun 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Area: storage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants