-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: maintain source and sink statistics through restarts #25218
Conversation
86c4b65
to
f424da1
Compare
f424da1
to
5b291dd
Compare
MitigationsCompleting required mitigations increases Resilience Coverage.
Risk Summary:The pull request poses a high risk with a score of 83, influenced by factors such as the average line count and executable lines within files. It's noteworthy that historically, pull requests with these characteristics are 140% more likely to introduce bugs compared to the repository baseline. Additionally, the changes involve 2 files that have recently seen a high number of bug fixes, further contributing to the risk. The repository's observed bug trend is currently increasing, although this is not directly tied to the risk score. Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity. Bug Hotspots:
|
@nrainer-materialize some things won't be covered until the new metrics are actually hooked up; they are all just 0/NULL now! Rest of nightlies look good!! |
5b291dd
to
08b07d4
Compare
@petrosagg this is rebased on the pr that did the code movement; it also has new commits that change all the names to the ones decided on in: https://materializeinc.slack.com/archives/C0637RN7PKQ/p1707975829845539?thread_ts=1707956060.532079&cid=C0637RN7PKQ It should be reviewable as 1 big pr, after familiarizing yourself with the main changes that are being made |
c15a7e7
to
90acd55
Compare
/// Spawns a task that continually (at an interval) writes statistics from storaged's | ||
/// that are consolidated in shared memory in the controller. | ||
pub(super) fn spawn_statistics_scraper<Stats, T>( | ||
statistics_collection_id: GlobalId, | ||
collection_mgmt: CollectionManager<T>, | ||
shared_stats: Arc<Mutex<BTreeMap<GlobalId, StatsInitState<Stats>>>>, | ||
shared_stats: Arc<Mutex<BTreeMap<GlobalId, Option<Stats>>>>, | ||
previous_values: Vec<Row>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can fail to will reconcile correctly because these previous values are used in a blind append_to_collection
below which does not assert that the current upper of the shard is the upper that was observed during the snapshot of the previous values. This means you might append the wrong update because the upper moved from underneath you.
To fix this the scraper needs to be able to obtain a snapshot and request that data is inserted at a specific point in time, for example by using a compare_and_append
directly. If you don't want to fix this in this PR we should open an issue to track it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ill open an issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broken out here: https://github.com/MaterializeInc/materialize/issues/25349 !
… desc, and clarify the differences between the columns
90acd55
to
43d02e3
Compare
43d02e3
to
bc34e32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The few adapter bits here LGTM
This pr change source and sink statistics to avoid counters/gauges being reset when mz/replicas/sources/sinks are restarted. The motivation behind this is two-fold:
At its core, its not a huge change, but in practice, it requires a lot of adjustments and code movement to get it right. This pr has many commits, but most are straightforward (tests/code movement/small changes). The two main commits are:
_per_worker
->_raw
7d5735e
(#25218)The former adjusts the
_statistics_per_worker
collections to_statistics_raw
. Because we now need to maintain counters and gauges through changes in the source, and replicas can change their number of workers, its no longer reasonably feasible to hold into per-worker statistics. We lose history whenever we change this schema, so I want to only do this once, which is why I added the new statistics columns, despite them not yet being hooked up. This is the pr to bikeshed the naming and schema of these columns, so we avoid churn in the future.Note also that we rely on this schema change to be able to safely unpack statistics on envd restart.
The core changes
baa7da7
(#25218)This commit contains the core changes. All working in concert:
mz_storage_client::statistics
.Motivation
This PR adds a known-desirable feature.
This PR refactors existing code.
Tips for reviewer
see above
Checklist
https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20240108_source_metrics_2.md
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.