Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(metrics): remove actor_id label from back-pressure metrics based on metrics level #18213

Merged
merged 8 commits into from
Aug 29, 2024

Conversation

BugenZhao
Copy link
Member

Signed-off-by: Bugen Zhao [email protected]I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

See background in #18108. Resolve #18062.

Following the idea described in #18108 (comment), i.e., split avg into sum and count, this PR reduces the cardinality of the back-pressure metrics stream_actor_output_buffer_blocking_duration_ns by masking the actor_id label if the metrics level is less verbose than Debug. This is achieved by utilizing RelabeledMetricVec.

Tested locally and the results are the same.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Comment on lines 308 to 314
let dispatcher_count = register_guarded_int_gauge_vec_with_registry!(
"stream_dispatcher_count",
"Total number of dispatchers",
&["fragment_id", "downstream_fragment_id"],
registry
)
.unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is exactly equal to the parallelism (i.e. number of actors) of this fragment, no need to introduce a new metric.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right as we currently don't support multiple-edges between two fragments. But is there existing metrics for "parallelism"? 👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is there existing metrics for "parallelism"? 👀

We can retrieve this from metrics actor_info of which labels are "actor_id", "fragment_id", "compute_node", but is this also an actor-level metrics with high cardinality? 🥵🥵

Copy link
Member

@fuyufjh fuyufjh Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Let's introduce a new metrics for "parallelism (i.e. number of actors) of this fragment"

Meanwhile, for actor_info, I recommend that we can keep it as is now and focus on other actor-level metrics.

Signed-off-by: Bugen Zhao <[email protected]>
@BugenZhao BugenZhao requested a review from kwannoel August 28, 2024 09:43
@kwannoel
Copy link
Contributor

Can we verify that the amount of metrics ingested by prometheus is in fact less than before, after this PR?

Signed-off-by: Bugen Zhao <[email protected]>
@BugenZhao
Copy link
Member Author

Can we verify that the amount of metrics ingested by prometheus is in fact less than before, after this PR?

Sure. This is what the metrics look like when metrics_level is set to debug:

# HELP stream_actor_output_buffer_blocking_duration_ns Total blocking duration (ns) of output buffer
# TYPE stream_actor_output_buffer_blocking_duration_ns counter
stream_actor_output_buffer_blocking_duration_ns{actor_id="10",downstream_fragment_id="1",fragment_id="2"} 192042
stream_actor_output_buffer_blocking_duration_ns{actor_id="11",downstream_fragment_id="1",fragment_id="2"} 2252220657
stream_actor_output_buffer_blocking_duration_ns{actor_id="12",downstream_fragment_id="1",fragment_id="2"} 1167584
stream_actor_output_buffer_blocking_duration_ns{actor_id="17",downstream_fragment_id="2",fragment_id="3"} 5269707750
stream_actor_output_buffer_blocking_duration_ns{actor_id="18",downstream_fragment_id="2",fragment_id="3"} 5389916
stream_actor_output_buffer_blocking_duration_ns{actor_id="19",downstream_fragment_id="2",fragment_id="3"} 5196666
stream_actor_output_buffer_blocking_duration_ns{actor_id="20",downstream_fragment_id="2",fragment_id="3"} 4986791
stream_actor_output_buffer_blocking_duration_ns{actor_id="25",downstream_fragment_id="2",fragment_id="5"} 4126665207
stream_actor_output_buffer_blocking_duration_ns{actor_id="26",downstream_fragment_id="2",fragment_id="5"} 4813958
stream_actor_output_buffer_blocking_duration_ns{actor_id="27",downstream_fragment_id="2",fragment_id="5"} 4993415
stream_actor_output_buffer_blocking_duration_ns{actor_id="28",downstream_fragment_id="2",fragment_id="5"} 5631874
stream_actor_output_buffer_blocking_duration_ns{actor_id="33",downstream_fragment_id="3",fragment_id="4"} 968528994
stream_actor_output_buffer_blocking_duration_ns{actor_id="33",downstream_fragment_id="5",fragment_id="4"} 2211583543
stream_actor_output_buffer_blocking_duration_ns{actor_id="34",downstream_fragment_id="3",fragment_id="4"} 516835
stream_actor_output_buffer_blocking_duration_ns{actor_id="34",downstream_fragment_id="5",fragment_id="4"} 2560459
stream_actor_output_buffer_blocking_duration_ns{actor_id="35",downstream_fragment_id="3",fragment_id="4"} 667207
stream_actor_output_buffer_blocking_duration_ns{actor_id="35",downstream_fragment_id="5",fragment_id="4"} 602543
stream_actor_output_buffer_blocking_duration_ns{actor_id="36",downstream_fragment_id="3",fragment_id="4"} 553044
stream_actor_output_buffer_blocking_duration_ns{actor_id="36",downstream_fragment_id="5",fragment_id="4"} 2605332
stream_actor_output_buffer_blocking_duration_ns{actor_id="9",downstream_fragment_id="1",fragment_id="2"} 2575042

While the following is under info level, which is the default in production.

# HELP stream_actor_output_buffer_blocking_duration_ns Total blocking duration (ns) of output buffer
# TYPE stream_actor_output_buffer_blocking_duration_ns counter
stream_actor_output_buffer_blocking_duration_ns{actor_id="",downstream_fragment_id="1",fragment_id="2"} 281855115997
stream_actor_output_buffer_blocking_duration_ns{actor_id="",downstream_fragment_id="2",fragment_id="3"} 983293
stream_actor_output_buffer_blocking_duration_ns{actor_id="",downstream_fragment_id="2",fragment_id="5"} 649668
stream_actor_output_buffer_blocking_duration_ns{actor_id="",downstream_fragment_id="3",fragment_id="4"} 321543
stream_actor_output_buffer_blocking_duration_ns{actor_id="",downstream_fragment_id="5",fragment_id="4"} 3485792

Copy link
Contributor

@kwannoel kwannoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Signed-off-by: Bugen Zhao <[email protected]>
@BugenZhao BugenZhao added this pull request to the merge queue Aug 29, 2024
Merged via the queue into main with commit edb1493 Aug 29, 2024
31 of 33 checks passed
@BugenZhao BugenZhao deleted the bz/remove-actor-label-in-backpressure-metrics branch August 29, 2024 08:51
BugenZhao added a commit that referenced this pull request Aug 30, 2024
github-merge-queue bot pushed a commit that referenced this pull request Sep 2, 2024
…ure metrics based on metrics level (#18213) (#18330)

Signed-off-by: Bugen Zhao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fragment level metrics for actor backpressure (actor input/output blocking time ratio)
3 participants