feat(dashboard): visualize average backpressure rather than spot backpressure #18219

Merged on Aug 24, 2024 (4 commits)


@kwannoel kwannoel commented Aug 23, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

See the demo: https://www.notion.so/risingwave-labs/Backpressure-Graph-improvements-a11b0c5c74d54202922acb071aca72a0?pvs=4

Currently, the dashboard calculates backpressure as follows:

  1. Poll the actor_buffer_output_blocking_ns metric from meta / prometheus at a fixed interval (5s).
  2. After each poll, take the difference between the previous and current blocking duration.
  3. Compute the backpressure rate as:
delta_blocking_duration / fixed_duration

This has two problems:

  1. If an actor is blocked for a few minutes, the backpressure shows up on the graph for only one 5s interval and then drops back to 0 in the next one, because actor_buffer_output_blocking_ns is ONLY incremented after the chunk has been yielded downstream.
  2. As a result, the backpressure is very easy to miss, and the graph does not reflect the stuck state of the actor. This gets worse when the streaming job contains a large number of fragments.
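
To illustrate the problem, here is a minimal Python sketch of the spot calculation described above. It is not the dashboard's actual code: the function name `spot_backpressure` and the sample counter readings are hypothetical, chosen to model an actor that is blocked for 20s and whose counter is only bumped once the chunk is finally yielded.

```python
def spot_backpressure(samples_ns, interval_ns):
    """Per-interval rate: delta_blocking_duration / fixed_duration.

    samples_ns holds successive readings of a cumulative blocking-duration
    counter, taken once per polling interval.
    """
    rates = []
    for prev, curr in zip(samples_ns, samples_ns[1:]):
        rates.append((curr - prev) / interval_ns)
    return rates


INTERVAL_NS = 5 * 1_000_000_000  # 5s polling interval, as in the description

# Hypothetical readings at t = 0s, 5s, ..., 30s. The actor is blocked the
# whole time up to t = 25s, but the counter only jumps (by 20s) once the
# chunk is yielded downstream.
samples = [0, 0, 0, 0, 0, 20_000_000_000, 20_000_000_000]

print(spot_backpressure(samples, INTERVAL_NS))
# [0.0, 0.0, 0.0, 0.0, 4.0, 0.0]
```

In this sketch the whole blocking duration lands in a single interval and the graph reads 0 everywhere else, which is why the stuck state is easy to miss.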

Computing the average backpressure (BP) instead solves this problem. Credit to @fuyufjh.

Further PRs:
see: #18176

Other changes

I changed the suggested steps for running the dashboard in the README so that a live workload is actually running. Previously, the steps were just a series of DDL statements with no data flowing through.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@kwannoel kwannoel changed the title from "feat(dashboard): spill blocking duration if there's excess" to "feat(dashboard): spill blocking duration if there's excess when calculating backpressure" Aug 23, 2024
@fuyufjh fuyufjh left a comment

The calculation is hard to explain and to understand. For a monitoring component, I'd like to keep it straightforward.

I am thinking about a simpler way to address the issue. Instead of using the diff (delta_blocking_duration), how about just using current value - first value, i.e. the blocking duration accumulated since the web page was opened? In other words, the web page shows the average backpressure over 0-5s, 0-10s, 0-15s, ... as the user stays on the page. When severe backpressure is happening, the longer the user stays, the more accurate the average becomes.

Your approach is also acceptable to me, but it's a bit hard to understand. In particular, the duration spilled here comes from a collected barrier, i.e. it happened in the past, but it is spilled into the future.

Approved. Please pick the one you like the most. :)
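
The cumulative average suggested in this comment can be sketched as follows. This is a Python illustration under stated assumptions, not the merged dashboard code: the function name `average_backpressure` and the counter readings are hypothetical, and elapsed time is approximated as the number of polls times the 5s interval.

```python
def average_backpressure(samples_ns, interval_ns):
    """Average rate since the first sample: (current - first) / elapsed.

    samples_ns holds successive readings of a cumulative blocking-duration
    counter; the first reading is taken when the page is opened.
    """
    first = samples_ns[0]
    rates = []
    for i, curr in enumerate(samples_ns[1:], start=1):
        elapsed_ns = i * interval_ns  # 0-5s, 0-10s, 0-15s, ...
        rates.append((curr - first) / elapsed_ns)
    return rates


INTERVAL_NS = 5 * 1_000_000_000  # 5s polling interval

# Hypothetical readings at t = 0s, 5s, ..., 30s: the counter jumps by 20s
# only once, when the blocked chunk is finally yielded downstream.
samples = [0, 0, 0, 0, 0, 20_000_000_000, 20_000_000_000]

rates = average_backpressure(samples, INTERVAL_NS)
print([round(r, 2) for r in rates])
# [0.0, 0.0, 0.0, 0.0, 0.8, 0.67]
```

Unlike a per-interval diff, the value in this sketch stays well above zero after the counter jumps, so the blocked actor remains visible on later polls.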

@kwannoel

Average BP makes sense to me. That's a very good solution, because of this:

When severe backpressure is happening, the longer the user stays, the more accurate the average becomes.

Let me try it.

@kwannoel kwannoel force-pushed the kwannoel/bottleneck branch from 715ef63 to 979219f Compare August 23, 2024 15:55
@kwannoel
kwannoel commented Aug 23, 2024

[Image: Screenshot 2024-08-23 at 11 56 24 PM]

The graph stays like this for a prolonged period, which makes the bottleneck of the cluster easy to identify. This approach works well.

@kwannoel kwannoel changed the title from "feat(dashboard): spill blocking duration if there's excess when calculating backpressure" to "feat(dashboard): visualize average backpressure rather than spot backpressure" Aug 23, 2024
@kwannoel kwannoel added this pull request to the merge queue Aug 24, 2024
Merged via the queue into main with commit 7009743 Aug 24, 2024
34 of 35 checks passed
@kwannoel kwannoel deleted the kwannoel/bottleneck branch August 24, 2024 08:48
github-merge-queue bot pushed a commit that referenced this pull request Aug 27, 2024
BugenZhao added a commit that referenced this pull request Aug 28, 2024
Signed-off-by: Bugen Zhao <[email protected]>