Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dashboard): add back_pressure_rate UI #14863

Merged
merged 17 commits into from
Feb 26, 2024
Merged

Conversation

yufansong
Copy link
Member

@yufansong yufansong commented Jan 30, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

In #13830, we plan to add a UI page to show page pressure rate without using prometheus. We already add the API in meta to get the current back pressure. In this PR

  1. Add the calculation function to calculatethe bp rate from bp. The interval is 20s. This is a rough calculation. Only call API for every 20 seconds and then divided the 20s when calculate the rate.
  2. Integrate the rate into the Fragement Graph. If select the current, will shift from prometheus data into this API.
  3. What we get from meta node is actor-id, fragement id, next fragement id back pressure. We convert the back pressure into back pressure rate, and also aggregate by fragement id, next fragement id on actor id. Then apply the value in streaming graph.
  4. The reason why not impelement p50, p99 for it: we need to storage and maintain a structure in frontend for back pressure data. It will occupied memory in meta node. If data become large, it will be bad.
image

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@fuyufjh
Copy link
Member

fuyufjh commented Jan 30, 2024

Hmmm, for the UI part, this is duplicated with #14446 (which was a refactor on #11696).

#14446 looks more user-friendly. Can we adapt to that? For example, add a GUI switch or url query paramter to allow it to fetch data from Meta instead of Promethues.

@yufansong yufansong force-pushed the yufan/dashboard-frontend branch from 73d9ad4 to 2772601 Compare January 31, 2024 04:47
@yufansong yufansong force-pushed the yufan/dashboard-frontend branch from b6e6f83 to 9ae89d7 Compare January 31, 2024 05:11
@yufansong
Copy link
Member Author

Hmmm, for the UI part, this is duplicated with #14446 (which was a refactor on #11696).

#14446 looks more user-friendly. Can we adapt to that? For example, add a GUI switch or url query paramter to allow it to fetch data from Meta instead of Promethues.

I add a GUI switch for it. If selecct current, will change to the data from this new API.

}
doFetch()
return () => { }
}, [toast])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second argument of useEffect - dependencies, tells React to re-run the function once any value of it is changed. Here you pass a toast? Seems to be a misunderstanding

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I follow the code style in the codebase.

toast(e, "warning")

I know it will trigger re-run. But in current logic, no parameters suitable for here. So I just put the toast.

Copy link
Member Author

@yufansong yufansong Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, do some more investigation. Maybe this is the problem why there is a mismatch on rate between my impelmentation and the previous impelementation. Will remove the toast. This will result multiple thread keep call and calculate the back pressure. Then the interval of the call changed. The final rate become inaccurate.

Copy link
Member

@BugenZhao BugenZhao Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can run npm run lint and the linter will help to fix that. Generally all outside variables captured by the closure should be included.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/risingwavelabs/risingwave/compare/eric/no_toast

I tried removing it from dependencies and everything seemed to work. cc. @BugenZhao Can you also take a try?

Copy link
Member Author

@yufansong yufansong Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already tried to remove toast here, everything still seems to works.

You can run npm run lint and the linter will help to fix that. Generally all outside variables captured by the closure should be included.

I tried npm run lint, and it will remind me to add the dependencies previousBP. I applied the suggestion, but it will cause the function do infinite quickly re-trigger which is not we want here.
So in current logic, the previousBP are not suitable to put into dependencies.

NVM, find a way to refactor it.

@@ -168,8 +169,8 @@ function buildFragmentDependencyAsEdges(

const SIDEBAR_WIDTH = 200

type BackPressureAlgo = "p50" | "p90" | "p95" | "p99"
const backPressureAlgos: BackPressureAlgo[] = ["p50", "p90", "p95", "p99"]
type BackPressureAlgo = "p50" | "p90" | "p95" | "p99" | "current"
Copy link
Member

@fuyufjh fuyufjh Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend to make current the default option. Previously we use Disabled by default because Prometheus is not always available, but now current don't depend on it.

Copy link
Member

@fuyufjh fuyufjh Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems too obscure to users that p50/90/99 is from Prometheus while current is from Meta.

A better way is probably making two ListBox: 1) Data source: Embedded or Prometheus 2) Only shows for Prometheus: average/p50/p99

@fuyufjh
Copy link
Member

fuyufjh commented Feb 1, 2024

I tried locally. The result of Prometheus and the embedded metrics were so different. (I have waited for 1 min before capturing the screenshot). Does it really work? 🤔

current:

image

p50:

image

@yufansong
Copy link
Member Author

yufansong commented Feb 1, 2024

I tried locally. The result of Prometheus and the embedded metrics were so different. (I have waited for 1 min before capturing the screenshot). Does it really work? 🤔

Let me try some large data test locally. Only try some small data before.

dashboard/pages/fragment_graph.tsx Show resolved Hide resolved
dashboard/mock-server.js Outdated Show resolved Hide resolved
dashboard/pages/api/metric.ts Outdated Show resolved Hide resolved
@yufansong
Copy link
Member Author

yufansong commented Feb 2, 2024

@fuyufjh @BugenZhao
Do the following changes:

  1. Change the interval to 5s.
  2. Remove useless code.
  3. Change the UI, add a new list box. Default is Embedded. When selected, on option will show on Back Pressure. When select Prometheus, will show disable | pxx. The reason why keep disable is to remind user select pxx. Otherwise at the beginning, the Ui will default show xx selected but value actually is NULL
image
  1. Refactor the code logic to solve the closure dependencies problem. Already pass lint check.
  2. Fix the BP rate calculation bug. Want to check one thing, in the previous BP API, we average the rate across the actor id. Should we use the AVG or SUM on actor id? Curerntly I do the calculation the same with the AVG logic.
    format!("avg(rate(stream_actor_output_buffer_blocking_duration_ns{{{}}}[60s])) by (fragment_id, downstream_fragment_id) / 1000000000", srv.prometheus_selector);
  3. Do some test between new api and old api. Also welcome more extra checks. The first page show the accurate value from the new API  {'2_1' => 0.0000726248, '3_1' => 0.0000513166} which means fragement 2 -> fragement 1 = 0.0000726248 and fragement 3 -> fragement 1 = 0.0000513166 .
image

We should not compared with the pxx because these value already did some calculation. I directly check the prometheus, their value has some gap, but I think the majority is similiar and can be accepted :

image image

dashboard/pages/fragment_graph.tsx Outdated Show resolved Hide resolved
dashboard/pages/fragment_graph.tsx Outdated Show resolved Hide resolved
dashboard/pages/fragment_graph.tsx Outdated Show resolved Hide resolved
dashboard/pages/fragment_graph.tsx Outdated Show resolved Hide resolved
dashboard/pages/fragment_graph.tsx Outdated Show resolved Hide resolved
@fuyufjh
Copy link
Member

fuyufjh commented Feb 21, 2024

5. Fix the BP rate calculation bug. Want to check one thing, in the previous BP API, we average the rate across the actor id. Should we use the AVG or SUM on actor id? Curerntly I do the calculation the same with the AVG logic.

Are you asking that whether we should show back-pressure on actor-level or fragment-level granularity? I think actor-level information is useless and too verbose in most cases. The only cases I can think of is inbalance i.e. data skew, this happens sometimes, so it would be nice if we allow users to click to dive into actor-level information in the dashabord.

Copy link

gitguardian bot commented Feb 21, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9425213 Triggered Generic Password fbc0e59 ci/scripts/regress-test.sh View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!

@yufansong
Copy link
Member Author

yufansong commented Feb 21, 2024

  1. Fix the BP rate calculation bug. Want to check one thing, in the previous BP API, we average the rate across the actor id. Should we use the AVG or SUM on actor id? Curerntly I do the calculation the same with the AVG logic.

Are you asking that whether we should show back-pressure on actor-level or fragment-level granularity? I think actor-level information is useless and too verbose in most cases. The only cases I can think of is inbalance i.e. data skew, this happens sometimes, so it would be nice if we allow users to click to dive into actor-level information in the dashabord.

Oh, I mean this, in old api, it directly use the following code to get metrics from prometheus:

format!("avg(rate(stream_actor_output_buffer_blocking_duration_ns{{{}}}[60s])) by (fragment_id, downstream_fragment_id) / 1000000000", srv.prometheus_selector);

My understand is it will do average on actor id, and group by fragment_id, downstream_fragment_id. I implememnt the new api following this old logic. Just want to confirm Average logic is what we want.

dashboard/pages/api/metric.ts Outdated Show resolved Hide resolved
dashboard/pages/api/metric.ts Outdated Show resolved Hide resolved
dashboard/pages/api/metric.ts Show resolved Hide resolved
@yufansong
Copy link
Member Author

Fix the bug. In previous time, the back pressure rate calculation is correct. But when drawing graph in frontend, there is a bug, forget to convert the back pressure rate into the value used for drawing. Here is the example of embeded and prometheus under the large workload:

image
image

@yufansong
Copy link
Member Author

yufansong commented Feb 23, 2024

BTW, I find some compile problem when marge the latest main branch. I find the problem also exist in the latest main branch which not include my PR. So it should not introduced in this PR. Record it in #15219 , I guess it should related to the recent merged PRs in past 1 month.

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the rest

dashboard/pages/api/metric.ts Outdated Show resolved Hide resolved

// Get back pressure from meta node -> compute node
export async function getBackPressureWithoutPrometheus() {
const response = await api.get("/metrics/back_pressures")
Copy link
Member

@fuyufjh fuyufjh Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we move it to another path such as "embedded_metrics" or sth? It's mixed with the metrics fetched from Prometheus, which looks confusing to me.

.route("/metrics/cluster", get(prometheus::list_prometheus_cluster))
.route(
"/metrics/actor/back_pressures",
get(prometheus::list_prometheus_actor_back_pressure),
)
.route("/metrics/back_pressures", get(get_back_pressure))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, /metrics/actor/back_pressures actually output fragment-level metrics instead of actor-level... 😂 Legacy issues

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will solve the problem in a seperate PR

@yufansong yufansong added this pull request to the merge queue Feb 26, 2024
Merged via the queue into main with commit 5f185d9 Feb 26, 2024
7 of 9 checks passed
@yufansong yufansong deleted the yufan/dashboard-frontend branch February 26, 2024 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants