Easily identify stream graph bottleneck #13481

Closed
kwannoel opened this issue Nov 16, 2023 · 19 comments

@kwannoel
Contributor

@BugenZhao
We may investigate if it's possible to inspect the back-pressure metrics next time so we can find which fragment is slow at a single glance. :lark-cry: Await-tree has borne too much...
#11696

@kwannoel kwannoel self-assigned this Nov 16, 2023
@github-actions github-actions bot added this to the release-1.5 milestone Nov 16, 2023
@kwannoel kwannoel assigned BugenZhao and unassigned kwannoel Nov 16, 2023
@kwannoel
Contributor Author

Oops, seems like @BugenZhao is already working on it:

I’m currently investigating the flow chart in Grafana: https://github.com/rekha86sundararajan/grafana-diagram

@BugenZhao
Member

Yeah, it's at #13422 but I'm encountering some problems..

Contributor

github-actions bot commented Feb 3, 2024

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@kwannoel
Contributor Author

I think the dashboard-based solution is kind of inadequate for clusters with a large number of stream jobs. I found the backpressure graph takes too long to load. Further, the dashboard approach is not user facing. I would like some form of metrics that the cloud side can scrape, and that we can display in our dev-dashboard.

Btw this is not to diminish the great work that @BugenZhao has done 🙌 . It is very useful for diagnosing bottlenecks in small to medium sized clusters, and it also gives an idea of how we can identify bottlenecks in the stream graph in an automated way, and potentially how we can also visualize it for users.

Feel free to chime in if there's some way we can repurpose our existing solutions.

@BugenZhao
Member

I found the backpressure graph takes too long to load.

For this specific point, I think #17510 can help.

@fuyufjh
Member

fuyufjh commented Aug 16, 2024

I would like some form of metrics that the cloud side can scrape, and that we can display in our dev-dashboard.

Bugen tried to place a DAG graph on a Grafana board, but it turned out to be hard because Grafana doesn't provide a way to dynamically generate a DAG diagram. Due to this, perhaps our kernel dashboard is the best way to achieve this goal.

the dashboard approach is not user facing.

Currently yes, but I don't think it's a major problem. As long as it works well, we can port it into the Cloud portal quickly. What we need now is to prove this to be an efficient way for users to self-diagnose.


@kwannoel
Contributor Author

kwannoel commented Aug 16, 2024

As long as it works well, we can port it into the Cloud portal quickly.

Oh, that's true. Underneath it's just a bunch of API calls to the meta service.

@kwannoel
Contributor Author

kwannoel commented Aug 16, 2024

The way forward:

  1. Use #17510 (dashboard: fetch fragment graph for each streaming job on demand) to optimize the graph.
  2. Start using it more for bigger clusters, instead of await-tree.
  3. If it works well, port it to cloud portal.

I slightly prefer visualizing it as a table (listing the bottleneck MVs and specific actors), since when the DAG is really large, it may be hard to navigate and see the bottleneck MV. But I think that could be an implementation detail. Now that we have a DAG, we can traverse it and come up with the bottleneck spots ourselves, roughly as sketched below.
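For illustration, here's a rough sketch of such a traversal (not RisingWave code; the `Edge` struct, its field names, and the 0.5 threshold are made up): a fragment whose inbound edges are heavily back-pressured while its outbound edges are not is where the pressure stops, so it becomes a candidate row in the bottleneck table.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Edge {
    up_fragment: u32,
    down_fragment: u32,
    /// Fraction of time the upstream output buffer was blocked (0.0..=1.0).
    backpressure: f64,
}

/// A fragment is a bottleneck candidate if its inbound edges are heavily
/// back-pressured while its outbound edges are not: the pressure "ends" at it.
fn bottleneck_candidates(edges: &[Edge], threshold: f64) -> Vec<(u32, f64)> {
    let mut max_in: HashMap<u32, f64> = HashMap::new();
    let mut max_out: HashMap<u32, f64> = HashMap::new();
    for e in edges {
        let inp = max_in.entry(e.down_fragment).or_insert(0.0);
        *inp = (*inp).max(e.backpressure);
        let out = max_out.entry(e.up_fragment).or_insert(0.0);
        *out = (*out).max(e.backpressure);
    }
    let mut candidates: Vec<(u32, f64)> = max_in
        .iter()
        .filter(|&(frag, &inp)| {
            inp >= threshold && max_out.get(frag).copied().unwrap_or(0.0) < threshold
        })
        .map(|(&frag, &inp)| (frag, inp))
        .collect();
    // Sort by how badly the fragment is back-pressured, worst first.
    candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    candidates
}

fn main() {
    // Hypothetical graph: 1 -> 2 -> 3, with pressure accumulating at fragment 3.
    let edges = vec![
        Edge { up_fragment: 1, down_fragment: 2, backpressure: 0.9 },
        Edge { up_fragment: 2, down_fragment: 3, backpressure: 0.95 },
    ];
    for (fragment, bp) in bottleneck_candidates(&edges, 0.5) {
        println!("fragment {fragment}: max inbound back-pressure {bp:.2}");
    }
}
```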

@kwannoel
Contributor Author

kwannoel commented Aug 19, 2024

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

@BugenZhao
Member

BugenZhao commented Aug 19, 2024

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

True. With the Prometheus approach we're able to retrieve historical statistics. That's also one of the reasons I chose Grafana to visualize the back-pressure graph in the very first attempt (#13422).

@kwannoel
Contributor Author

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

True. With the Prometheus approach we're able to retrieve historical statistics. That's also one of the reasons I chose Grafana to visualize the back-pressure graph in the very first attempt (#13422).

Actually, putting the graph in Grafana is sufficient, because we can view the graph at a specific point in time.

@kwannoel
Contributor Author

kwannoel commented Aug 20, 2024

Based on our discussions, here are my thoughts on the approach:

  1. Polish up the existing solution first, since it's what we have.
    a. Show the MV-level bottleneck graph, so users can trace which MV is causing the backpressure.
    b. Allow users to select a specific bottleneck MV and zoom into its fragment graph.
    c. Change the dashboard bottleneck graph to use Prometheus metrics, and support historical data querying. We don't need it to be on Grafana; as long as the data source supports historical data querying, that's good enough (see the sketch after this list).
  2. Port this existing solution over to the cloud portal. Users can self-serve with this feature.
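As a sketch of 1c (assumptions: a Prometheus instance reachable at localhost:9090, reqwest and tokio as dependencies; the PromQL expression is the one the dev-dashboard already builds, minus the label selector): historical back-pressure can be pulled through Prometheus' standard /api/v1/query_range endpoint, so the graph can be rendered for any past point in time, e.g. during a night-time incident.

```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same expression the dev-dashboard uses today, expressed as a range query.
    let query = "avg(rate(stream_actor_output_buffer_blocking_duration_ns[60s])) \
                 by (fragment_id, downstream_fragment_id) / 1000000000";

    let now = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)?
        .as_secs();
    let start = now - 6 * 3600; // look back 6 hours

    let resp = reqwest::Client::new()
        .get("http://localhost:9090/api/v1/query_range")
        .query(&[
            ("query", query.to_string()),
            ("start", start.to_string()),
            ("end", now.to_string()),
            ("step", "60".to_string()),
        ])
        .send()
        .await?
        .text()
        .await?;

    // The JSON response contains one time series per
    // (fragment_id, downstream_fragment_id) pair; the dashboard could render
    // these as the back-pressure graph at any selected point in time.
    println!("{resp}");
    Ok(())
}
```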

@kwannoel kwannoel assigned kwannoel and unassigned BugenZhao Aug 20, 2024
@fuyufjh
Member

fuyufjh commented Aug 21, 2024

An idea: shall we show the current epoch for each actor in the DAG? In particular, we may highlight the actors that are at the lowest epoch.

This is a useful technique when we use await-tree to find bottlenecks. Besides, it doesn't need Prometheus because the actor's epoch can be found in task_local or LocalBarrierManager.
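To illustrate the idea (a hypothetical sketch only, not the actual LocalBarrierManager interface): keep the latest barrier epoch each actor has collected, then highlight the actors stuck at the minimum epoch as bottleneck suspects.

```rust
use std::collections::HashMap;

type ActorId = u32;
type Epoch = u64;

#[derive(Default)]
struct EpochTracker {
    latest: HashMap<ActorId, Epoch>,
}

impl EpochTracker {
    /// Called whenever an actor finishes processing a barrier.
    fn on_barrier(&mut self, actor: ActorId, epoch: Epoch) {
        self.latest.insert(actor, epoch);
    }

    /// Actors at the lowest epoch are the ones holding everyone else back.
    fn slowest_actors(&self) -> Vec<ActorId> {
        let Some(&min_epoch) = self.latest.values().min() else {
            return vec![];
        };
        self.latest
            .iter()
            .filter_map(|(&actor, &epoch)| (epoch == min_epoch).then_some(actor))
            .collect()
    }
}

fn main() {
    let mut tracker = EpochTracker::default();
    tracker.on_barrier(1, 105);
    tracker.on_barrier(2, 105);
    tracker.on_barrier(3, 101); // lagging actor
    println!("highlight in the DAG: {:?}", tracker.slowest_actors()); // [3]
}
```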

@kwannoel
Contributor Author

An idea: shall we show the current epoch for each actor? In particular, we may highlight the actors that are at the lowest epoch.

This is a useful technique when we use await-tree to find bottlenecks. Besides, it doesn't need Prometheus because the actor's epoch can be found in task_local or LocalBarrierManager.

There can be a case where upstream and downstream actors are in the same epoch, but the data between them is backpressured due to the downstream actor being a bottleneck.

In such a case, it seems only the info about actor blocking, or await-tree, can tell us what the bottleneck is.

@fuyufjh
Member

fuyufjh commented Aug 21, 2024

There can be a case where upstream and downstream actors are in the same epoch, but the data between them is backpressured due to the downstream actor being a bottleneck.

In such a case, it seems only the info about actor blocking, or await-tree, can tell us what the bottleneck is.

Yeah, I guess both of these pieces of information lead to the same conclusion: which actor/fragment/MV is the bottleneck. Basically, there will be a subgraph in the entire graph that has low epoch numbers and high backpressure.

@fuyufjh
Member

fuyufjh commented Aug 21, 2024

Another thing worth mentioning: sometimes the backpressure metrics can't directly point out the real cause of the problem.

For example, a 1000x join amplification may happen at actor A,

 X ---\
       A ==> B ===> [C]
 Y ---/

If B is simple, the 1000x traffic may not cause any issue there. It might be a further downstream actor C, which has some complicated window function logic, that finally becomes the bottleneck.

To address the problem correctly, I think throughput is another important metric. We had better show both of these in the visualization graph.
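To make that concrete, here's a toy sketch with made-up numbers and struct names (these are not RisingWave metrics): back-pressure alone points at C as the place where pressure accumulates, while the amplification ratio (rows out / rows in) points at A as the root cause.

```rust
struct ActorStats {
    name: &'static str,
    rows_in_per_sec: f64,
    rows_out_per_sec: f64,
    /// Fraction of time the actor's output buffer is blocked (0.0..=1.0).
    output_backpressure: f64,
}

fn main() {
    // Hypothetical steady-state numbers for the X/Y -> A => B => [C] graph above.
    let actors = [
        ActorStats { name: "A (join)", rows_in_per_sec: 50.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.85 },
        ActorStats { name: "B (project)", rows_in_per_sec: 50_000.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.90 },
        ActorStats { name: "C (window)", rows_in_per_sec: 50_000.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.00 },
    ];

    for a in &actors {
        let amplification = a.rows_out_per_sec / a.rows_in_per_sec;
        println!(
            "{:<12} amplification {:>8.1}x, output back-pressure {:>3.0}%",
            a.name,
            amplification,
            a.output_backpressure * 100.0
        );
    }
    // Every edge upstream of C shows high back-pressure, so the graph points at
    // C as the sink of the pressure; A's 1000x amplification explains why C
    // can't keep up.
}
```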

@kwannoel
Contributor Author

kwannoel commented Aug 23, 2024

If the duration for which the actor blocks is longer than the interval at which we poll the meta service, it will currently show up as no backpressure in the dashboard. That's because, since the actor is still blocked when we poll, there's no change in the blocking-duration metric. Only after the actor has sent the chunk downstream does it increment the blocking-duration metric.

See the dashboard query:

    format!("avg(rate(stream_actor_output_buffer_blocking_duration_ns{{{}}}[60s])) by (fragment_id, downstream_fragment_id) / 1000000000", srv.prometheus_selector);

and the metric update in the dispatcher:

    let start_time = Instant::now();
    dispatcher.dispatch_data(chunk.clone()).await?;
    // The counter only advances once dispatch_data returns, i.e. after the
    // chunk has been accepted downstream.
    dispatcher
        .actor_output_buffer_blocking_duration_ns
        .inc_by(start_time.elapsed().as_nanos() as u64);

When the stream graph is congested, we often see in the await tree that an actor can be blocked for several minutes, so this issue can definitely occur. #18215 is a fall-back solution and needs further investigation.

#18219 can mitigate it, as long as the blocking duration is not too long.
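For illustration, one possible direction (a rough sketch only; not necessarily what #18215 or #18219 actually implement, and assuming tokio with the full feature set): periodically flush the elapsed blocking time into the counter while the dispatch future is still pending, so `rate(...[60s])` keeps moving during a long block.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Stand-in for the real `actor_output_buffer_blocking_duration_ns` counter.
static BLOCKING_DURATION_NS: AtomicU64 = AtomicU64::new(0);

async fn dispatch_with_periodic_flush<F>(dispatch: F)
where
    F: std::future::Future<Output = ()>,
{
    tokio::pin!(dispatch);
    let mut last_flush = Instant::now();
    loop {
        tokio::select! {
            _ = &mut dispatch => {
                // Dispatch finished: account for the tail since the last flush.
                BLOCKING_DURATION_NS
                    .fetch_add(last_flush.elapsed().as_nanos() as u64, Ordering::Relaxed);
                return;
            }
            _ = tokio::time::sleep(Duration::from_secs(10)) => {
                // Still blocked: flush what has accumulated so far, so the
                // metric moves even though the chunk hasn't been sent yet.
                BLOCKING_DURATION_NS
                    .fetch_add(last_flush.elapsed().as_nanos() as u64, Ordering::Relaxed);
                last_flush = Instant::now();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Simulate a dispatch that stays blocked for ~25 seconds; the counter now
    // advances every 10 seconds instead of only once at the very end.
    dispatch_with_periodic_flush(tokio::time::sleep(Duration::from_secs(25))).await;
    println!("{} ns", BLOCKING_DURATION_NS.load(Ordering::Relaxed));
}
```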

@kwannoel
Contributor Author

kwannoel commented Oct 8, 2024

The remaining task is #18215. We can close this issue, as the dashboard work is done.
