Easily identify stream graph bottleneck #13481

Closed
kwannoel opened this issue Nov 16, 2023 · 19 comments

@kwannoel
Contributor

@BugenZhao
We may investigate if it's possible to inspect the back-pressure metrics next time so we can find which fragment is slow at a single glance. :lark-cry: Await-tree has borne too much...
#11696

@kwannoel kwannoel self-assigned this Nov 16, 2023
@github-actions github-actions bot added this to the release-1.5 milestone Nov 16, 2023
@kwannoel kwannoel assigned BugenZhao and unassigned kwannoel Nov 16, 2023
@kwannoel
Contributor Author

Oops, seems like @BugenZhao is already working on it:

I’m currently investigating the flow chart in Grafana: https://github.com/rekha86sundararajan/grafana-diagram

@BugenZhao
Member

Yeah, it's at #13422 but I'm encountering some problems..

Contributor

github-actions bot commented Feb 3, 2024

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

@kwannoel
Contributor Author

I think the dashboard-based solution is kind of inadequate for clusters with a large number of stream jobs. I found the backpressure graph takes too long to load. Further, the dashboard approach is not user facing. I would like some form of metrics that the cloud side can scrape, and that we can display in our dev-dashboard.

Btw this is not to diminish the great work that @BugenZhao has done 🙌 . It is very useful for diagnosing bottlenecks in small to medium sized clusters, and it also gives an idea of how we can identify bottlenecks in the stream graph in an automated way, and potentially how we can also visualize it for users.

Feel free to chime in if there's some way we can repurpose our existing solutions.

@BugenZhao
Member

I found the backpressure graph takes too long to load.

For this specific point, I think #17510 can help.

@fuyufjh
Member

fuyufjh commented Aug 16, 2024

I would like some form of metrics that the cloud side can scrape, and that we can display in our dev-dashboard.

Bugen tried to place a DAG graph on a Grafana board, but it turned out to be hard because Grafana doesn't provide a way to dynamically generate a DAG diagram. Due to this, perhaps our kernel dashboard is the best way to achieve this goal.

the dashboard approach is not user facing.

Currently yes, but I don't think it's a major problem. As long as it works well, we can port it into the Cloud portal quickly. What we need now is to prove this to be an efficient way for users to self-diagnose.


@kwannoel
Contributor Author

kwannoel commented Aug 16, 2024

As long as it works well, we can port it into the Cloud portal quickly.

Oh, that's true. Underneath it's just a bunch of API calls to the meta service.

@kwannoel
Contributor Author

kwannoel commented Aug 16, 2024

The way forward:

  1. Use #17510 (dashboard: fetch fragment graph for each streaming job on demand) to optimize the graph.
  2. Start using it more for bigger clusters, instead of await-tree.
  3. If it works well, port it to cloud portal.

I slightly prefer visualizing it as a table (listing the bottleneck MVs and specific actors), since when the DAG is really large, it may be hard to navigate and see the bottleneck MV. But I think that could be an implementation detail. Now that we have a DAG, we can traverse it and come up with the bottleneck spots ourselves, roughly as sketched below.
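For illustration, here's a rough sketch of such a traversal (not RisingWave code; the `Edge` struct, its field names, and the 0.5 threshold are made up): a fragment whose inbound edges are heavily back-pressured while its outbound edges are not is where the pressure stops, so it becomes a candidate row in the bottleneck table.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Edge {
    up_fragment: u32,
    down_fragment: u32,
    /// Fraction of time the upstream output buffer was blocked (0.0..=1.0).
    backpressure: f64,
}

/// A fragment is a bottleneck candidate if its inbound edges are heavily
/// back-pressured while its outbound edges are not: the pressure "ends" at it.
fn bottleneck_candidates(edges: &[Edge], threshold: f64) -> Vec<(u32, f64)> {
    let mut max_in: HashMap<u32, f64> = HashMap::new();
    let mut max_out: HashMap<u32, f64> = HashMap::new();
    for e in edges {
        let inp = max_in.entry(e.down_fragment).or_insert(0.0);
        *inp = (*inp).max(e.backpressure);
        let out = max_out.entry(e.up_fragment).or_insert(0.0);
        *out = (*out).max(e.backpressure);
    }
    let mut candidates: Vec<(u32, f64)> = max_in
        .iter()
        .filter(|&(frag, &inp)| {
            inp >= threshold && max_out.get(frag).copied().unwrap_or(0.0) < threshold
        })
        .map(|(&frag, &inp)| (frag, inp))
        .collect();
    // Sort by how badly the fragment is back-pressured, worst first.
    candidates.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    candidates
}

fn main() {
    // Hypothetical graph: 1 -> 2 -> 3, with pressure accumulating at fragment 3.
    let edges = vec![
        Edge { up_fragment: 1, down_fragment: 2, backpressure: 0.9 },
        Edge { up_fragment: 2, down_fragment: 3, backpressure: 0.95 },
    ];
    for (fragment, bp) in bottleneck_candidates(&edges, 0.5) {
        println!("fragment {fragment}: max inbound back-pressure {bp:.2}");
    }
}
```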

@kwannoel
Contributor Author

kwannoel commented Aug 19, 2024

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

@BugenZhao
Member

BugenZhao commented Aug 19, 2024

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

True. With the Prometheus approach we're able to retrieve historical statistics. That's also one of the reasons I chose Grafana to visualize the back-pressure graph in the very first attempt (#13422).

@kwannoel
Contributor Author

Btw, another issue is that we can only check the bottleneck MV at the current time. But if the bottleneck occurs during night-time hours, since we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.

True. With the Prometheus approach we're able to retrieve historical statistics. That's also one of the reasons I chose Grafana to visualize the back-pressure graph in the very first attempt (#13422).

Actually, putting the graph in Grafana is sufficient, because we can view the graph at a specific point in time.

@kwannoel
Contributor Author

kwannoel commented Aug 20, 2024

Based on our discussions, here are my thoughts on the approach:

  1. Polish up the existing solution first, since it's what we have.
    a. Show the MV-level bottleneck graph, so users can trace which MV is causing the backpressure.
    b. Allow users to select a specific bottleneck MV and zoom into its fragment graph.
    c. Change the dashboard bottleneck graph to use Prometheus metrics, and support historical data querying. We don't need it to be on Grafana; as long as the data source supports historical data querying, that's good enough (see the sketch after this list).
  2. Port this existing solution over to the cloud portal. Users can self-serve with this feature.
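As a sketch of 1c (assumptions: a Prometheus instance reachable at localhost:9090, reqwest and tokio as dependencies; the PromQL expression is the one the dev-dashboard already builds, minus the label selector): historical back-pressure can be pulled through Prometheus' standard /api/v1/query_range endpoint, so the graph can be rendered for any past point in time, e.g. during a night-time incident.

```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same expression the dev-dashboard uses today, expressed as a range query.
    let query = "avg(rate(stream_actor_output_buffer_blocking_duration_ns[60s])) \
                 by (fragment_id, downstream_fragment_id) / 1000000000";

    let now = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)?
        .as_secs();
    let start = now - 6 * 3600; // look back 6 hours

    let resp = reqwest::Client::new()
        .get("http://localhost:9090/api/v1/query_range")
        .query(&[
            ("query", query.to_string()),
            ("start", start.to_string()),
            ("end", now.to_string()),
            ("step", "60".to_string()),
        ])
        .send()
        .await?
        .text()
        .await?;

    // The JSON response contains one time series per
    // (fragment_id, downstream_fragment_id) pair; the dashboard could render
    // these as the back-pressure graph at any selected point in time.
    println!("{resp}");
    Ok(())
}
```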

@kwannoel kwannoel assigned kwannoel and unassigned BugenZhao Aug 20, 2024
@fuyufjh
Member

fuyufjh commented Aug 21, 2024

An idea: shall we show the current epoch for each actor in the DAG? In particular, we may highlight the actors that are at the lowest epoch.

This is a useful technique when we use await-tree to find bottlenecks. Besides, it doesn't need Prometheus because the actor's epoch can be found in task_local or LocalBarrierManager.
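To illustrate the idea (a hypothetical sketch only, not the actual LocalBarrierManager interface): keep the latest barrier epoch each actor has collected, then highlight the actors stuck at the minimum epoch as bottleneck suspects.

```rust
use std::collections::HashMap;

type ActorId = u32;
type Epoch = u64;

#[derive(Default)]
struct EpochTracker {
    latest: HashMap<ActorId, Epoch>,
}

impl EpochTracker {
    /// Called whenever an actor finishes processing a barrier.
    fn on_barrier(&mut self, actor: ActorId, epoch: Epoch) {
        self.latest.insert(actor, epoch);
    }

    /// Actors at the lowest epoch are the ones holding everyone else back.
    fn slowest_actors(&self) -> Vec<ActorId> {
        let Some(&min_epoch) = self.latest.values().min() else {
            return vec![];
        };
        self.latest
            .iter()
            .filter_map(|(&actor, &epoch)| (epoch == min_epoch).then_some(actor))
            .collect()
    }
}

fn main() {
    let mut tracker = EpochTracker::default();
    tracker.on_barrier(1, 105);
    tracker.on_barrier(2, 105);
    tracker.on_barrier(3, 101); // lagging actor
    println!("highlight in the DAG: {:?}", tracker.slowest_actors()); // [3]
}
```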

@kwannoel
Contributor Author

An idea: shall we show the current epoch for each actor? In particular, we may highlight the actors that are at the lowest epoch.

This is a useful technique when we use await-tree to find bottlenecks. Besides, it doesn't need Prometheus because the actor's epoch can be found in task_local or LocalBarrierManager.

There can be a case where upstream and downstream actors are in the same epoch, but the data between them is backpressured due to the downstream actor being a bottleneck.

In such a case, it seems only the info about actor blocking, or await-tree, can tell us what the bottleneck is.

@fuyufjh
Member

fuyufjh commented Aug 21, 2024

There can be a case where upstream and downstream actors are in the same epoch, but the data between them is backpressured due to the downstream actor being a bottleneck.

In such a case, it seems only the info about actor blocking, or await-tree, can tell us what the bottleneck is.

Yeah, I guess both of these pieces of information lead to the same conclusion: which actor/fragment/MV is the bottleneck. Basically, there will be a subgraph in the entire graph that has low epoch numbers and high backpressure.

@fuyufjh
Member

fuyufjh commented Aug 21, 2024

Another thing worth mentioning: sometimes the backpressure metrics can't directly point out the real cause of the problem.

For example, a 1000x join amplification may happen at actor A,

 X ---\
       A ==> B ===> [C]
 Y ---/

If B is simple, the 1000x traffic may not cause any issue there. It might be a further downstream actor C, which has some complicated window function logic, that finally becomes the bottleneck.

To address the problem correctly, I think throughput is another important metric. We had better show both of these in the visualization graph.
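To make that concrete, here's a toy sketch with made-up numbers and struct names (these are not RisingWave metrics): back-pressure alone points at C as the place where pressure accumulates, while the amplification ratio (rows out / rows in) points at A as the root cause.

```rust
struct ActorStats {
    name: &'static str,
    rows_in_per_sec: f64,
    rows_out_per_sec: f64,
    /// Fraction of time the actor's output buffer is blocked (0.0..=1.0).
    output_backpressure: f64,
}

fn main() {
    // Hypothetical steady-state numbers for the X/Y -> A => B => [C] graph above.
    let actors = [
        ActorStats { name: "A (join)", rows_in_per_sec: 50.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.85 },
        ActorStats { name: "B (project)", rows_in_per_sec: 50_000.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.90 },
        ActorStats { name: "C (window)", rows_in_per_sec: 50_000.0, rows_out_per_sec: 50_000.0, output_backpressure: 0.00 },
    ];

    for a in &actors {
        let amplification = a.rows_out_per_sec / a.rows_in_per_sec;
        println!(
            "{:<12} amplification {:>8.1}x, output back-pressure {:>3.0}%",
            a.name,
            amplification,
            a.output_backpressure * 100.0
        );
    }
    // Every edge upstream of C shows high back-pressure, so the graph points at
    // C as the sink of the pressure; A's 1000x amplification explains why C
    // can't keep up.
}
```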

@kwannoel
Contributor Author

kwannoel commented Aug 23, 2024

If the duration for which the actor blocks is longer than the interval at which we poll the meta service, it will currently show up as no backpressure in the dashboard. That's because, since the actor is still blocked when we poll, there's no change in the blocking-duration metric. Only after the actor has sent the chunk downstream does it increment the blocking-duration metric.

See the dashboard query:

    format!("avg(rate(stream_actor_output_buffer_blocking_duration_ns{{{}}}[60s])) by (fragment_id, downstream_fragment_id) / 1000000000", srv.prometheus_selector);

and the metric update in the dispatcher:

    let start_time = Instant::now();
    dispatcher.dispatch_data(chunk.clone()).await?;
    // The counter only advances once dispatch_data returns, i.e. after the
    // chunk has been accepted downstream.
    dispatcher
        .actor_output_buffer_blocking_duration_ns
        .inc_by(start_time.elapsed().as_nanos() as u64);

When the stream graph is congested, we often see in the await tree that an actor can be blocked for several minutes, so this issue can definitely occur. #18215 is a fall-back solution and needs further investigation.

#18219 can mitigate it, as long as the blocking duration is not too long.
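For illustration, one possible direction (a rough sketch only; not necessarily what #18215 or #18219 actually implement, and assuming tokio with the full feature set): periodically flush the elapsed blocking time into the counter while the dispatch future is still pending, so `rate(...[60s])` keeps moving during a long block.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Stand-in for the real `actor_output_buffer_blocking_duration_ns` counter.
static BLOCKING_DURATION_NS: AtomicU64 = AtomicU64::new(0);

async fn dispatch_with_periodic_flush<F>(dispatch: F)
where
    F: std::future::Future<Output = ()>,
{
    tokio::pin!(dispatch);
    let mut last_flush = Instant::now();
    loop {
        tokio::select! {
            _ = &mut dispatch => {
                // Dispatch finished: account for the tail since the last flush.
                BLOCKING_DURATION_NS
                    .fetch_add(last_flush.elapsed().as_nanos() as u64, Ordering::Relaxed);
                return;
            }
            _ = tokio::time::sleep(Duration::from_secs(10)) => {
                // Still blocked: flush what has accumulated so far, so the
                // metric moves even though the chunk hasn't been sent yet.
                BLOCKING_DURATION_NS
                    .fetch_add(last_flush.elapsed().as_nanos() as u64, Ordering::Relaxed);
                last_flush = Instant::now();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Simulate a dispatch that stays blocked for ~25 seconds; the counter now
    // advances every 10 seconds instead of only once at the very end.
    dispatch_with_periodic_flush(tokio::time::sleep(Duration::from_secs(25))).await;
    println!("{} ns", BLOCKING_DURATION_NS.load(Ordering::Relaxed));
}
```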

@kwannoel
Contributor Author

kwannoel commented Oct 8, 2024

The remaining task is #18215. We can close this issue, as the dashboard work is done.
