Easily identify stream graph bottleneck #13481
Comments
Oops, seems like @BugenZhao is already working on it:
Yeah, it's at #13422 but I'm encountering some problems...
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue the discussion or close it as not planned.
I think the dashboard-based solution is inadequate for clusters with a large number of stream jobs. I found the backpressure graph takes too long to load. Further, the dashboard approach is not user-facing. I would like some form of metrics that the cloud side can scrape and that we can display in our dev-dashboard. Btw, this is not to diminish the great work that @BugenZhao has done 🙌 , it is very useful for diagnosing bottlenecks in small to medium size clusters, and it also gives an idea of how we can identify bottlenecks in the stream graph in an automated way, and potentially how we can visualize them for users. Feel free to chime in if there's some way we can repurpose our existing solutions.
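Regarding scrape-able metrics, here is a minimal sketch using the `prometheus` crate of what such an endpoint could expose. The metric name and labels are purely illustrative, not the project's actual ones:

```rust
use prometheus::{Encoder, GaugeVec, Opts, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative metric: a per-fragment backpressure rate gauge.
    let registry = Registry::new();
    let backpressure = GaugeVec::new(
        Opts::new(
            "stream_fragment_backpressure_rate",
            "Fraction of time upstream is blocked sending to this fragment",
        ),
        &["fragment_id"],
    )?;
    registry.register(Box::new(backpressure.clone()))?;

    // Whatever collects the stats would update the gauge periodically.
    backpressure.with_label_values(&["42"]).set(0.87);

    // Expose in the Prometheus text format so the cloud side can scrape it.
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}
```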
For this specific point, I think #17510 can help.
Bugen tried to place a DAG graph in a Grafana board, but it turned out to be hard because Grafana doesn't provide a way to dynamically generate a DAG diagram. Given that, our kernel dashboard is perhaps the best way to achieve this goal.
Currently yes, but I don't think it's a major problem. As long as it works well, we can port it into the Cloud portal quickly. What we need now is to prove this is an efficient way for users to self-diagnose.
Oh, that's true. Underneath it's just a bunch of API calls to the meta service.
The way forward:
I slightly prefer visualizing it as a table (listing the bottleneck MVs and specific actors), since when the DAG is really large, it may be hard to navigate and spot the bottleneck MV. But I think that could be an implementation detail. Now that we have a DAG, we can traverse it and come up with the bottleneck spots ourselves.
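As a rough sketch of traversing the DAG for bottleneck spots, assuming a hypothetical per-edge backpressure summary (the types, names, and the 0.5 threshold are all illustrative, not from the actual codebase): a fragment whose inputs are heavily backpressured while its own outputs are not is a bottleneck candidate.

```rust
use std::collections::HashMap;

/// Hypothetical summary of one edge in the fragment DAG.
/// `backpressure_rate` is the fraction of time the upstream side
/// spends blocked on sending to the downstream side (0.0..=1.0).
struct Edge {
    upstream: u32,   // upstream fragment id
    downstream: u32, // downstream fragment id
    backpressure_rate: f64,
}

/// List fragments that look like bottlenecks: their inputs are heavily
/// backpressured, but they are not blocked by anything downstream.
fn find_bottlenecks(edges: &[Edge]) -> Vec<u32> {
    let mut max_in: HashMap<u32, f64> = HashMap::new();
    let mut max_out: HashMap<u32, f64> = HashMap::new();
    for e in edges {
        let slot = max_in.entry(e.downstream).or_insert(0.0);
        *slot = slot.max(e.backpressure_rate);
        let slot = max_out.entry(e.upstream).or_insert(0.0);
        *slot = slot.max(e.backpressure_rate);
    }
    let mut result = Vec::new();
    for (&frag, &in_bp) in &max_in {
        let out_bp = max_out.get(&frag).copied().unwrap_or(0.0);
        if in_bp > 0.5 && out_bp < 0.5 {
            result.push(frag);
        }
    }
    result
}

fn main() {
    // Toy graph: fragment 1 -> 2 -> 3, where fragment 2 is the slow one.
    let edges = vec![
        Edge { upstream: 1, downstream: 2, backpressure_rate: 0.9 },
        Edge { upstream: 2, downstream: 3, backpressure_rate: 0.05 },
    ];
    println!("bottleneck fragments: {:?}", find_bottlenecks(&edges));
}
```

The same pass can feed a table view directly: sort the candidates by their maximum input backpressure and list the owning MV and actors.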
Btw, another issue is that we can only check the bottleneck MV at the current time. If it occurs during night-time hours, because we don't collect historical data of the MV, we can easily lose the information about which MV and actor is the bottleneck.
True. With the Prometheus approach we're able to retrieve historical statistics. That's also one of the reasons I chose Grafana to visualize the back-pressure graph in the very first attempt (#13422).
Actually, putting the graph in Grafana is sufficient, because we can then view it at a specific point in time.
Based on our discussions, here are my thoughts on the approach:
An idea: shall we show the current epoch for each actor in the DAG graph? Particularly, we may highlight the actors that are at the lowest epoch. This is a useful technique when we use await-tree to find bottlenecks. Besides, it doesn't need Prometheus because the actor's epoch can be found in
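A minimal sketch of the lowest-epoch idea above, assuming a hypothetical map from actor id to the last barrier epoch it has processed (all names are illustrative, not from the actual codebase):

```rust
use std::collections::HashMap;

/// Given a snapshot of the barrier epoch each actor has reached,
/// return the actors stuck at the minimum epoch. These are the ones
/// holding back the graph and the first candidates to highlight in
/// the DAG view.
fn lowest_epoch_actors(actor_epochs: &HashMap<u32, u64>) -> Vec<u32> {
    let Some(&min_epoch) = actor_epochs.values().min() else {
        return Vec::new();
    };
    actor_epochs
        .iter()
        .filter(|&(_, &epoch)| epoch == min_epoch)
        .map(|(&actor, _)| actor)
        .collect()
}
```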
There can be a case where upstream and downstream actors are in the same epoch, but the data between them is backpressured due to the downstream actor being a bottleneck. In such a case it seems only the info about
Yeah, I guess both of these signals can lead to the same conclusion - which actor/fragment/MV is the bottleneck. Basically, there will be a subgraph in the entire graph that has low epoch numbers and high backpressure.
Another thing worth mentioning: sometimes the backpressure metrics can't directly point out the real cause of the problem. For example, a 1000x join amplification may happen at actor A, whose output then flows to a downstream actor B. If B is simple, the 1000x traffic may not cause any issue. It might only be at a further downstream actor C, which has some complicated window function logic, that it finally becomes a bottleneck. To attribute the problem correctly, I think throughput is another important metric. We had better show both of these in the visualization graph.
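A small sketch of carrying both signals per fragment, as suggested above; the struct and field names are hypothetical:

```rust
/// Hypothetical per-fragment statistics combining the two signals
/// discussed above; none of these names come from the actual codebase.
struct FragmentStats {
    fragment_id: u32,
    /// Rows emitted per second by this fragment.
    output_rows_per_sec: f64,
    /// Rows received per second from upstream.
    input_rows_per_sec: f64,
    /// Fraction of time upstream spends blocked on sending to this fragment.
    input_backpressure_rate: f64,
}

impl FragmentStats {
    /// Amplification: how much traffic this fragment fans out downstream.
    /// A 1000x join shows a ratio around 1000 even when it is not itself
    /// the bottleneck, which is why both numbers should be displayed.
    fn amplification(&self) -> f64 {
        if self.input_rows_per_sec > 0.0 {
            self.output_rows_per_sec / self.input_rows_per_sec
        } else {
            0.0
        }
    }
}
```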
If the duration the actor blocks for is longer than the interval at which we poll the meta service, it will currently show up as no backpressure in the dashboard. That's because when we poll, the actor is still blocking, so there's no change in the blocking duration metric. Only after the actor has output the chunk downstream does it increment the blocking duration metric. See: risingwave/src/stream/src/executor/dispatch.rs, lines 126 to 130 in 4d0a201.
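A simplified sketch of the accounting pattern described above (not the actual dispatch.rs code): the blocked time is added to the counter only after the awaited send resolves, so a scrape taken while the actor is still stuck inside the await sees no change and the dashboard reports no backpressure.

```rust
use std::time::Instant;
use tokio::sync::mpsc;

/// Send one chunk downstream and record how long the send was blocked.
/// Note that the counter only moves after the `await` completes; an
/// in-flight block contributes nothing to the metric until then.
async fn send_and_record(
    tx: &mpsc::Sender<Vec<u8>>,
    chunk: Vec<u8>,
    blocked_nanos: &mut u64,
) {
    let start = Instant::now();
    // If downstream is congested, this await may block for minutes.
    let _ = tx.send(chunk).await;
    // Only here does the blocking-duration metric get incremented.
    *blocked_nanos += start.elapsed().as_nanos() as u64;
}
```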
When the stream graph is congested, we often see in the await tree that an actor can be blocked for several minutes, so this issue can definitely occur. #18215 is a fall-back solution that needs further investigation. #18219 can mitigate it, as long as the blocking duration is not too long.
The remaining task is #18215. We can close this issue as the dashboard work is done.