You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When the parallelism in the cluster is relatively high and there are many multi-way join streaming jobs, the barrier body BarrierMutation on the stream will be amplified multiple times. When flowing between different compute nodes, it will cause a significant amount of memory usage for prost message decoding and may result in OOM. Here is a solution to fix it, details described as bellow:
Some thoughts discussed with @st1page , there is one feasible optimization solution to change the process of the barrier:
Before sending the barrier, it can be sent to the local barrier manager on compute node first. When injecting the barrier, we can provide the id (epoch) only and let the actors to read specific mutation information from local barrier manager if necessary. By this way, BarrierMutation only needs to be decoded once on each compute node.
/// Barriers from some actors have been collected and stashed, however no `send_barrier`
/// request from the meta service is issued.
Stashed{
/// Actor ids we've collected and stashed.
collected_actors:HashSet<ActorId>,
},
yezizp2012
changed the title
feat: reduce stream barrier body and send one copy of barrier body to each CN
feat: reduce stream barrier body and only send one copy of it to each CN
Jan 12, 2024
When the parallelism in the cluster is relatively high and there are many multi-way join streaming jobs, the barrier body
BarrierMutation
on the stream will be amplified multiple times. When flowing between different compute nodes, it will cause a significant amount of memory usage for prost message decoding and may result in OOM. Here is a solution to fix it, details described as bellow:The text was updated successfully, but these errors were encountered: