feat: reduce stream barrier body and only send one copy of it to each CN #14533

yezizp2012 · 2024-01-12T08:31:07Z

When cluster parallelism is relatively high and there are many multi-way join streaming jobs, the barrier body (BarrierMutation) carried on the stream is amplified many times. When barriers flow between different compute nodes, decoding these prost messages uses a significant amount of memory and may result in OOM. A proposed fix is described below:

After discussing with @st1page, one feasible optimization is to change how the barrier is processed:
Before the barrier is injected, its body can first be sent to the local barrier manager on each compute node. When injecting the barrier, we provide only the id (epoch) and let actors read the specific mutation information from the local barrier manager if necessary. This way, BarrierMutation only needs to be decoded once on each compute node.
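
To make the idea concrete, here is a minimal sketch of the lookup-by-epoch scheme, using only the Rust standard library. All names (`Mutation`, `LocalBarrierRegistry`, `LightBarrier`, `register`, `mutation_for`, `complete`) are hypothetical and are not RisingWave's actual types or API; the real mutation payload is also reduced to a placeholder string here.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the decoded `BarrierMutation` payload.
#[derive(Debug, PartialEq)]
pub struct Mutation {
    pub description: String,
}

/// Hypothetical per-compute-node registry mapping epoch -> shared mutation.
#[derive(Default)]
pub struct LocalBarrierRegistry {
    mutations: RwLock<HashMap<u64, Arc<Mutation>>>,
}

impl LocalBarrierRegistry {
    /// Stores the mutation once per compute node, before the barrier is injected.
    pub fn register(&self, epoch: u64, mutation: Mutation) {
        self.mutations
            .write()
            .unwrap()
            .insert(epoch, Arc::new(mutation));
    }

    /// Returns a shared handle to the mutation for `epoch`, if one was registered.
    /// Actors clone the `Arc`, not the mutation itself.
    pub fn mutation_for(&self, epoch: u64) -> Option<Arc<Mutation>> {
        self.mutations.read().unwrap().get(&epoch).cloned()
    }

    /// Removes the entry once the epoch no longer needs to be looked up.
    pub fn complete(&self, epoch: u64) {
        self.mutations.write().unwrap().remove(&epoch);
    }
}

/// The barrier that flows between actors carries only the epoch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LightBarrier {
    pub epoch: u64,
}
```

In this sketch, every actor on a node that calls `mutation_for(barrier.epoch)` receives a handle to the same allocation, so the payload exists once per node regardless of how many actors observe the barrier.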

Originally posted by @yezizp2012 in #13060 (comment)

BugenZhao · 2024-01-12T08:36:38Z

+1 for this. In this way the Stashed state can also be removed. 😄

risingwave/src/stream/src/task/barrier_manager/managed_state.rs

Lines 36 to 41 in da05733

    /// Barriers from some actors have been collected and stashed, however no `send_barrier`
    /// request from the meta service is issued.
    Stashed {
        /// Actor ids we've collected and stashed.
        collected_actors: HashSet<ActorId>,
    },

kwannoel · 2024-03-12T03:33:54Z

Maybe related @fuyufjh

github-actions bot added this to the release-1.7 milestone Jan 12, 2024

yezizp2012 changed the title ~~feat: reduce stream barrier body and send one copy of barrier body to each CN~~ feat: reduce stream barrier body and only send one copy of it to each CN Jan 12, 2024

yezizp2012 self-assigned this Jan 12, 2024

yezizp2012 mentioned this issue Feb 2, 2024

Longevity test CN and Meta OOM nightly-20240201 #14944

Closed

zwang28 mentioned this issue Feb 26, 2024

bug: reschedule has much larger memory footprint than expected #13774

Closed

yezizp2012 mentioned this issue Mar 5, 2024

feat: execute auto-scaling in batches #15420

Merged

3 tasks

yezizp2012 modified the milestones: release-1.7, release-1.8 Mar 6, 2024

yezizp2012 mentioned this issue Mar 12, 2024

feat: avoid decode barrier when passing between different compute nodes #15644

Merged

9 tasks

yezizp2012 closed this as completed in #15644 Mar 18, 2024
