
feat(meta): deprecate parallel unit #17523

Merged: 19 commits merged into main on Jul 23, 2024

Conversation

@shanicky (Contributor) commented Jul 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR is a massive one, and the specific features still need to be verified. If possible, it may be split into multiple smaller PRs in the future. Due to the need to consider backwards compatibility, more testing is required. Currently, we need to ensure that it can consistently pass CI tests.

This PR made the following modifications:

  1. Removed the dependency on persistent global ParallelUnits, replacing them with dynamic, temporary WorkerSlots. Previously, a ParallelUnit carried two semantics: the Worker and the Worker's parallelism. So for each Worker with parallelism P, we generate (WorkerId, 0), (WorkerId, 1), ..., (WorkerId, P-1) to replace it (see the sketch after this list). The vnode mapping used in communication between meta and the frontend also uses worker slots, which should be simplified to a WorkerMapping in the future.
  2. Removed the VnodeMapping of Fragment. Since VnodeMapping was implemented through ParallelUnitMapping, and we can derive it from the ActorBitmaps and ActorStatus, a persistent VnodeMapping is not needed. Moreover, binding Fragments to Workers is actually quite strange.
  3. Modified the Reschedule interface. The previous syntax was {fragment_id}-[parallel_unit]+[parallel_unit]; it has now been changed to {fragment_id}-[worker_id:count]+[worker_id:count], i.e. {fragment_id}:[worker_id:diff]. A reschedule is now defined by modifying the number of actor allocations on each worker. The simulation tests have all been updated accordingly.
  4. Considering compatibility, rw_parallel_units is still retained, now in the form of (slot_id, worker_id).
  5. The ParallelUnit in ActorStatus has not been removed because I haven't figured out how to handle backward compatibility elegantly yet. 🥵 We will only use the WorkerId field. For a new ActorStatus, we fill the ParallelUnit with u32::MAX as the id.
  6. The original interface for generating stable resize plans has been removed in this PR. With the introduction of the alter syntax and auto scaling, this feature is rarely used, and modifying it would make this PR even larger. Let's discuss it later.
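
To make item 1 concrete, here is a minimal sketch of the WorkerSlot generation, assuming a simplified `Worker` type (the real meta model differs):

```rust
// Minimal sketch of WorkerSlot generation; `Worker` is a simplified
// stand-in for the actual worker node type in meta.
struct Worker {
    id: u32,
    parallelism: u32,
}

/// A worker slot: (worker id, slot index within that worker).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WorkerSlotId(u32, u32);

/// For each worker with parallelism P, generate (id, 0) ..= (id, P - 1).
fn worker_slots(workers: &[Worker]) -> Vec<WorkerSlotId> {
    workers
        .iter()
        .flat_map(|w| (0..w.parallelism).map(move |slot| WorkerSlotId(w.id, slot)))
        .collect()
}

fn main() {
    let workers = vec![
        Worker { id: 1, parallelism: 2 },
        Worker { id: 2, parallelism: 3 },
    ];
    // Prints: [WorkerSlotId(1, 0), WorkerSlotId(1, 1), WorkerSlotId(2, 0), WorkerSlotId(2, 1), WorkerSlotId(2, 2)]
    println!("{:?}", worker_slots(&workers));
}
```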

Considering compatibility issues, when creating a streaming job, the maximum parallelism is still limited to the sum of worker parallelism. However, it is possible to manually alter the parallelism to a huge value.


```sql
dev=> set streaming_parallelism = 20;
SET_VARIABLE
dev=> create table t(v int);
ERROR:  Failed to run the query

Caused by these errors (recent errors listed first):
  1: gRPC request to meta service failed: The service is currently unavailable
  2: Service unavailable: Not enough parallelism to schedule, required: 20, available: 12

dev=> set streaming_parallelism = 0;
SET_VARIABLE
dev=> create table t(v int);
CREATE_TABLE
dev=> alter table t set parallelism = 100;
ALTER_TABLE
dev=> select fragment_id, count(*) from rw_actors group by fragment_id;
 fragment_id | count
-------------+-------
           3 |   100
           4 |   100
(2 rows)
```

You can find more specific reasons in the Comment #17523 (comment).

The following is an AI-generated summary.

Summary of Changes

This extensive PR contains a series of updates to both the Protocol Buffers definitions and Rust source code. These changes aim to streamline and simplify the management of worker nodes in our system.

Protocol Buffers

Changes to proto/common.proto and proto/meta.proto

  • The parallel_units field in the WorkerNode message has been removed; its field number is now reserved.
  • A new field, uint32 parallelism, has been introduced to the WorkerNode message.
  • Several fields in the TableFragments, MigrationPlan, and ListActorStatesResponse messages were removed and their field numbers reserved:
    • Reserved vnode_mapping in TableFragments.
    • Replaced parallel_unit_migration_plan in MigrationPlan with a new worker_slot_migration_plan field.
    • Replaced parallel_unit_id in ListActorStatesResponse with a new worker_id field.
  • The Reschedule message is now more appropriately named WorkerReschedule, with changes to the map types for added_parallel_units and removed_parallel_units.
  • The RescheduleRequest message has been simplified by removing the reschedules field and including worker_reschedules.
  • Deletion of GetReschedulePlanRequest and GetReschedulePlanResponse messages alongside their corresponding RPC method in ScaleService (i.e., GetReschedulePlan) to reflect the new design.

Rust Source Code

Updates in src/batch/src/executor/join/local_lookup_join.rs

  • Reflected the Protocol Buffers changes with a switch from using worker.parallel_units.len() to worker.parallelism as usize.

Refactor in src/batch/src/worker_manager/worker_node_manager.rs

  • Redesigned the worker slot to worker node mapping system following the removal of parallel_units, utilizing a new structure based on the id and parallelism.
  • Modified the total_parallelism function to compute using the new parallelism field rather than the length of parallel_units.

Adjustments in src/common/src/hash/consistent_hash/mapping.rs

  • WorkerSlotId now uses a custom implementation of fmt for its `Debug` output.
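
As a sketch of what such a custom `Debug` might look like (the packed-u64 layout below is an assumption for illustration, not necessarily the PR's exact representation):

```rust
use std::fmt;

/// Sketch: a worker slot packed into a u64, with the worker id in the high
/// 32 bits and the slot index in the low 32 bits (layout is an assumption).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct WorkerSlotId(u64);

impl WorkerSlotId {
    pub fn new(worker_id: u32, slot_idx: u32) -> Self {
        Self(((worker_id as u64) << 32) | slot_idx as u64)
    }
    pub fn worker_id(&self) -> u32 {
        (self.0 >> 32) as u32
    }
    pub fn slot_idx(&self) -> u32 {
        self.0 as u32
    }
}

impl fmt::Debug for WorkerSlotId {
    // Render as `[worker_id:slot_idx]` rather than the opaque packed integer.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}:{}]", self.worker_id(), self.slot_idx())
    }
}
```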

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See [details]
  • All checks passed in ./risedev check (or alias, ./risedev c)

@kwannoel (Contributor) commented Jul 2, 2024

I suppose this is part of https://www.notion.so/risingwave-labs/From-Parallel-Unit-to-Actor-Group-b0b780a332f147be8ca469162c43d3e6?

> we can derive VnodeMapping from ActorBitmaps and ActorStatus

Could you elaborate on this? IIUC it was previously used to map vnodes to parallel units (worker nodes).
I suppose now it's mapped to Actor Group. And so ActorBitmaps and ActorStatus somehow can be used to provide this mapping.

> Removed the dependency on persistent global ParallelUnits, replacing them with dynamic, temporary WorkerSlots. Previously, a ParallelUnit carried two semantics: the Worker and the Worker's parallelism. So for each Worker with parallelism P, we generated (WorkerId, 0), (WorkerId, 1), and so on to replace it. The vnode mapping used in communication between meta and the frontend also uses worker slots, which should be simplified to a WorkerMapping in the future.

What is a "WorkerSlot"? And which does it represent? Worker or worker parallelism?

@shanicky (Contributor, author) commented Jul 2, 2024

> I suppose this is part of notion.so/risingwave-labs/From-Parallel-Unit-to-Actor-Group-b0b780a332f147be8ca469162c43d3e6?
>
> > we can derive VnodeMapping from ActorBitmaps and ActorStatus
>
> Could you elaborate on this? IIUC it was previously used to map vnodes to parallel units (worker nodes). I suppose now it's mapped to Actor Group. And so ActorBitmaps and ActorStatus somehow can be used to provide this mapping.
>
> > Removed the dependency on persistent global ParallelUnits, replacing them with dynamic, temporary WorkerSlots. Previously, a ParallelUnit carried two semantics: the Worker and the Worker's parallelism. So for each Worker with parallelism P, we generated (WorkerId, 0), (WorkerId, 1), and so on to replace it. The vnode mapping used in communication between meta and the frontend also uses worker slots, which should be simplified to a WorkerMapping in the future.
>
> What is a "WorkerSlot"? And which does it represent? Worker or worker parallelism?

First, the VnodeMapping previously used ParallelUnit as the key that maps to each Vnode. Combined with the Actor location in ActorStatus (which includes the ParallelUnitId and WorkerId), it is possible to derive the ActorMapping (mapping ActorId to Vnode), and the ActorMapping can be converted into a Bitmap for each Actor.

Therefore, we can also go the other way: generate the ActorMapping from each Actor's Bitmap (handling both the Single and Hash distribution cases for Fragments), and then generate the Fragment's VnodeMapping (i.e., the ParallelUnitMapping) using the ParallelUnit from ActorStatus.
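
A hedged sketch of the first half of that derivation, rebuilding the actor-to-vnode mapping from per-actor bitmaps (the types are simplified stand-ins; the real code uses proper bitmap types rather than `Vec<bool>`):

```rust
use std::collections::HashMap;

type ActorId = u32;

/// Sketch: rebuild an actor -> vnode mapping from per-actor vnode bitmaps.
/// `bitmaps[actor][vnode]` marks vnode ownership; the bitmaps are assumed
/// to be disjoint and to cover every vnode (the Hash-distribution case).
fn actor_mapping_from_bitmaps(
    bitmaps: &HashMap<ActorId, Vec<bool>>,
    vnode_count: usize,
) -> Vec<ActorId> {
    let mut mapping = vec![0; vnode_count];
    for (&actor, bitmap) in bitmaps {
        for (vnode, &owned) in bitmap.iter().enumerate() {
            if owned {
                mapping[vnode] = actor; // vnode -> owning actor
            }
        }
    }
    mapping
}
```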

WorkerSlot is a temporary stand-in for ParallelUnit. Theoretically, we should discard this intermediate layer completely, but there are too many dependencies, so at this stage we settled on a temporary solution that can still express the semantics of ParallelUnit. A WorkerSlot is a temporarily generated Id corresponding to each unit of parallelism of each Worker. For example, for a Worker with Id 1 and parallelism 5, we generate five WorkerSlots: (1,0), (1,1), (1,2), (1,3), (1,4).

Previously, we mainly used ParallelUnit to align the upstream and downstream of NoShuffle relationships through a shared ParallelUnitId (because they need a 1-to-1 correspondence and must be on the same Worker). This mainly appears in two scenarios (see the sketch after this list):

  1. During scheduling, we use WorkerSlots to ensure this correspondence, meaning the WorkerSlots of NoShuffle's upstream and downstream are consistent.
  2. During scaling, i.e. the rescheduling process, the original 1-to-1 relationship is retained in the actors' dispatchers, so we can rebuild based on this relationship without WorkerSlot alignment. For the upstream and downstream in cascading NoShuffle expansion, we use temporary WorkerSlots to solve it (same as scheduling).
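
A minimal sketch of the scheduling-time alignment in item 1, assuming the pairing information has already been extracted from the dispatchers (all names here are illustrative, not the PR's actual code):

```rust
use std::collections::HashMap;

type ActorId = u32;

/// (worker id, slot index), as in the sketches above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WorkerSlotId(u32, u32);

/// Sketch: align a NoShuffle downstream fragment with its upstream by giving
/// each downstream actor the same worker slot as its paired upstream actor,
/// which also pins both actors to the same worker.
fn align_no_shuffle(
    upstream_slots: &HashMap<ActorId, WorkerSlotId>,
    pairs: &[(ActorId, ActorId)], // (upstream actor, downstream actor)
) -> HashMap<ActorId, WorkerSlotId> {
    pairs
        .iter()
        .map(|&(up, down)| (down, upstream_slots[&up]))
        .collect()
}
```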

In summary, we previously bound everything to ParallelUnit, which created significant limitations; for instance, an Actor couldn't run without a ParallelUnit. By binding these concepts through other means, we can remove the dependency on ParallelUnit and downgrade it to temporary WorkerSlots that exist only during the scheduling process.

@shanicky shanicky force-pushed the peng/remove-pu-union branch from 9c57fe1 to 120fee5 on July 2, 2024 08:55
@xxchan (Member) commented Jul 3, 2024

> This PR is a massive one, and the specific features still need to be verified. If possible, it may be split into multiple smaller PRs in the future. Due to the need to consider backwards compatibility, more testing is required. Currently, we need to ensure that it can consistently pass CI tests.

Some thoughts on the topic of "splitting PR":

  1. We can have a "code deletion" PR first (e.g., GetReschedulePlan). I guess it's trivial to merge, and can reduce a lot of LoC of this PR.
  2. Another idea is to do the "concept mapping" (or abstraction) refactoring first: e.g., previously we had worker.parallel_units.len(). We could first refactor that into a method worker.parallelism() (without changing the implementation), and then this PR would only need to change the implementation. (But I'm not sure how many similar things we can do for this PR, and how much it would help.)
  3. Graphite can help a lot with working on the split PRs simultaneously! (Specifically, it automatically rebases later PRs when you make changes to prior ones.)

Since I'm not an expert in this area, these are just ideas on the general topic that immediately came to mind, and they might not apply to this specific work. (Or is it just too late and troublesome to split now...?)

Comment on lines 9 to 30
```rust
// Add the new `parallelism` column to the worker property table, then drop
// the deprecated `parallel_unit_ids` column.
manager
    .alter_table(
        Table::alter()
            .table(WorkerProperty::Table)
            .add_column(
                ColumnDef::new(WorkerProperty::Parallelism)
                    .integer()
                    .not_null(),
            )
            .to_owned(),
    )
    .await?;

manager
    .alter_table(
        Table::alter()
            .table(WorkerProperty::Table)
            .drop_column(WorkerProperty::ParallelUnitIds)
            .to_owned(),
    )
    .await?;
```
@xxchan (Member):

I saw you talked about compatibility, but it seems the specific behavior here is not clearly discussed: We persisted parallel unit ids for tables, but dropped columns and added new columns here, so what will happen exactly for existing jobs?

@shanicky (Contributor, author):

During upgrades, the compute nodes will restart and re-trigger add_worker, so the parallelism field will be repopulated. We won't use the parallel unit id of actors or the vnode mapping field of fragments; all the necessary information can be derived from existing data.

Just to be safe, I'll still manually populate the parallelism field.

However, it seems our current architecture doesn't support downgrading very well 🫠 , so it's best to backup before upgrading.

@xxchan (Member):

> We won't use the parallel unit id of actors or the vnode mapping field of fragments; all the necessary information can be derived from existing data.

I'm still a little confused. e.g., IIUC we can have a table scheduled to a given worker with a fixed parallelism.

Now we dropped the persisted info (the table's parallel units), how can we recover it?

BTW, I think I commented in the wrong place. I should have commented on Fragments

@shanicky (Contributor, author):

We will read the WorkerId from the ActorStatus field, then determine its location. Previously, we would use the ParallelUnitId from ActorStatus.

@xxchan (Member) commented Jul 4, 2024:

Thanks, I didn't notice ActorStatus is part of TableFragments, and might misunderstand how everything works.

> The ParallelUnit in ActorStatus has not been removed because I haven't figured out how to handle backward compatibility elegantly yet. 🥵 We will only use the WorkerId field. For a new ActorStatus, we fill the ParallelUnit with u32::MAX as the id.

BTW, this might be worth writing in the proto comments.

@shanicky (Contributor, author):

This will be removed in the next PR. 🥵

@BugenZhao (Member) left a comment:

Haven't delved into this PR in detail, but I'm curious about where we maintain the mapping from vnode to worker node for each fragment in the new implementation. 🤔

  • Do we persist the mapping?
  • Do we directly use worker id?
  • Is worker id reusable?

@shanicky (Contributor, author) commented Jul 3, 2024

> Haven't delved into this PR in detail, but I'm curious about where we maintain the mapping from vnode to worker node for each fragment in the new implementation. 🤔
>
> • Do we persist the mapping?
> • Do we directly use worker id?
> • Is worker id reusable?

We will still store the worker id of each actor (through ActorStatus), but the mapping from worker to vnode is dynamically generated and not stored; we only store each actor's bitmap. The main reason is that the information in the actors' bitmaps is sufficient, so we can use the existing actor bitmaps to reconstruct the ActorMapping. Then we use a stable (seemingly) algorithm to generate the WorkerSlotMapping, which essentially groups the fragment's actors by worker, sorts the workers by id in ascending order, and allocates WorkerSlotIds accordingly (a sketch follows below). So I still need your help to confirm whether there would be any problems with this approach. 🥹
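
A sketch of that allocation, under the stated assumption that iteration order is made deterministic (the helper below is illustrative, not the PR's actual code):

```rust
use std::collections::BTreeMap;

type ActorId = u32;
type WorkerId = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WorkerSlotId(WorkerId, u32);

/// Sketch of the stable allocation described above: group a fragment's
/// actors by worker, iterate workers in ascending id order (BTreeMap), and
/// hand out slot indices 0, 1, 2, ... per worker. Given the same actor
/// locations, this always produces the same actor -> slot assignment.
fn assign_worker_slots(actor_locations: &[(ActorId, WorkerId)]) -> Vec<(ActorId, WorkerSlotId)> {
    let mut by_worker: BTreeMap<WorkerId, Vec<ActorId>> = BTreeMap::new();
    for &(actor, worker) in actor_locations {
        by_worker.entry(worker).or_default().push(actor);
    }
    let mut assignment = Vec::new();
    for (worker, mut actors) in by_worker {
        actors.sort_unstable(); // deterministic order within each worker
        for (slot, actor) in actors.into_iter().enumerate() {
            assignment.push((actor, WorkerSlotId(worker, slot as u32)));
        }
    }
    assignment
}
```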

What does "worker id reusable" refer to? 🤔 This PR doesn't modify the handling of worker ids, so it remains the same as before.

@shanicky shanicky force-pushed the peng/remove-pu-union branch from 120fee5 to da5026f on July 3, 2024 09:57
@xxchan (Member) left a comment:

I pulled the PR locally and browsed some parts. I have a feeling that most parts are relatively trivial, i.e., they just change the data passed down. 🤣 (Well, this might indeed be the biggest trickiness of large PRs: we cannot distinguish the "really important" changes from the rest.)

Perhaps you have some ideas about where the really important changes that need careful review are; you could comment inline to highlight them, or give some guidance on how you'd like the PR to be reviewed.

@shanicky (Contributor, author) commented Jul 3, 2024

> I pulled the PR locally and browsed some parts. I have a feeling that most parts are relatively trivial, i.e., they just change the data passed down. 🤣 (Well, this might indeed be the biggest trickiness of large PRs: we cannot distinguish the "really important" changes from the rest.)
>
> Perhaps you have some ideas about where the really important changes that need careful review are; you could comment inline to highlight them, or give some guidance on how you'd like the PR to be reviewed.

Yes, most of the modifications are mechanical code changes like worker.parallel_units.len() -> worker.parallelism; the core modifications are in the files under stream/stream_graph, recovery.rs, and scale.rs.

@BugenZhao (Member):

> We will still store the worker id of each actor (through ActorStatus), but the mapping from worker to vnode is dynamically generated and not stored; we only store each actor's bitmap.

This sounds promising. 👍

> What does "worker id reusable" refer to? 🤔

For example, if a completely new set of compute nodes joins and replaces the original ones, do we need to rewrite the persisted worker id in the actors? Previously it was the parallel unit id that was persisted, so we only needed to update the mapping from parallel unit to worker node in other places.

@shanicky (Contributor, author) commented Jul 4, 2024

> For example, if a completely new set of compute nodes joins and replaces the original ones, do we need to rewrite the persisted worker id in the actors? Previously it was the parallel unit id that was persisted, so we only needed to update the mapping from parallel unit to worker node in other places.

If the worker set changes, we will trigger scaling or actor migration. In both cases, we only need to update the WorkerId in the ActorStatus field. Previously this was complicated because we also had to update the fragment mapping, but that's no longer necessary; we only need to change the actor's location.

scaling: https://github.com/risingwavelabs/risingwave/pull/17523/files#diff-64e58f6fb6513a71def5d29b781d04f146c4f594d2678b916e003cc1c00ded73L1449-L1451
migration: https://github.com/risingwavelabs/risingwave/pull/17523/files#diff-243aa8215b544fab44806f8910e08aab06133d79de2d84e4f65ed056bf413ac5R866-R868

In fact, in the future it may not be necessary to store the Actor's location, or even to store Actor information at all. If the algorithm is stable enough, each Meta startup recovery can recalculate the correct Actor information.

@shanicky shanicky force-pushed the peng/remove-pu-union branch 4 times, most recently from 777760b to 1483ce5 on July 19, 2024 07:59
shanicky added 19 commits July 22, 2024 13:13
Refactor `scale_service.rs` and remove `num_traits::Signed` from `scale.rs`.

Update rescheduling logic, test module, & integration tests.
@shanicky shanicky force-pushed the peng/remove-pu-union branch from 1483ce5 to 36f3c83 on July 22, 2024 05:26
@yezizp2012 (Member) left a comment:

LGTM! In the past two days, @shanicky has continuously tested this PR and hasn't found any issues, so I feel it can be merged first. @BugenZhao WDYT?

@shanicky shanicky added this pull request to the merge queue Jul 23, 2024
@shanicky (Contributor, author):

Let's merge this PR and roll it back if there are any critical issues.

Merged via the queue into main with commit 007e802 Jul 23, 2024
34 of 35 checks passed
@shanicky shanicky deleted the peng/remove-pu-union branch July 23, 2024 06:57