bug: source splits can be unevenly assigned to workers when there are too many actors #14333

Open
fuyufjh opened this issue Jan 3, 2024 · 11 comments

Comments

@fuyufjh
Member

fuyufjh commented Jan 3, 2024

When the memory usage of the three CNs is uneven, it will OOM.

Yeah. I'd like to focus on this abnormality first. It started with reglngvty-20231228-150237, while reglngvty-20231227-150231 looked normal.

The number of actors per node is even... 🤔

The source split assignment has been uneven since nightly-20231228:

count(source_partition_input_bytes{namespace=~"$namespace",risingwave_name=~"$instance",risingwave_component=~"$component",pod=~"$pod"}) by (pod)

reglngvty-20231228-150237 (nightly-20231228): [screenshot]

reglngvty-20231227-150231 (nightly-20231227): [screenshot]

Code diff: 4695ad1...aa9dcac

Any ideas? cc @shanicky

Originally posted by @fuyufjh in #14324 (comment)

@github-actions bot added this to the release-1.6 milestone Jan 3, 2024
@fuyufjh changed the title from "kafka source split uneven" to "bug: kafka source split uneven" Jan 3, 2024
@xxchan
Member

xxchan commented Jan 3, 2024

Is it possible that this was caused by #14170, especially the last commit?

@xxchan
Member

xxchan commented Jan 3, 2024

It seems the assignment was previously somewhat random (?), but now all splits are assigned to one node.

@xxchan
Member

xxchan commented Jan 3, 2024

How many source actors will be created in this case? If there are >24 actors on a node, I think #14170 will lead to the problem... 🤔

@lmatz added the type/bug ("Something isn't working") label Jan 3, 2024
@xxchan
Member

xxchan commented Jan 3, 2024

I get it. We have 3 sources, and each one has 8 partitions.

For each source, each compute node has 8 source actors because it has 8 cores, so there are 24 actors in total. Previously the splits were assigned randomly, so sometimes the assignment was relatively even (7-8-9) and sometimes uneven (3-8-13), which can OOM.

Now I have added a comparison by actor id to make the assignment deterministic, so all 8 splits will be assigned to the 8 actors with the lowest actor ids (which are on the same node). This is the same for all 3 sources, so one node will be assigned all 24 splits.

The previous behavior is also not ideal. How can we improve that? 🤔 Basically the problem is that SourceManager is currently not aware of the cluster topology.
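
To illustrate the effect, here is a minimal sketch (not the actual SourceManager code; the actor-id layout and the `assign_splits` helper are hypothetical): a deterministic round-robin over actors sorted by id always picks the lowest ids when there are fewer splits than actors.

```rust
use std::collections::BTreeMap;

/// Assign splits to actors round-robin after sorting the actors by id.
fn assign_splits(actor_ids: &[u32], split_ids: &[u32]) -> BTreeMap<u32, Vec<u32>> {
    let mut actors = actor_ids.to_vec();
    actors.sort(); // the deterministic compare by actor id
    let mut assignment: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for (i, split) in split_ids.iter().enumerate() {
        // With fewer splits than actors, only the first `split_ids.len()`
        // actors (the lowest ids) ever receive a split.
        let actor = actors[i % actors.len()];
        assignment.entry(actor).or_default().push(*split);
    }
    assignment
}

fn main() {
    // Hypothetical layout: 24 source actors, ids 1..=8 on CN 1,
    // 9..=16 on CN 2, 17..=24 on CN 3.
    let actors: Vec<u32> = (1..=24).collect();
    // 8 Kafka partitions (splits) of one source.
    let splits: Vec<u32> = (0..8).collect();
    // All 8 splits land on actors 1..=8, i.e. entirely on CN 1; the same
    // happens for each of the 3 sources, so CN 1 ends up with all 24 splits.
    println!("{:?}", assign_splits(&actors, &splits));
}
```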

@lmatz
Contributor

lmatz commented Jan 3, 2024

https://buildkite.com/risingwave-test/longevity-test/builds/883#018ccab2-3a2b-4d12-aace-6757affb4abe

[screenshots: SCR-20240103-qoz, SCR-20240103-qo5]

1 topic, 1 unified source, parallelism 3.

One MV is created for each logical source, 3 MVs in total,

and there are 8*25 nexmark MVs built on top of these 3 base MVs.

@xxchan
Member

xxchan commented Jan 3, 2024

Wait, if streaming_parallelism is 3, my reasoning above doesn't seem correct. 🤡

@xxchan
Member

xxchan commented Jan 3, 2024

But we do have 8 actors on a node; is there something wrong?

[screenshot]

@xxchan
Member

xxchan commented Jan 3, 2024

I guess that in the benchmark script, the parallelism only affects the query MVs, not the 3 source MVs, so my reasoning still applies.

@shanicky
Contributor

shanicky commented Jan 3, 2024

But we do have 8 actors on a node, is there something wrong?

[screenshot]

Only actors that were assigned splits are displayed here. You can check the actor panel, where you will find that out of 96 actors, only the first 8 were assigned splits. In other words, only the first CN was assigned splits.

@xxchan changed the title from "bug: kafka source split uneven" to "bug: source splits can be unevenly assigned to workers when there are too many actors" Jan 5, 2024
@xxchan modified the milestones: release-1.6 → release-1.7 Jan 9, 2024
@shanicky
Contributor

shanicky commented Mar 6, 2024

Shall we close this issue as fixed? cc @xxchan

@xxchan
Member

xxchan commented Mar 6, 2024

I'm not sure. We haven't implemented rack-aware scheduling, so the problem can still happen. Do you think it's not a big concern and that we won't implement it in the near future? 🤔
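
For reference, one direction such scheduling could take (a hypothetical sketch, not RisingWave's actual scheduler; the `actors_by_node` layout and the `assign_splits_node_aware` helper are made up for illustration) is to spread splits round-robin across nodes first, and only then across the actors on each node:

```rust
use std::collections::BTreeMap;

/// Node-aware assignment: pick the node round-robin, then the next actor
/// on that node, so splits spread across nodes before piling onto one.
fn assign_splits_node_aware(
    actors_by_node: &BTreeMap<String, Vec<u32>>, // node name -> actor ids
    split_ids: &[u32],
) -> BTreeMap<u32, Vec<u32>> {
    let nodes: Vec<&String> = actors_by_node.keys().collect();
    let mut next_actor: BTreeMap<&String, usize> = BTreeMap::new();
    let mut assignment: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for (i, split) in split_ids.iter().enumerate() {
        let node = nodes[i % nodes.len()];
        let actors = &actors_by_node[node];
        let idx = next_actor.entry(node).or_insert(0);
        let actor = actors[*idx % actors.len()];
        *idx += 1;
        assignment.entry(actor).or_default().push(*split);
    }
    assignment
}

fn main() {
    // Hypothetical cluster: 3 CNs with 8 source actors each.
    let mut actors_by_node = BTreeMap::new();
    actors_by_node.insert("cn-0".to_string(), (1..=8).collect::<Vec<u32>>());
    actors_by_node.insert("cn-1".to_string(), (9..=16).collect::<Vec<u32>>());
    actors_by_node.insert("cn-2".to_string(), (17..=24).collect::<Vec<u32>>());
    // 8 splits end up spread 3-3-2 across the three nodes instead of 8-0-0.
    let splits: Vec<u32> = (0..8).collect();
    println!("{:?}", assign_splits_node_aware(&actors_by_node, &splits));
}
```

The trade-off is that the scheduler then needs to know which node each actor runs on, which is exactly the cluster awareness SourceManager currently lacks.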

@shanicky modified the milestones: release-1.7 → release-1.8 Mar 13, 2024
@shanicky modified the milestones: release-1.8 → release-1.10 May 8, 2024
@shanicky modified the milestones: release-1.10 → release-1.11 Jul 10, 2024