Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: use stream_key as MV's distribution key #12824

Closed
fuyufjh opened this issue Oct 12, 2023 · 10 comments · Fixed by #13022
Closed

Discussion: use stream_key as MV's distribution key #12824

fuyufjh opened this issue Oct 12, 2023 · 10 comments · Fixed by #13022
Assignees
Milestone

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Oct 12, 2023

Currently, the MV's distribution key is decided by the last executor before Materialize. This works well for group-by query, but may potentially perform bad for Join.

274604697-777bd1f5-45d8-44d5-b2e9-06198f6f6826

Particularly, consider fact table join dimension table as an example. Using join key as MV's distribute key

  • Pros. It saves one shuffle
  • Cons. The join key is very likely to be unbalanced, because the join key i.e. PK of the dimension table not only has a smaller cardinality, but also the data distribution in the fact table may be very uneven (imagine the nation table in TPC-H).

Example: The MV is unevenly distributed, which caused highly skewed throughput in BackfillExecutor .

dev=> select rw_vnode(dist_key_column)  as vnode, count(*) from some_table group by vnode;
 vnode |  count  
-------+---------
...
    48 |     243
    44 |    3810
   176 |   14081
    62 |     222
   169 |     452
    45 |      51
   147 |    2683
   149 |     192
   213 |      57
    94 |    1100
   192 |      83
   199 |       3
   205 |    1479
   142 |    2999
    63 | 1719661  <-- THE HEAVY HITTER!!!
    58 | 1790350  <-- THE HEAVY HITTER!!!
   242 |   10259
   226 |     784
    13 |    7497
....
(61 rows)
image

Proposal

I think it's better to use the stream_key as the distribution key (as well as PK). Regarding Join, the derive_stream_key procedure keeps the stream key from both sides, which guarantees the MV can be evenly distributed.

Furthermore, #12820 will significantly reduce the length (# columns) of the derived stream key. In most cases, such as star/snowflake-schema join, the MV's PK will merely consist of the PK of the central fact table.

Besides, because the distribution key has such a huge impact on query performance, I would like to introduce a distribued by <column_list> syntax as well, allowing users to specify the distribution column manually (however, hidden column might be a problem...)

Implementation

Under the hood, we need to introduce an extra shuffle and an extra "epoch-level DataChunk compaction" (See #12458) before MaterializeExecutor.

@github-actions github-actions bot added this to the release-1.4 milestone Oct 12, 2023
@chenzl25
Copy link
Contributor

+1 to let MV's distribution key be the same as the primary key. This can make backfilling more scalable and prevent potential data skewing. Though it will somehow hurt point select performance, we have indexes to speed up this workload.

@fuyufjh
Copy link
Member Author

fuyufjh commented Oct 13, 2023

  • For GroupTopN, it seems not worth - The distribution key is the group key, which is good enough, no need to include the TopN's PK as well.
  • For TopN and Group-By, nothing to change.
  • For Join, this is the motivation, of course.
  • (Don't know whether there are more cases)

I recommend limiting the scope to Join only i.e. Join is immediately before Materialize as shown in that diagram, because this is a clear bad case.

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 24, 2023

Apart from optimizing the mv distribution key, is it possible to provide a way to let backfill use the index state table? If possible, it will be much easier to handle the data skewness problem in the future because user can just create an index.

I view backfill as some kind of full table scan batch query. Since index can accelerate batch query, I think it is reasonable to use index for backfill.

@chenzl25
Copy link
Contributor

Yes. Previously we used indexes for backfilling in the delta join case. Theoretically, using indexes can also be used to speed up backfilling workloads like create materialized view mv as select * from t where idx_col = xxx. However, this case is rare in streaming queries.

@chenzl25
Copy link
Contributor

chenzl25 commented Oct 24, 2023

The con is that the index cannot be dropped if it is used in streaming queries.

@fuyufjh
Copy link
Member Author

fuyufjh commented Oct 24, 2023

Is it possible that use a stream key but excluding the hidden join key?

This is because:

  • The hidden join key was only introduced to avoid the disordering problem brought by Join, somehow it's a problem of implementation, so I prefer to hide it and not expose this to users.
  • If there is a data skew, we can clearly tell the user that why your data is skewed according to the PK. (On the opposite, we have to explain the hidden join key, which was soooooo difficult)
  • Once we have sink-into-table, this behavior will be consistent with that.

@chenzl25
Copy link
Contributor

Is it possible that use a stream key but excluding the hidden join key?

I think this property is the unique_key property. But currently, we just have stream_key property. I think we can change to it after we implement the unique_key.

@st1page
Copy link
Contributor

st1page commented Oct 24, 2023

Is it possible that use a stream key but excluding the hidden join key?

I think this property is the unique_key property. But currently, we just have stream_key property. I think we can change to it after we implement the unique_key.

If we do that, we should compact the chunk with the stream key to avoid the mis-order problem

@chenzl25
Copy link
Contributor

We just want to use unique keys as the distribution keys, so it is fine. The stream key is not changed.

@st1page
Copy link
Contributor

st1page commented Oct 24, 2023

We just want to use unique keys as the distribution keys, so it is fine. The stream key is not changed.

oh, you are right, we only need to nesure that the distribution keys is the subset of the stream key.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants