-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: use stream_key
as MV's distribution key
#12824
Comments
+1 to let MV's distribution key be the same as the primary key. This can make backfilling more scalable and prevent potential data skewing. Though it will somehow hurt point select performance, we have indexes to speed up this workload. |
I recommend limiting the scope to Join only i.e. |
Apart from optimizing the mv distribution key, is it possible to provide a way to let backfill use the index state table? If possible, it will be much easier to handle the data skewness problem in the future because user can just create an index. I view backfill as some kind of full table scan batch query. Since index can accelerate batch query, I think it is reasonable to use index for backfill. |
Yes. Previously we used indexes for backfilling in the delta join case. Theoretically, using indexes can also be used to speed up backfilling workloads like |
The con is that the index cannot be dropped if it is used in streaming queries. |
Is it possible that use a stream key but excluding the hidden join key? This is because:
|
I think this property is the |
If we do that, we should compact the chunk with the stream key to avoid the mis-order problem |
We just want to use unique keys as the distribution keys, so it is fine. The stream key is not changed. |
oh, you are right, we only need to nesure that the distribution keys is the subset of the stream key. |
Currently, the MV's distribution key is decided by the last executor before Materialize. This works well for group-by query, but may potentially perform bad for Join.
Particularly, consider fact table join dimension table as an example. Using join key as MV's distribute key
nation
table in TPC-H).Example: The MV is unevenly distributed, which caused highly skewed throughput in
BackfillExecutor
.Proposal
I think it's better to use the
stream_key
as the distribution key (as well as PK). Regarding Join, thederive_stream_key
procedure keeps the stream key from both sides, which guarantees the MV can be evenly distributed.Furthermore, #12820 will significantly reduce the length (# columns) of the derived stream key. In most cases, such as star/snowflake-schema join, the MV's PK will merely consist of the PK of the central fact table.
Besides, because the distribution key has such a huge impact on query performance, I would like to introduce a
distribued by <column_list>
syntax as well, allowing users to specify the distribution column manually (however, hidden column might be a problem...)Implementation
Under the hood, we need to introduce an extra shuffle and an extra "epoch-level DataChunk compaction" (See #12458) before
MaterializeExecutor
.The text was updated successfully, but these errors were encountered: