RFC: Spill Hash Aggregation #89

Merged · 3 commits · Dec 4, 2024
Conversation

@chenzl25 (Contributor) commented on May 13, 2024

@fuyufjh (Member) commented on May 13, 2024:

The design looks good to me. 👍

I am thinking about another approach. Instead of adapting the HashAgg/HashJoin to be spill-able by using partitioning, we might just replace the in-memory hash table with some disk-based structure, such as a B-Tree (e.g. sled), an LSM-Tree (e.g. RocksDB), or an on-disk hash table (e.g. odht). The major benefit of this disk-based approach is that it doesn't consume any memory except some page cache, which the OS can always safely evict, so the memory manager doesn't need to worry about batch memory consumption at all.

I do agree that a hybrid hash agg/join (this proposal) is the most commonly used approach in the database area, which implies it probably has the best performance when memory is not a bottleneck. But I am also curious about the performance gap between it and the disk-based approach. If the gap is small, the disk-based approach might be a good fit for RisingWave's case.
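For illustration, a minimal sketch of what this disk-based alternative could look like with sled as the backing store. The encoding of keys/states and the file path are assumptions for the sketch, not RisingWave APIs:

```rust
// Hypothetical sketch: replace the in-memory hash table with an on-disk
// B-Tree (sled). Group keys and aggregation states live on disk as bytes,
// so the working set is absorbed by the OS page cache rather than the
// query's memory budget.
fn disk_based_hash_agg(
    rows: impl Iterator<Item = (Vec<u8>, i64)>, // (encoded group key, value)
) -> sled::Result<sled::Db> {
    let db = sled::open("/tmp/agg_state_sketch")?; // on-disk B-Tree
    for (group_key, value) in rows {
        // Read-modify-write the per-group state (here: a simple i64 sum).
        let current = db
            .get(&group_key)?
            .map(|bytes| i64::from_le_bytes(bytes.as_ref().try_into().unwrap()))
            .unwrap_or(0);
        db.insert(group_key, (current + value).to_le_bytes().to_vec())?;
    }
    db.flush()?;
    Ok(db)
}
```

Note that every group lookup here may turn into a random read, which is the concern raised in the reply below.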

@chenzl25 (Contributor, Author) replied:

> I am thinking about another approach. Instead of adapting the HashAgg/HashJoin to be spill-able by using partitioning, we might just replace the in-memory hash table with some disk-based structure, such as a B-Tree (e.g. sled), an LSM-Tree (e.g. RocksDB), or an on-disk hash table (e.g. odht).

Some drawbacks I can think of for a disk-based structure:

  1. We can't precisely monitor the memory consumed by such a structure, and monitoring memory consumption in batch queries is important, because the memory monitor decides whether to spill or to return an OOM error to users.
  2. When the data size of the current query is too large to be resident in memory, IO becomes the bottleneck, so whether we issue sequential IO or random IO matters. A disk-based structure would potentially introduce random IOs. The OS page cache could mitigate this, but when memory is already insufficient, the page cache is unlikely to help much.
  3. Monitoring disk IO is also important once spilling is enabled. With a disk-based structure, it might be hard to measure the actual IO.


## Unresolved questions

The above algorithm relies on the aggregation state being serializable. As far as I know, if the agg state is `AggregateState::Any`, we can't encode the state well, so this algorithm is only applicable to `AggregateState::Datum`. The most common aggregation functions, e.g. `count`, `sum`, `min`, `max`, belong to `AggregateState::Datum`, so this is fine. Any later improvements are welcome.
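As a rough illustration of this constraint, here is a sketch with simplified, hypothetical stand-ins for the real types: a `Datum` state is a plain scalar with an obvious byte encoding, while an `Any` state wraps arbitrary in-memory state with no stable serialized form.

```rust
// Simplified, hypothetical stand-ins; not the actual RisingWave definitions.
enum Datum {
    Null,
    Int64(i64),
}

trait DynAggState {} // opaque, in-memory-only state (e.g. for custom aggregates)

enum AggregateState {
    Datum(Datum),              // count/sum/min/max: a single scalar, easy to encode
    Any(Box<dyn DynAggState>), // arbitrary state: no stable serialized form
}

/// Try to encode a state for spilling; only `Datum` states are supported.
fn encode_for_spill(state: &AggregateState) -> Option<Vec<u8>> {
    match state {
        AggregateState::Datum(Datum::Null) => Some(vec![0]),
        AggregateState::Datum(Datum::Int64(v)) => {
            let mut buf = vec![1];
            buf.extend_from_slice(&v.to_le_bytes());
            Some(buf)
        }
        // `Any` states cannot be encoded, so spilling stays disabled for such plans.
        AggregateState::Any(_) => None,
    }
}
```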
@fuyufjh (Member) commented on May 15, 2024:

I think we may convert the AggregateState (of multiple groups, together) into some form of StreamChunk first, and then reuse the serialization of StreamChunk, so that it can benefit from the columnar format of StreamChunk.
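A hedged sketch of that idea, where the chunk type below is a hypothetical stand-in rather than the actual StreamChunk API: gather the group keys and their `Datum` states into per-column vectors, then serialize the whole batch column by column.

```rust
// Hypothetical columnar layout: one column for the group key and one column
// per aggregate's Datum state. Encoding column-at-a-time mimics the benefit
// of reusing StreamChunk's columnar serialization instead of going row by row.
struct SpillChunk {
    group_keys: Vec<i64>,         // column 0: group key
    agg_states: Vec<Option<i64>>, // column 1: e.g. a nullable sum state
}

impl SpillChunk {
    fn from_groups(groups: &[(i64, Option<i64>)]) -> Self {
        Self {
            group_keys: groups.iter().map(|(k, _)| *k).collect(),
            agg_states: groups.iter().map(|(_, s)| *s).collect(),
        }
    }

    /// Column-at-a-time encoding: row count, all keys, then a validity byte
    /// followed by the value for each state.
    fn serialize(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(&(self.group_keys.len() as u32).to_le_bytes());
        for k in &self.group_keys {
            buf.extend_from_slice(&k.to_le_bytes());
        }
        for s in &self.agg_states {
            match s {
                Some(v) => {
                    buf.push(1);
                    buf.extend_from_slice(&v.to_le_bytes());
                }
                None => buf.push(0),
            }
        }
        buf
    }
}
```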

```
[proto_len]
[proto_bytes]
```
A Member commented:

I am not very sure, should we add a CRC here? Personally, I tend to add a CRC to any persisted data.

@chenzl25 (Contributor, Author) replied:
Sounds good, we can add a CRC at the end of the file.
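A small sketch of what that file layout could look like with the CRC appended at the end, using the crc32fast crate; the exact layout and function names are illustrative, not the final design:

```rust
use crc32fast::Hasher;
use std::io::Write;

// Each chunk is written as [proto_len: u32 LE][proto_bytes], and a CRC32 of
// everything written so far is appended once at the end of the file.
fn write_spill_file<W: Write>(mut out: W, chunks: &[Vec<u8>]) -> std::io::Result<()> {
    let mut hasher = Hasher::new();
    for proto_bytes in chunks {
        let len = (proto_bytes.len() as u32).to_le_bytes();
        hasher.update(&len);
        hasher.update(proto_bytes);
        out.write_all(&len)?;
        out.write_all(proto_bytes)?;
    }
    // Trailing CRC over all [proto_len][proto_bytes] pairs.
    out.write_all(&hasher.finalize().to_le_bytes())?;
    Ok(())
}
```

A reader can then recompute the CRC while decoding and compare it against the trailing 4 bytes to detect a corrupted or truncated spill file.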


### Partitions

First, we need to choose a partition number and partition both the hash table and the input chunks by it. After partitioning, each partition theoretically contains about 1/partition_num of the original data. If that fits in memory, we can process the HashAgg partition by partition. If a partition is still too large to fit in memory, we need to apply the spill algorithm recursively. When applying it recursively, we need to make sure each level uses a different hash function; otherwise all rows of a partition would hash into the same sub-partition again and the data would stay skewed.
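A minimal sketch of the partition assignment, assuming the hash is salted with the recursion level so that each round of repartitioning effectively uses a different hash function; the names here are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a row to one of `partition_num` spill partitions. Salting the hash
/// with `spill_level` makes each recursion level behave like a different hash
/// function, so a partition that is spilled again is actually re-split.
fn spill_partition<K: Hash>(group_key: &K, partition_num: u64, spill_level: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    spill_level.hash(&mut hasher); // the "different hash function" per level
    group_key.hash(&mut hasher);
    hasher.finish() % partition_num
}
```

With e.g. 20 partitions, level 0 splits the input once; if one of those partitions is still too large, level 1 re-splits it into 20 new partitions instead of mapping everything back to a single one.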
@kwannoel (Contributor) commented on Jun 14, 2024:

> When recursively applying the spill algorithm

Could you elaborate on this? Why would we need to recursively apply the spill algorithm? Don't we just spill the entire partition to disk each time?

A Contributor commented:

Why not just use vnode, instead of a separate partition strategy?

@chenzl25 (Contributor, Author) replied:

> Why would we need to recursively apply the spill algorithm?

Because we don't know how much data the input has. Even with 20 partitions by default, a single partition could still be too large to fit in memory, so we need to spill that partition again, i.e. apply the algorithm recursively.
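To make the recursion concrete, a minimal sketch of the control flow, reusing the hypothetical `spill_partition` helper from the Partitions sketch above; the row type and the memory check are simplified stand-ins for whatever the real executor uses:

```rust
use std::collections::HashMap;

const MEM_LIMIT_ROWS: usize = 100_000; // stand-in for the real memory check
const PARTITION_NUM: u64 = 20;

// Illustrative recursion: if the data fits, aggregate in memory; otherwise
// split it with a level-salted hash ("spill" each bucket) and aggregate each
// bucket independently, recursing when a bucket is still too big.
fn hash_agg_with_spill(rows: Vec<(i64, i64)>, spill_level: u64) -> HashMap<i64, i64> {
    if rows.len() <= MEM_LIMIT_ROWS {
        // In-memory hash aggregation (here: sum per group key).
        let mut agg = HashMap::new();
        for (k, v) in rows {
            *agg.entry(k).or_insert(0) += v;
        }
        return agg;
    }
    let mut buckets: Vec<Vec<(i64, i64)>> = vec![Vec::new(); PARTITION_NUM as usize];
    for (k, v) in rows {
        let p = spill_partition(&k, PARTITION_NUM, spill_level) as usize;
        buckets[p].push((k, v));
    }
    // Partitions are disjoint by key, so their results can simply be merged.
    let mut result = HashMap::new();
    for bucket in buckets {
        result.extend(hash_agg_with_spill(bucket, spill_level + 1));
    }
    result
}
```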

@chenzl25 (Contributor, Author) replied:

> Why not just use vnode, instead of a separate partition strategy?

For batch queries, we don't always have a vnode. A vnode is only associated with the data within a table, but the input of a batch hash join could be intermediate data.

@fuyufjh merged commit 24e8e6e into main on Dec 4, 2024