Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(type): refactor ListValue using array #13402

Merged
merged 17 commits into from
Nov 29, 2023
Merged

Conversation

wangrunji0408
Copy link
Contributor

@wangrunji0408 wangrunji0408 commented Nov 14, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR changes the inner structure of ListValue from Box<[Datum]> to ArrayImpl. Now creating a ListValue requires its type in advance, just like array.

Pros:

  1. Elements in the list are forced to be the same type. And we can get the type.
  2. ListRef can be unified to a slice of an array.
  3. The unnest operation naturally becomes a slice of underlying array.
  4. Memory allocations for varchar[] values have been significantly reduced. (n -> 3)

Cons:

  1. It needs more memory allocations for primitive type lists. (1 -> 3 if has null, -> 2 if non null)
  2. It consumes more memory for small lists. (roughly when n <= 4, see the chart below)
cmp

For example, {1, null} :: bigint[] currently takes 1 malloc + 48 bytes, and will take 3 mallocs + 96 bytes after this PR.

// before: heap size = 48
Box<[Datum]>            // 24 * 2

// after:  heap size = 96
Box<ArrayImpl::Int64 {  // 72
    bitmap: Bitmap,     // 8
    values: Box<[i64]>, // 8 * 2
}>

I'm not 100% sure if this refactoring is a positive one. It seems worthwhile if we assume the list is not too small, and there are considerable uses for string lists. 🤔

This PR also:

  1. reduces the size of ArrayImpl from 88 bytes to 72 bytes.
  2. adds a lot of ListValue::from_iter functions to simplify list construction in unit tests.
  3. rewrites array_agg in stateful style, because ListValue becomes immutable.
  4. implements Clone and EstimateSize for array builders, as required by 3.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

@liurenjie1024
Copy link
Contributor

I think most operations are operated on ListArray rather ListValue? Is there any special case you want to improve?

@wangrunji0408
Copy link
Contributor Author

wangrunji0408 commented Nov 14, 2023

I think most operations are operated on ListArray rather ListValue? Is there any special case you want to improve?

Yes. The main purpose of this change is to ensure the same type for all elements. And it seems natural to reuse arrays in this case. There are 2 expressions (some-all and array-transform) that may benefit from the union on ListValue and array. Except them, there seems to be no other cases that can be improved in performance.

@liurenjie1024
Copy link
Contributor

The main purpose of this change is to ensure the same type for all elements. And it seems natural to reuse arrays in this case.

This sounds reasonable to me.

I think it's ok to move on.

Copy link
Contributor

@TennyZhuang TennyZhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many changes. Generally LGTM

src/common/src/array/arrow.rs Outdated Show resolved Hide resolved
src/common/src/array/list_array.rs Outdated Show resolved Hide resolved
@stdrc stdrc self-requested a review November 15, 2023 03:04
@BugenZhao
Copy link
Member

  1. reduces the size of ArrayImpl from 88 bytes to 72 bytes.

Instead of Box<ArrayImpl>, what about boxing every variant of ArrayImpl? 🥵

  • It needs more memory allocations for primitive type lists. (1 -> 3 if has null, -> 2 if non null)
  • It consumes more memory for small lists. (roughly when n <= 4, see the chart below)

Array is not optimized for low cardinality. Is it possible to leverage something like smallvec as the small "list" optimization?

@wangrunji0408
Copy link
Contributor Author

wangrunji0408 commented Nov 15, 2023

Instead of Box<ArrayImpl>, what about boxing every variant of ArrayImpl? 🥵

Do you mean Box<dyn Array>? 🤔

Array is not optimized for low cardinality. Is it possible to leverage something like smallvec as the small "list" optimization?

It's possible. But it adds complexity and I'm not sure if it is worthy. 🥵

Anyway, considering that "most operations are operated on ListArray rather ListValue", I think a little performance degradation on small list is acceptable.

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look great to me!!

src/common/src/array/list_array.rs Show resolved Hide resolved
Comment on lines +438 to +442
impl FromIterator<bool> for ListValue {
fn from_iter<I: IntoIterator<Item = bool>>(iter: I) -> Self {
Self::new(iter.into_iter().collect::<BoolArray>().into())
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to implement on generic scalar types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. But I have a feeling that it requires an advanced type exercise. 🤪

Comment on lines +456 to +460
impl FromIterator<ListValue> for ListValue {
fn from_iter<I: IntoIterator<Item = ListValue>>(iter: I) -> Self {
Self::new(iter.into_iter().collect::<ListArray>().into())
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be possibly misused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm, I think it's clear that collecting a list of lists produces a list of lists. 😇

src/common/src/array/list_array.rs Show resolved Hide resolved
src/common/src/array/list_array.rs Show resolved Hide resolved
@BugenZhao
Copy link
Member

Do you mean Box<dyn Array>? 🤔

Nope. I mean

enum ArrayImpl {
  I16(Box<I16Array>)
  ...
}

It's possible. But it adds complexity and I'm not sure if it is worthy. 🥵

Anyway, considering that "most operations are operated on ListArray rather ListValue", I think a little performance degradation on small list is acceptable.

That makes sense. 👍 But it seems like we're heavily relying on ListValue and ArrayBuilder in array expressions. It will be better if we can...

  • use smallvec for builder,
  • or directly write to the list array builder.

@wangrunji0408
Copy link
Contributor Author

wangrunji0408 commented Nov 15, 2023

enum ArrayImpl {
  I16(Box<I16Array>)
  ...
}

Not recommended as it brings another indirection and malloc to other ArrayImpl usages.

or directly write to the list array builder.

Yes I'll do that in #13263.

Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest looks great to me!

src/expr/impl/src/scalar/array_remove.rs Outdated Show resolved Hide resolved
@wangrunji0408 wangrunji0408 added this pull request to the merge queue Nov 29, 2023
Copy link

codecov bot commented Nov 29, 2023

Codecov Report

Attention: 250 lines in your changes are missing coverage. Please review.

Comparison is base (ff90f8c) 68.31% compared to head (55369df) 68.28%.
Report is 5 commits behind head on main.

Files Patch % Lines
src/expr/core/src/expr/expr_some_all.rs 0.00% 43 Missing ⚠️
src/common/src/array/list_array.rs 85.66% 41 Missing ⚠️
src/common/src/field_generator/mod.rs 0.00% 24 Missing ⚠️
src/connector/src/parser/unified/avro.rs 0.00% 12 Missing ⚠️
src/expr/impl/src/scalar/regexp.rs 0.00% 10 Missing ⚠️
src/common/src/array/struct_array.rs 40.00% 9 Missing ⚠️
src/connector/src/parser/unified/json.rs 0.00% 8 Missing ⚠️
src/expr/impl/src/aggregate/bit_and.rs 0.00% 8 Missing ⚠️
src/expr/impl/src/aggregate/bit_or.rs 0.00% 8 Missing ⚠️
...pr/impl/src/aggregate/approx_count_distinct/mod.rs 0.00% 7 Missing ⚠️
... and 26 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13402      +/-   ##
==========================================
- Coverage   68.31%   68.28%   -0.04%     
==========================================
  Files        1523     1523              
  Lines      261834   261481     -353     
==========================================
- Hits       178879   178555     -324     
+ Misses      82955    82926      -29     
Flag Coverage Δ
rust 68.28% <63.07%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Merged via the queue into main with commit 13ba630 Nov 29, 2023
26 of 27 checks passed
@wangrunji0408 wangrunji0408 deleted the wrj/array-as-list branch November 29, 2023 08:17
Comment on lines 23 to +24
fn unnest(list: ListRef<'_>) -> impl Iterator<Item = Option<ScalarRefImpl<'_>>> {
list.flatten().into_iter()
list.flatten().iter()
Copy link
Member

@xxchan xxchan Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still suboptimal? We are still iterating over the list slices. Can we directly reinterpret the whole ListArray to the underlying ArrayImpl without iteration? 🤔

(BTW, I come here when I was thinking about expressions for the map type and found we don't seem to have vectorized exprs for list also. sth like https://github.com/apache/datafusion/pull/7825/files#diff-1cddf6dfddea9d7c2e299e973bf915176256222281028f2571b1662ed92cd58dR190-R195, other examples include our array_positions, which can be vectorized but not

Copy link
Member

@xxchan xxchan Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ListArray::map_inner (which is only used to implement ArrayTransformExpression) looks like what I mean.

Copy link
Member

@BugenZhao BugenZhao Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still iterating over the list slices. Can we directly reinterpret the whole ListArray to the underlying ArrayImpl without iteration? 🤔

I don't quite get it. Can you elaborate more?


Oh, you're referring to vectorized execution of unnest. The optimization seems too specific because it requires unnest to be the only expression in the ProjectSet operator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see map_inner?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map_inner manipulates the underlying entires child array of the list array directly. Generally, I feel we haven't utilize such optimization a lot. Not specific to unnest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specific to unnest.

I wasn't specifically talking about unnest. I'm referring to the vectorized execution of any set-returning functions for list type. Not manipulating the offsets array only works if we don't need to know the output length of each list scalar, which requires that there's no other set-returning or scalar functions in the same ProjectSet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, actually I was only thinking about scalar functions.

But I'm not sure I understand what you mean. I feel it also work for ProjectSet:

Specifically, TableFunction will return a chunk with 2 columns (row_idx, and values ).

/// ```text
/// select generate_series(1, x) from t(x);
///
/// # input chunk output chunks (TableFunction) output chunks (ProjectSet)
/// 1 --------------> 0 1 -------------------------> 0 1
/// 2 --------------> 1 1 -------------------------> 0 1
/// 3 ----┐ --- ---
/// │ 1 2 1 2
/// └---------> 2 1 -------------------------> 0 1
/// --- ---
/// 2 2 1 2
/// 2 3 2 3
/// row idx--^ ^--values projected_row_id--^ ^--values
/// ```

For unnest:

    /// # input chunk     output chunks
    /// [1,2]   --------------> 0 1
    /// [3,4,5] -----┐          0 2
    ///              │    
    ///              └--------> 1 3
    ///                         1 4
    ///                         1 5
    ///                row idx--^ ^--values

To build row_idx, we still need to iterate over list_array.iter() though, but values seems to be exactly the same as list_array.inner, so we can directly clone it. In the current implementation, we have an extra copy when building values.

Copy link
Contributor Author

@wangrunji0408 wangrunji0408 Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still suboptimal? We are still iterating over the list slices. Can we directly reinterpret the whole ListArray to the underlying ArrayImpl without iteration? 🤔

Yes, of course. However, this specialization has to be done with #[build_function]. The current implementation is more obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but values seems to be exactly the same as list_array.inner, so we can directly clone it. In the current implementation, we have an extra copy when building values.

You're right. There's indeed an extra allocation. But anyhow we still have to do copying when reconstructing the output for a ProjectSet.

if let Some(chunk) = builder.append_row(op, &*row) {

On the other hand, if we can have an owned version of ListRef (like holding an ArrayRef instead of &'a ArrayImpl) and directly let eval returning an iterator of ListRef, we can also achieve zero-copy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants