-
Notifications
You must be signed in to change notification settings - Fork 595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(type): refactor ListValue
using array
#13402
Conversation
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
I think most operations are operated on |
Yes. The main purpose of this change is to ensure the same type for all elements. And it seems natural to reuse arrays in this case. There are 2 expressions (some-all and array-transform) that may benefit from the union on |
This sounds reasonable to me. I think it's ok to move on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many changes. Generally LGTM
Instead of
|
Do you mean
It's possible. But it adds complexity and I'm not sure if it is worthy. 🥵 Anyway, considering that "most operations are operated on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look great to me!!
impl FromIterator<bool> for ListValue { | ||
fn from_iter<I: IntoIterator<Item = bool>>(iter: I) -> Self { | ||
Self::new(iter.into_iter().collect::<BoolArray>().into()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to implement on generic scalar types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. But I have a feeling that it requires an advanced type exercise. 🤪
impl FromIterator<ListValue> for ListValue { | ||
fn from_iter<I: IntoIterator<Item = ListValue>>(iter: I) -> Self { | ||
Self::new(iter.into_iter().collect::<ListArray>().into()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be possibly misused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, I think it's clear that collecting a list of lists produces a list of lists. 😇
Nope. I mean enum ArrayImpl {
I16(Box<I16Array>)
...
}
That makes sense. 👍 But it seems like we're heavily relying on
|
Not recommended as it brings another indirection and malloc to other
Yes I'll do that in #13263. |
Signed-off-by: Runji Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest looks great to me!
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Signed-off-by: Runji Wang <[email protected]>
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #13402 +/- ##
==========================================
- Coverage 68.31% 68.28% -0.04%
==========================================
Files 1523 1523
Lines 261834 261481 -353
==========================================
- Hits 178879 178555 -324
+ Misses 82955 82926 -29
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
fn unnest(list: ListRef<'_>) -> impl Iterator<Item = Option<ScalarRefImpl<'_>>> { | ||
list.flatten().into_iter() | ||
list.flatten().iter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still suboptimal? We are still iterating over the list slices. Can we directly reinterpret the whole ListArray
to the underlying ArrayImpl
without iteration? 🤔
(BTW, I come here when I was thinking about expressions for the map
type and found we don't seem to have vectorized exprs for list
also. sth like https://github.com/apache/datafusion/pull/7825/files#diff-1cddf6dfddea9d7c2e299e973bf915176256222281028f2571b1662ed92cd58dR190-R195, other examples include our array_positions
, which can be vectorized but not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, ListArray::map_inner
(which is only used to implement ArrayTransformExpression
) looks like what I mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are still iterating over the list slices. Can we directly reinterpret the whole
ListArray
to the underlyingArrayImpl
without iteration? 🤔
I don't quite get it. Can you elaborate more?
Oh, you're referring to vectorized execution of unnest
. The optimization seems too specific because it requires unnest
to be the only expression in the ProjectSet
operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see map_inner
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map_inner
manipulates the underlying entires
child array of the list array directly. Generally, I feel we haven't utilize such optimization a lot. Not specific to unnest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not specific to
unnest
.
I wasn't specifically talking about unnest
. I'm referring to the vectorized execution of any set-returning functions for list
type. Not manipulating the offsets
array only works if we don't need to know the output length of each list
scalar, which requires that there's no other set-returning or scalar functions in the same ProjectSet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, actually I was only thinking about scalar functions.
But I'm not sure I understand what you mean. I feel it also work for ProjectSet
:
Specifically, TableFunction
will return a chunk
with 2 columns (row_idx
, and values
).
risingwave/src/expr/core/src/table_function/mod.rs
Lines 82 to 95 in dcb565e
/// ```text | |
/// select generate_series(1, x) from t(x); | |
/// | |
/// # input chunk output chunks (TableFunction) output chunks (ProjectSet) | |
/// 1 --------------> 0 1 -------------------------> 0 1 | |
/// 2 --------------> 1 1 -------------------------> 0 1 | |
/// 3 ----┐ --- --- | |
/// │ 1 2 1 2 | |
/// └---------> 2 1 -------------------------> 0 1 | |
/// --- --- | |
/// 2 2 1 2 | |
/// 2 3 2 3 | |
/// row idx--^ ^--values projected_row_id--^ ^--values | |
/// ``` |
For unnest
:
/// # input chunk output chunks
/// [1,2] --------------> 0 1
/// [3,4,5] -----┐ 0 2
/// │
/// └--------> 1 3
/// 1 4
/// 1 5
/// row idx--^ ^--values
To build row_idx
, we still need to iterate over list_array.iter()
though, but values
seems to be exactly the same as list_array.inner
, so we can directly clone
it. In the current implementation, we have an extra copy when building values
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still suboptimal? We are still iterating over the list slices. Can we directly reinterpret the whole
ListArray
to the underlyingArrayImpl
without iteration? 🤔
Yes, of course. However, this specialization has to be done with #[build_function]
. The current implementation is more obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but
values
seems to be exactly the same aslist_array.inner
, so we can directlyclone
it. In the current implementation, we have an extra copy when buildingvalues
.
You're right. There's indeed an extra allocation. But anyhow we still have to do copying when reconstructing the output for a ProjectSet
.
if let Some(chunk) = builder.append_row(op, &*row) { |
On the other hand, if we can have an owned version of ListRef
(like holding an ArrayRef
instead of &'a ArrayImpl
) and directly let eval
returning an iterator of ListRef
, we can also achieve zero-copy.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR changes the inner structure of
ListValue
fromBox<[Datum]>
toArrayImpl
. Now creating aListValue
requires its type in advance, just like array.Pros:
ListRef
can be unified to a slice of an array.unnest
operation naturally becomes a slice of underlying array.varchar[]
values have been significantly reduced. (n -> 3)Cons:
For example,
{1, null} :: bigint[]
currently takes 1 malloc + 48 bytes, and will take 3 mallocs + 96 bytes after this PR.I'm not 100% sure if this refactoring is a positive one. It seems worthwhile if we assume the list is not too small, and there are considerable uses for string lists. 🤔
This PR also:
ArrayImpl
from 88 bytes to 72 bytes.ListValue::from_iter
functions to simplify list construction in unit tests.array_agg
in stateful style, becauseListValue
becomes immutable.Clone
andEstimateSize
for array builders, as required by 3.Checklist
./risedev check
(or alias,./risedev c
)Documentation