-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Utf8View
for bit_length
kernel
#6671
Conversation
Signed-off-by: Austin Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs a test, which I think will show this not working
Signed-off-by: Austin Liu <[email protected]> Add test & handle view bytes length counting Signed-off-by: Austin Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @austin362667 -- this is looking good. I started the CI and left some comments
Signed-off-by: Austin Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @austin362667
arrow-string/src/length.rs
Outdated
let bit_lengths = list | ||
.views() | ||
.iter() | ||
.map(|view| (*view as i32) * 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the length is currently treated as a u32
elsewhere in the code
However the actual Arrow spec I think says use i32 so this is probably fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also still do think you need to handle the null cases (check for which slots are null) as there might be a view of a non zero length there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Thanks @alamb for review
Signed-off-by: Austin Liu <[email protected]>
arrow-string/src/length.rs
Outdated
let values = list | ||
.views() | ||
.iter() | ||
.enumerate() | ||
.map(|(i, _)| { | ||
if list.is_valid(i) { | ||
list.views()[i] as u32 * 8 | ||
} else { | ||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #6662 (comment) it was suggested to use from_unary.
It this applicable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi Thanks for bringing this up. I tried it and it works trivially well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using from_unary reverts the optimisation of only inspecting the views, and not the string data.
Perhaps you might use the example I wrote in #6671 (comment)
arrow-string/src/length.rs
Outdated
@@ -440,7 +465,7 @@ mod tests { | |||
let array = StringArray::from(input); | |||
let result = bit_length(&array).unwrap(); | |||
assert_eq!(len, result.len()); | |||
let result = result.as_any().downcast_ref::<Int32Array>().unwrap(); | |||
let result = result.as_any().downcast_ref::<UInt32Array>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that bit_length should not beed to return signed integer, but it seems better not to change this within this PR.
Signed-off-by: Austin Liu <[email protected]>
arrow-string/src/length.rs
Outdated
let values = (0..list.len()) | ||
.map(|i| { | ||
if list.is_valid(i) { | ||
list.views()[i] as u32 * 8 | ||
} else { | ||
0 | ||
} | ||
}) | ||
.collect::<Vec<u32>>(); | ||
Ok(Arc::new(UInt32Array::new( | ||
values.into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let values = (0..list.len()) | |
.map(|i| { | |
if list.is_valid(i) { | |
list.views()[i] as u32 * 8 | |
} else { | |
0 | |
} | |
}) | |
.collect::<Vec<u32>>(); | |
Ok(Arc::new(UInt32Array::new( | |
values.into(), | |
let values = list.views().iter().map(|view| (view as i32).wrapping_mul(8)).collect(); | |
Ok(Arc::new(Int32Array::new( | |
values, |
As we're doing infallible wrapping arithmetic, we can just compute this for everything, this will vectorize better, and the null mask is preserved below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Thank you so much @tustvold , you are indeed making me a better Rust programmer. Really appreciate for review. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be further simplified, I also think we should return Int32Array for consistency with the other implementations.
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Switching to from_unary has reverted the optimisation to not inspect the string data - #6671 (comment) is how to do this correctly |
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @austin362667 and @tustvold and @findepi
Utf8View
for string function bit_length()
Utf8View
for bit_length
kernel
Which issue does this PR close?
Closes apache/datafusion#13195
Rationale for this change
Thanks to @jayzhan211 , he noticed following issue, array compute kernel
bit_length()
doesn't supportUtf8View
type:caused the error:
What changes are included in this PR?
Update
bit_length()
array function to supportUtf8View
Are there any user-facing changes?