Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Utf8View for bit_length kernel #6671

Merged
merged 9 commits into from
Nov 5, 2024

Conversation

austin362667
Copy link
Contributor

Which issue does this PR close?

Closes apache/datafusion#13195

Rationale for this change

Thanks to @jayzhan211 , he noticed following issue, array compute kernel bit_length() doesn't support Utf8View type:

create table test_source as values
  ('Andrew', 'X'),
  ('Xiangpeng', 'Xiangpeng'),
  ('Raphael', 'R'),
  (NULL, 'R');
create table test as
SELECT
  arrow_cast(column1, 'Utf8View') as column1_utf8view
FROM test_source;

select bit_length(column1_utf8view) from test;

caused the error:

query error DataFusion error: Arrow error: Compute error: bit_length not supported for Utf8View

What changes are included in this PR?

Update bit_length() array function to support Utf8View

Are there any user-facing changes?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a test, which I think will show this not working

arrow-string/src/length.rs Outdated Show resolved Hide resolved
Signed-off-by: Austin Liu <[email protected]>

Add test & handle view bytes length counting

Signed-off-by: Austin Liu <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @austin362667 -- this is looking good. I started the CI and left some comments

arrow-string/src/length.rs Outdated Show resolved Hide resolved
arrow-string/src/length.rs Outdated Show resolved Hide resolved
Signed-off-by: Austin Liu <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @austin362667

let bit_lengths = list
.views()
.iter()
.map(|view| (*view as i32) * 8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the length is currently treated as a u32 elsewhere in the code

However the actual Arrow spec I think says use i32 so this is probably fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also still do think you need to handle the null cases (check for which slots are null) as there might be a view of a non zero length there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Thanks @alamb for review

arrow-string/src/length.rs Outdated Show resolved Hide resolved
Comment on lines 145 to 153
let values = list
.views()
.iter()
.enumerate()
.map(|(i, _)| {
if list.is_valid(i) {
list.views()[i] as u32 * 8
} else {
0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #6662 (comment) it was suggested to use from_unary.
It this applicable here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi Thanks for bringing this up. I tried it and it works trivially well.

Copy link
Contributor

@tustvold tustvold Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using from_unary reverts the optimisation of only inspecting the views, and not the string data.

Perhaps you might use the example I wrote in #6671 (comment)

arrow-string/src/length.rs Outdated Show resolved Hide resolved
@@ -440,7 +465,7 @@ mod tests {
let array = StringArray::from(input);
let result = bit_length(&array).unwrap();
assert_eq!(len, result.len());
let result = result.as_any().downcast_ref::<Int32Array>().unwrap();
let result = result.as_any().downcast_ref::<UInt32Array>().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that bit_length should not beed to return signed integer, but it seems better not to change this within this PR.

arrow-string/src/length.rs Show resolved Hide resolved
Signed-off-by: Austin Liu <[email protected]>
Comment on lines 143 to 153
let values = (0..list.len())
.map(|i| {
if list.is_valid(i) {
list.views()[i] as u32 * 8
} else {
0
}
})
.collect::<Vec<u32>>();
Ok(Arc::new(UInt32Array::new(
values.into(),
Copy link
Contributor

@tustvold tustvold Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let values = (0..list.len())
.map(|i| {
if list.is_valid(i) {
list.views()[i] as u32 * 8
} else {
0
}
})
.collect::<Vec<u32>>();
Ok(Arc::new(UInt32Array::new(
values.into(),
let values = list.views().iter().map(|view| (view as i32).wrapping_mul(8)).collect();
Ok(Arc::new(Int32Array::new(
values,

As we're doing infallible wrapping arithmetic, we can just compute this for everything, this will vectorize better, and the null mask is preserved below

Copy link
Contributor Author

@austin362667 austin362667 Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Thank you so much @tustvold , you are indeed making me a better Rust programmer. Really appreciate for review. 👍

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be further simplified, I also think we should return Int32Array for consistency with the other implementations.

Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
@austin362667
Copy link
Contributor Author

Thanks @tustvold @alamb @findepi for the review.

  • I roll back to i32 rather than u32 to make code consistent.
  • Use from_unary suggested from @findepi
  • Add bit_length_null_utf8view test cases suggested by @tustvold .

@tustvold
Copy link
Contributor

tustvold commented Nov 4, 2024

Switching to from_unary has reverted the optimisation to not inspect the string data - #6671 (comment) is how to do this correctly

@austin362667
Copy link
Contributor Author

Got it! Thanks @tustvold I replaced from_unary with the approach you suggested.

Signed-off-by: Austin Liu <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @austin362667 and @tustvold and @findepi

@alamb alamb changed the title Support Utf8View for string function bit_length() Support Utf8View for bit_length kernel Nov 5, 2024
@alamb alamb merged commit 350ea26 into apache:master Nov 5, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Utf8View for string function bit_length
4 participants