Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GenericListViewArray for ListView & LargeListView #5723

Merged
merged 23 commits into from
Oct 1, 2024

Conversation

Kikkon
Copy link
Contributor

@Kikkon Kikkon commented May 5, 2024

Which issue does this PR close?

A part of #5501

Rationale for this change

What changes are included in this PR?

Add GenericListViewArray implement

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 5, 2024
@Kikkon Kikkon marked this pull request as ready for review May 18, 2024 16:20
@Kikkon
Copy link
Contributor Author

Kikkon commented Jun 11, 2024

@tustvold If you have time, I would appreciate it if you could help me review this PR. Thanks!

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, been swamped and then sick, but this is looking almost there

@Kikkon
Copy link
Contributor Author

Kikkon commented Jun 16, 2024

Thank you for taking the time to review my code despite being busy and unwell.

@Kikkon Kikkon requested a review from tustvold June 16, 2024 14:20
@tustvold
Copy link
Contributor

tustvold commented Jun 26, 2024

For future reference, it is helpful for reviewers if you don't force push branches once reviews have been made, this makes it much easier to see what has changed. I'll try to find time to review this later today, but I now have to do a full review again which is quite time consuming on such a large PR

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 1, 2024

For future reference, it is helpful for reviewers if you don't force push branches once reviews have been made, this makes it much easier to see what has changed. I'll try to find time to review this later today, but I now have to do a full review again which is quite time consuming on such a large PR

@tustvold Thank you for your help. I will be more careful about the force push issue in my future commits. I am very sorry.

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 22, 2024

Hi @tustvold , perhaps when you're free this week, you could help me review this. This way, we can continue to move forward with this work. Thank you.

@tustvold
Copy link
Contributor

tustvold commented Jul 22, 2024

I'm afraid I don't have capacity to assist with this initiative further, but perhaps @alamb may do. I am reducing my involvement in arrow-rs as I no longer use it in my day job

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 24, 2024

@alamb Perhaps you can help advance this PR when you have time. 🥸

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions but overall this matches my understanding of list view arrays. Thanks for the work!

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
Comment on lines 52 to 62
impl<OffsetSize: OffsetSizeTrait> Clone for GenericListViewArray<OffsetSize> {
fn clone(&self) -> Self {
Self {
data_type: self.data_type.clone(),
nulls: self.nulls.clone(),
values: self.values.clone(),
value_offsets: self.value_offsets.clone(),
value_sizes: self.value_sizes.clone(),
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you derive the Clone impl?

Comment on lines 111 to 114
for i in 0..offsets.len() {
let length = values.len();
let offset = offsets[i].as_usize();
let size = sizes[i].as_usize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, is it faster to use offsets.iter().zip(sizes) (maybe the optimize down to the same thing?)

///
/// * `offsets.len() != sizes.len()`
/// * `offsets.len() != nulls.len()`
/// * `offsets.last() > values.len()`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies you only check the last offset but you validate all offsets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code again and realized that I had already validated all the offsets, so I changed it to offsets[i] > values.len().

length
)));
}
if offset.saturating_add(size) > values.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the above length comparison if you also have this comparison?

let mut sizes = Vec::with_capacity(iter.size_hint().0);
for size in iter {
offsets.push(OffsetSize::usize_as(acc));
acc = acc.checked_add(size).expect("usize overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor corner case nit: Imagine we are working with i32 offsets and we have 1Gi - 1 fixed size lists of size 2. Will this panic even thought it shouldn't.

acc = acc.checked_add(size).expect("usize overflow");
sizes.push(OffsetSize::usize_as(size));
}
OffsetSize::from_usize(acc).expect("offset overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this final check? It seems redundant.

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
Comment on lines 465 to 460
// SAFETY:
// ArrayData is valid, and verified type above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why SAFETY? We aren't doing anything unsafe here that I can see.

let sizes = list.value_sizes();
assert_eq!(sizes, &[3, 3, 3]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other tests might be:

  • Test where the lists overlap (e.g. sizes = [5, 5] and offsets = [0, 3] and values has length 8.
  • Test where offsets does not cover the entire range of values (e.g. values has 50 things and offsets only go up to 10)
  • List of empty lists (with an empty values)

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 31, 2024

hi @westonpace please help me review again

@Kikkon Kikkon requested a review from westonpace July 31, 2024 16:37
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few last small things.

for (offset, size) in offsets.iter().zip(sizes.iter()) {
let offset = offset.as_usize();
let size = size.as_usize();
if offset.saturating_add(size) > values.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a checked_add.

Comment on lines 396 to 409
for size in iter {
offsets.push(OffsetSize::usize_as(acc));
acc = acc.saturating_add(size);
sizes.push(OffsetSize::usize_as(size));
}
Copy link
Member

@westonpace westonpace Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did like that you were checking for overflow here, my only concern was that it was in the wrong spot. What if, before the for loop, you checked to ensure that value.len() is less than or equal to the max offset? Then you can keep using saturating_add here. Or even better, use the basic add since we've verified it will never overflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reconsidered, and it seems that no additional validation is needed here. 🤔

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@Kikkon
Copy link
Contributor Author

Kikkon commented Sep 21, 2024

@westonpace 😭 Apologies for the delay in processing this pull request due to work reasons. I hope you can help review it when you have time.

@Kikkon Kikkon requested a review from westonpace September 21, 2024 12:50
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor thing but I think this is good to go. Thanks for addressing the earlier reviews :)

Comment on lines 406 to 404
//check if the size is usize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure I understand the meaning of this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is no longer needed, so I've removed it.

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
self.values.slice(offset, length)
}

/// Returns ith value of this list view array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should get a # Panics note?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbrobbel Thank you for your review.

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
Kikkon and others added 21 commits September 29, 2024 21:07
@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

I took a quick skim of this PR and the code, tests and documentation are looking quite nice I think -- thank you @Kikkon @westonpace and @mbrobbel

🚀

@alamb alamb merged commit fc05eca into apache:master Oct 1, 2024
26 checks passed
@Kikkon Kikkon deleted the list_view2 branch October 12, 2024 09:43
@Kikkon Kikkon restored the list_view2 branch October 12, 2024 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants