Binary / row helpers #6096

Merged · 8 commits · Oct 8, 2024
Changes from 2 commits
62 changes: 60 additions & 2 deletions arrow-row/src/lib.rs
@@ -132,7 +132,7 @@ use std::sync::Arc;
use arrow_array::cast::*;
use arrow_array::types::ArrowDictionaryKeyType;
use arrow_array::*;
-use arrow_buffer::ArrowNativeType;
+use arrow_buffer::{ArrowNativeType, OffsetBuffer, ScalarBuffer};
use arrow_data::ArrayDataBuilder;
use arrow_schema::*;
use variable::{decode_binary_view, decode_string_view};
@@ -738,6 +738,23 @@ impl RowConverter {
}
}

/// Create a new [Rows] instance from the given binary data.
pub fn from_binary(&self, array: BinaryArray) -> Rows {
assert_eq!(
array.null_count(),
0,
"can't construct Rows instance from array with nulls"
);
Rows {
buffer: array.values().to_vec(),
offsets: array.offsets().into_iter().map(|&i| i.as_usize()).collect(),
config: RowConfig {
fields: Arc::clone(&self.fields),
A reviewer (Member) commented:
More for my curiosity than anything but why Arc::clone(&self.fields) instead of self.fields.clone()?

The author (Contributor) replied:
Some people prefer this form because it makes it more explicit that we're just incrementing an arc and not cloning the underlying data. See the clippy lint docs for more: https://rust-lang.github.io/rust-clippy/master/index.html#/clone_on_ref_ptr

I've gotten used to this style, though I do not personally care deeply about it! This codebase seems to use a mix of both.

validate_utf8: true,
},
}
}

/// Convert raw bytes into [`ArrayRef`]
///
/// # Safety
@@ -868,6 +885,24 @@ impl Rows {
+ self.buffer.len()
+ self.offsets.len() * std::mem::size_of::<usize>()
}

/// Create a [BinaryArray] from the [Rows] data without reallocating the
/// underlying bytes.
pub fn into_binary(self) -> BinaryArray {
assert!(
self.buffer.len() <= i32::MAX as usize,
"rows buffer too large"
);
let offsets_scalar = ScalarBuffer::from_iter(self.offsets.into_iter().map(i32::usize_as));
// SAFETY: offsets buffer is nonempty, monotonically increasing, and all represent valid indexes into buffer.
unsafe {
BinaryArray::new_unchecked(
OffsetBuffer::new_unchecked(offsets_scalar),
self.buffer.into(),
None,
)
}
}
}

impl<'a> IntoIterator for &'a Rows {
@@ -948,6 +983,11 @@ impl<'a> Row<'a> {
config: self.config.clone(),
}
}

/// The row's bytes, with the lifetime of the underlying data.
pub fn data(&self) -> &'a [u8] {
self.data
}
}

// Manually derive these as don't wish to include `fields`
@@ -2271,7 +2311,7 @@ mod tests {

let comparator = LexicographicalComparator::try_new(&sort_columns).unwrap();

-let columns = options
+let columns: Vec<SortField> = options
.into_iter()
.zip(&arrays)
.map(|(o, a)| SortField::new_with_options(a.data_type().clone(), o))
@@ -2304,6 +2344,24 @@ mod tests {
actual.to_data().validate_full().unwrap();
dictionary_eq(actual, expected)
}

// Check that we can convert the rows to binary and back
let rows = rows.into_binary();
let parser = converter.parser();
let back = converter
.convert_rows(rows.iter().map(|b| parser.parse(b.expect("valid bytes"))))
.unwrap();
for (actual, expected) in back.iter().zip(&arrays) {
actual.to_data().validate_full().unwrap();
dictionary_eq(actual, expected)
}

let rows = converter.from_binary(rows);
let back = converter.convert_rows(&rows).unwrap();
for (actual, expected) in back.iter().zip(&arrays) {
actual.to_data().validate_full().unwrap();
dictionary_eq(actual, expected)
}
}
}
