Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering items in list inside ListArray #6846

Open
rluvaton opened this issue Dec 6, 2024 · 15 comments
Open

Filtering items in list inside ListArray #6846

rluvaton opened this issue Dec 6, 2024 · 15 comments
Labels
question Further information is requested

Comments

@rluvaton
Copy link
Contributor

rluvaton commented Dec 6, 2024

Which part is this question about
library api

Describe your question
I have a ListArray and I want to filter each item in the list

┌─────────────┐    ->    ┌─────────────┐
│   [A,B,C]   │    ->    │    [A,C]    │
├─────────────┤    ->    ├─────────────┤
│     []      │    ->    │     []      │
├─────────────┤    ->    ├─────────────┤
│    NULL     │    ->    │    NULL     │
├─────────────┤    ->    ├─────────────┤
│     [D]     │    ->    │     []      │
├─────────────┤    ->    ├─────────────┤
│  [NULL, F]  │    ->    │  [NULL, F]  │
└─────────────┘    ->    └─────────────┘

But I couldn't find a way to do it

I tried this:

#[cfg(test)]
mod tests {
    use arrow::array::{Array, BooleanArray, ListArray};
    use arrow::datatypes::Int32Type;
    use arrow::error::ArrowError;
    use std::ops::Deref;

    #[test]
    fn list_array() -> Result<(), ArrowError> {
        // Create list
        let list_array = create_list_for_test();

        // go over each list in the array and filter items in it
        let new_list: ListArray = list_array
            .iter()
            .map(|x| {
                match x {
                    None => Ok(None),
                    Some(x) => {
                        let some_predicate = BooleanArray::from((0..x.len()).map(|i| i % 2 == 0).collect::<Vec<_>>());
                        let val = arrow::compute::filter(x.deref(), &some_predicate)?;

                        Ok(Some(val))
                    }
                }
            })
            .collect::<Result<_, _>>()?;

        println!("{:?}", new_list);

        Ok(())
    }

    // This can be list of list of int32 or any other inner list type
    fn create_list_for_test() -> ListArray {
         let data = vec![
            Some(vec![Some(0), Some(1), Some(2)]),
            None,
            Some(vec![Some(3), None, Some(5)]),
            Some(vec![Some(6), Some(7)]),
         ];
         let list_array = ListArray::from_iter_primitive::<Int32Type, _, _>(data);

        list_array
    }
}

But got:

error[E0277]: a value of type `GenericListArray<i32>` cannot be built from an iterator over elements of type `Option<Arc<dyn arrow::array::Array>>`
    --> arrow-pg/src/lib.rs:31:24
     |
31   |             .collect::<Result<_, _>>()?;
     |              -------   ^^^^^^^^^^^^ value of type `GenericListArray<i32>` cannot be built from `std::iter::Iterator<Item=Option<Arc<dyn arrow::array::Array>>>`
     |              |
     |              required by a bound introduced by this call
     |
     = help: the trait `FromIterator<Option<Arc<dyn arrow::array::Array>>>` is not implemented for `GenericListArray<i32>`, which is required by `Result<_, _>: FromIterator<Result<Option<Arc<dyn arrow::array::Array>>, _>>`
     = help: the trait `FromIterator<Result<A, E>>` is implemented for `Result<V, E>`
     = note: required for `Result<GenericListArray<i32>, _>` to implement `FromIterator<Result<Option<Arc<dyn arrow::array::Array>>, _>>`
note: required by a bound in `collect`
    --> /Users/something/.rustup/toolchains/1.80-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2001:19
     |
2001 |     fn collect<B: FromIterator<Self::Item>>(self) -> B
     |                   ^^^^^^^^^^^^^^^^^^^^^^^^ required by this bound in `Iterator::collect`
@rluvaton rluvaton added the question Further information is requested label Dec 6, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 6, 2024

what is your filter logic? If you want a customize filter logic like x % 2 == 0, I think it should be an UDF in datafusion instead of arrow-rs. Btw, we don't have array_filter support in datafusion yet

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 6, 2024

I'm using DataFusion and actually creating array filter Physical Plan

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 6, 2024

But It still should possible to create array list from iterator, no?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 7, 2024

Oh, I see the question you have. I think we can't build the ListArray from ArrayRef. I think the easiest way is to create the list like this

        let data = vec![
            Some(vec![Some(0), Some(1), Some(2)]),
            Some(vec![Some(3), Some(4), Some(5)]),
            Some(vec![Some(6), Some(7)]),
        ];
        let list_array = ListArray::from_iter_primitive::<Int32Type, _, _>(data);

I agree we could make this more easy to use

    let arrays = list_array.iter().map(|x| {
        if let Some(x) = x {
            let some_predicate =
            BooleanArray::from((0..x.len()).map(|i| i % 2 == 0).collect::<Vec<_>>());
            let val = arrow::compute::filter(x.deref(), &some_predicate)?;
            let i32_val = val.as_primitive::<Int32Type>();
            let val = i32_val.values().iter().map(|x| Some(*x)).collect::<Vec<_>>();
            Ok(Some(val))
        } else {
            Ok(None)
        }
    }).collect::<Result<Vec<_>, ArrowError>>()?;

    let new_list = ListArray::from_iter_primitive::<Int32Type, _, _>(arrays);

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 7, 2024

It can also not be primitive, the function is just an example

As you see from my comment

This can be list of list of int32 or any other inner list type

@jayzhan211
Copy link
Contributor

I think we can create utils function to build the list easily (and ideally efficient)

@tustvold
Copy link
Contributor

tustvold commented Dec 8, 2024

If the types are known statically one can use the builder APIs

https://docs.rs/arrow-array/latest/arrow_array/builder/index.html#nested-usage

If the types are not known statically, it gets much more complicated and will require evaluating the filter predicate on the List::values and then using this in combination with the selection kernels to construct a new ListArray.

As a ListArray isn't stored as a list of arrays, there isn't an efficient way to "collect" an iterator of arrays

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 9, 2024

Doing any operation on on list of lists is a pain

If I have list of list of i32 using the builder GenericListBuilder<i32, GenericListBuilder<i32, Int32Builder>> I can't use append_value on the top list to add another list:

    #[test]
    fn should_run() {
        let from: Arc<ListArray> = create_test_list();
        let mut to = ListBuilder::new(
            ListBuilder::new(
                Int32Builder::new()
            )
        );
        let indices: &[usize] = &[0, 1, 2];
        let data_type = DataType::List(
            Arc::new(Field::new(
                "item",
                DataType::List(
                    Arc::new(Field::new(
                        "item",
                        DataType::Int32,
                        false
                    ))
                ),
                false
            ))
        );

        for &i in indices {
            if from.is_valid(i) {
                let inner_list = from.value(i).as_any().downcast_ref::<GenericListArray<i32>>().unwrap();
                to.append_value(inner_list); // <- THIS WILL FAIL TO COMPILE
            } else {
                to.append_null();
            }
        }
    }

    fn create_test_list() -> Arc<ListArray> {
        let primitive_builder = Int32Builder::with_capacity(10);
        let values_builder = ListBuilder::new(primitive_builder);
        let mut builder = ListBuilder::new(values_builder);

        //  [[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], null, [[9, 10]]]
        builder.values().values().append_value(1);
        builder.values().values().append_value(2);
        builder.values().append(true);
        builder.values().values().append_value(3);
        builder.values().values().append_value(4);
        builder.values().append(true);
        builder.append(true);

        builder.values().values().append_value(5);
        builder.values().values().append_value(6);
        builder.values().values().append_value(7);
        builder.values().append(true);
        builder.values().append(false);
        builder.values().values().append_value(8);
        builder.values().append(true);
        builder.append(true);

        builder.append(false);

        builder.values().values().append_value(9);
        builder.values().values().append_value(10);
        builder.values().append(true);
        builder.append(true);

        Arc::new(builder.finish())
    }

It's really a pain

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 9, 2024

I see GreptimeDB implement their own ListVector based on GenericListBuilder that allow them to append list items when moved to use this repo instead of arrow2
https://github.com/GreptimeTeam/greptimedb/blob/ce86ba3425924b83668dccdd6d9811e32b2e3933/src/datatypes/src/vectors/list.rs#L212-L255

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2024

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 9, 2024

Same problem, If the types are not known statically, it's really complicated while it shouldn't be from API interface perspective

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2024

PRs are always welcome if you have ideas on how to improve things

@rluvaton
Copy link
Contributor Author

rluvaton commented Dec 9, 2024

already ahead of you :)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 10, 2024

I see GreptimeDB implement their own ListVector based on GenericListBuilder that allow them to append list items when moved to use this repo instead of arrow2 https://github.com/GreptimeTeam/greptimedb/blob/ce86ba3425924b83668dccdd6d9811e32b2e3933/src/datatypes/src/vectors/list.rs#L212-L255

I agree, complex nested type (list of list, list of struct, struct of struct etc..) is not well supported and easy to use for now. Welcome for the contribution!

@rluvaton
Copy link
Contributor Author

Can you guys take a look at #6863 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants