Matrix support ? #691
Comments
(1) For this sort of thing, you should use an extension type, as that is the standard way to "layer" new semantics and type metadata onto an existing Arrow type. (2) Does the "fixed shape tensor" extension type suit your needs?
Tensors had come up in a past life, albeit with the caveat that what was out there was 'early and raw'. But thanks for sending me on my way here. I will take a good look at extension types. No issue here, so closing. Thanks again!
Thanks for the re-open, Dewey. After spitballing the two references by David it seems we could do something here in nanoarrow around its interfaces. Thoughts?
Reopening because matrix support is a great idea! Unfortunately, the column-majorness of matrices in R means that some of the ways we can deal with this are not zero copy, but I think we should still track this even if we don't get to it right away. For a matrix column in an Array, I think the natural mapping would be:

mat <- matrix(1:6, ncol = 2, byrow = TRUE)
infer_nanoarrow_type(mat) # would be na_fixed_size_list(na_int32())
array <- as_nanoarrow_array(mat) # would be logically [[1, 2], [3, 4], [5, 6]]
# We already have a mapping for list-like things, so the default would still come back as a list()
convert_array(array) # -> list(1:2, 3:4, 5:6)
# ...but we can request a type from convert_array(), so we can implement the conversion
convert_array(array, matrix(integer())) # -> matrix(1:6, ncol = 2, byrow = TRUE)

The "fixed shape tensor" extension would perhaps map to an
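To make the column-major point above concrete, here is a minimal base R sketch (no nanoarrow calls; the variable names are only illustrative). It shows why presenting a matrix as a row-wise fixed-size list generally needs a reordering copy: the values R already holds in memory run down the columns, not across the rows.

```r
# A 3 x 2 integer matrix filled row by row
mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# R stores matrices column-major, so the underlying vector walks down columns
column_major <- as.vector(mat)

# A row-wise fixed-size list needs the values in row-major order;
# t() materializes a transposed copy, which is where zero copy is lost
row_major <- as.vector(t(mat))

# The logical list elements [[1, 2], [3, 4], [5, 6]], one per matrix row
rows <- lapply(seq_len(nrow(mat)), function(i) mat[i, ])

print(column_major) # 1 3 5 2 4 6
print(row_major)    # 1 2 3 4 5 6
```

The child values of a fixed-size list are laid out like `row_major`, while R holds `column_major`, so some reordering has to happen on the way in or out.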
At first glance, that seems expensive. My (initial) thinking was that i) vectors are great, and (nano)arrow arrays are vectors (in the simple instance of a contiguous array) and ii) most linear algebra consumers take such a vector and merrily form a matrix from it, often zero-copy. So from But you all know the wider arrow landscape better than I do, so please shoot holes into this at your earliest convenience...
Ah! For that, we definitely do want the extension type. What we have there would be:

mat_list <- list(matrix(1:6, ncol = 2, byrow = TRUE))
as_nanoarrow_array(mat_list, na_fixed_size_tensor(na_int32(), shape = c(2L, 3L), permutation = c(1L, 0L)))

...which would be zero copy (limited to a "single element", which is, of course, a lot of elements; they're just in the matrix part). That's impossible enough to type that maybe we just want a function for that.
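The `shape = c(2L, 3L)` plus `permutation = c(1L, 0L)` combination above can be checked in base R. This is a sketch under one assumption: that the tensor buffer is read in row-major order according to its physical shape, and the permutation then swaps the axes. The variable names are hypothetical and nothing here calls nanoarrow.

```r
# The 3 x 2 matrix from the comment above
mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# R's own column-major buffer, taken as-is (no reordering)
buffer <- as.vector(mat)

# Physical shape is the reversed R dims: c(2L, 3L)
physical_shape <- rev(dim(mat))

# Read the untouched buffer as a row-major 2 x 3 tensor
physical <- matrix(
  buffer,
  nrow = physical_shape[1],
  ncol = physical_shape[2],
  byrow = TRUE
)

# Permutation (1, 0) in zero-based terms swaps the two axes;
# aperm() uses one-based indices, hence c(2, 1)
logical_view <- aperm(physical, c(2, 1))

identical(logical_view, mat) # TRUE
```

Because the buffer is reused unchanged and only the shape and permutation metadata differ, this is the sense in which the tensor mapping can avoid the copy that the row-wise list mapping needs.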
Still needs some testing on the stream case, and is unfortunately not very zero copy; however, gets the job done (and I think fixes some cases where we would have otherwise silently handled a matrix as the storage type). Inspired by #691!

``` r
library(nanoarrow)

df <- data.frame(x = 1:10)
df$matrix_col <- matrix(letters[1:20], ncol = 2, byrow = TRUE)

array <- as_nanoarrow_array(df)

# Default comes back as list_of(character())
convert_array(array) |>
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>        x  matrix_col
#>    <int> <list<chr>>
#>  1     1         [2]
#>  2     2         [2]
#>  3     3         [2]
#>  4     4         [2]
#>  5     5         [2]
#>  6     6         [2]
#>  7     7         [2]
#>  8     8         [2]
#>  9     9         [2]
#> 10    10         [2]

# But can specify matrix
convert_array(
  array,
  tibble::tibble(x = integer(), matrix_col = matrix(character(), ncol = 2))
)
#> # A tibble: 10 × 2
#>        x matrix_col[,1] [,2]
#>    <int> <chr>          <chr>
#>  1     1 a              b
#>  2     2 c              d
#>  3     3 e              f
#>  4     4 g              h
#>  5     5 i              j
#>  6     6 k              l
#>  7     7 m              n
#>  8     8 o              p
#>  9     9 q              r
#> 10    10 s              t
```

<sup>Created on 2024-12-12 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>
I am contemplating doing something where I may need / want (dense) matrix support, and I have been thinking that an approach similar to what R does (i.e. have it be a vector that happens to have a dimension attribute) may be an option by sticking a dimension vector into the `void* private_data` slot. This seems both 'hackish' yet somewhat obvious. Or am I missing existing prior work elsewhere? Has there really not been any other work where matrices are being passed around via the arrow interface? Thanks in advance for any pointers, or even just a blank 'you are nuts and here is why ...'.
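For readers less familiar with the R behaviour referred to here, a small base R sketch of "a vector that happens to have a dimension attribute" (variable names are illustrative only):

```r
# A plain contiguous integer vector
values <- 1:6

# Attaching a dim attribute is all it takes for R to treat it as a matrix
dense <- values
dim(dense) <- c(3L, 2L)

is.matrix(dense)  # TRUE
attributes(dense) # only $dim, holding 3 2

# Dropping the attribute gives back the same flat values, in the same order
flat <- as.vector(dense)
identical(flat, values) # TRUE
```

The element order never changes; only the metadata decides whether the values are read as a vector or as a 3 x 2 column-major matrix.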