Matrix support ? #691

Open
eddelbuettel opened this issue Dec 12, 2024 · 6 comments

Comments

@eddelbuettel
Contributor

I am contemplating doing something where I may need / want (dense) matrix support, and I have been thinking that an approach similar to what R does (i.e. have it be a vector that happens to have a dimension attribute) may be an option by sticking a dimension vector into the void* private_data slot. This seems both 'hackish' yet somewhat obvious.

Or am I missing existing prior work elsewhere? Has there really not been any other work where matrices are being passed around via the arrow interface? Thanks in advance for any pointers, or even just a blank 'you are nuts and here is why ...'.
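For readers less familiar with R, here is a minimal base-R sketch of the "vector that happens to have a dimension attribute" idea. It uses no nanoarrow functions, and the variable names are illustrative only.

```r
# A matrix in base R is an atomic vector carrying a `dim` attribute.
vec <- 1:6
mat <- vec
dim(mat) <- c(3L, 2L)   # now a 3 x 2 matrix, filled column by column

is.matrix(mat)                   # the vector is now treated as a matrix
identical(as.vector(mat), vec)   # dropping `dim` recovers the original vector
attributes(mat)                  # the only attribute is `dim`
```

The point of the sketch is that the data buffer itself never changes; only the two-integer `dim` attribute distinguishes the matrix from the plain vector.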

@lidavidm
Member

(1) For this sort of thing, you should use an extension type, as that is the standard way to "layer" new semantics and type metadata onto an existing Arrow type. (2) Does the "fixed shape tensor" extension type suit your needs?

@eddelbuettel
Contributor Author

Tensors had come up in a past life, albeit with the caveat that what was out there was 'early and raw'.

But thanks for sending me on my way here. I will take a good look at extension types. No issue here, so closing. Thanks again!

@eddelbuettel eddelbuettel closed this as not planned Dec 12, 2024
@paleolimbot paleolimbot reopened this Dec 12, 2024
@eddelbuettel
Contributor Author

Thanks for the re-open, Dewey. After spitballing the two references by David it seems we could do something here in nanoarrow around its interfaces. Thoughts?

@paleolimbot
Member

Reopening because matrix support is a great idea! Unfortunately, because matrices in R are stored column-major, some of the ways we could handle this would not be zero copy, but I think we should still track this even if we don't get to it right away.

For a matrix column in an Array, I think the natural mapping would be a fixed_size_list() (i.e., each row is a fixed size list of whatever matrix type we're dealing with). Maybe:

mat <- matrix(1:6, ncol = 2, byrow = TRUE)
infer_nanoarrow_schema(mat)  # would be na_fixed_size_list(na_int32(), list_size = 2)
array <- as_nanoarrow_array(mat)  # would be logically [[1, 2], [3, 4], [5, 6]]

# We already have a mapping for list-like things, so the default would still come back as a list()
convert_array(array)  # -> list(1:2, 3:4, 5:6)

# ...but we can request a type from convert_array(), so we can implement the conversion
convert_array(array, matrix(integer()))  # -> matrix(1:6, ncol = 2, byrow = TRUE)

The "fixed shape tensor" extension would perhaps map to an array(), but unfortunately an array() with two dimensions is indistinguishable from a matrix() in recent versions of R. It is also less common to have an array column, so perhaps we could punt on that for now.
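To make the column-major concern concrete, here is a small base-R sketch (no nanoarrow calls; variable names are illustrative) comparing how R stores the example matrix with the row-by-row order that a fixed-size-list layout of `[[1, 2], [3, 4], [5, 6]]` would need:

```r
mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# R stores the matrix column by column.
storage <- as.vector(mat)

# A fixed-size-list child holding [[1, 2], [3, 4], [5, 6]] needs the values
# row by row, which requires a transpose (and therefore a copy).
row_wise <- as.vector(t(mat))

# Splitting by row gives the list-like view of the same data.
rows <- lapply(seq_len(nrow(mat)), function(i) mat[i, ])

print(storage)   # column-major order
print(row_wise)  # row-major order
```

Because `storage` and `row_wise` differ, the matrix's existing buffer cannot be handed over as-is for this layout.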

@eddelbuettel
Contributor Author

At first glance, that seems expensive. My (initial) thinking was that i) vectors are great, and (nano)arrow arrays are vectors (in the simple instance of a contiguous array) and ii) most linear algebra consumers take such a vector and merrily form a matrix from it, often zero-copy. So from mat <- matrix(1:6, ncol = 2, byrow = TRUE) I really want the vec <- as.vector(mat), and I am wondering if there is a good way to 'also' attach the dim <- c(3, 2) part in a way that makes sense. Metadata may be a way that is simpler and more robust than a schema-in-private-data, my initial idea.

But you all know the wider arrow landscape better than I do, so please shoot holes into this at your earliest convenience...

@paleolimbot
Member

Ah! For that, we definitely do want the extension type. What we have there would be:

mat_list <- list(matrix(1:6, ncol = 2, byrow = TRUE))
as_nanoarrow_array(mat_list, schema = na_fixed_size_tensor(na_int32(), shape = c(2L, 3L), permutation = c(1L, 0L)))

...which would be zero copy (limited to a "single element", which of course contains a lot of values; they all just live in the matrix part). That's impossible enough to type that maybe we just want a function for that.
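A base-R sketch of why a shape of (2, 3) with a swapped dimension order can describe the 3 x 2 matrix without moving data. This assumes the tensor's values are laid out row-major in the physical shape and that the permutation swaps the two dimensions; it does not call any nanoarrow function, and R's `aperm()` is 1-based, so `c(2L, 1L)` plays the role of the zero-based `c(1L, 0L)` above.

```r
mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# The buffer R already holds for the matrix (column-major order).
buffer <- as.vector(mat)

# Read that same buffer as a row-major tensor of physical shape (2, 3).
physical <- matrix(buffer, nrow = 2, ncol = 3, byrow = TRUE)

# Swapping the two dimensions yields the logical 3 x 2 matrix again.
logical_view <- aperm(physical, c(2L, 1L))

identical(logical_view, mat)
```

In other words, the column-major buffer of a 3 x 2 matrix is the same sequence of values as the row-major buffer of its 2 x 3 transpose, which is what lets the shape and permutation describe the data in place.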

paleolimbot added a commit that referenced this issue Dec 17, 2024
Still needs some testing on the stream case, and is unfortunately not
very zero copy; however, gets the job done (and I think fixes some cases
where we would have otherwise silently handled a matrix as the storage
type).

Inspired by #691!

``` r
library(nanoarrow)

df <- data.frame(x = 1:10)
df$matrix_col <- matrix(letters[1:20], ncol = 2, byrow = TRUE)

array <- as_nanoarrow_array(df)

# Default comes back as list_of(character())
convert_array(array) |> tibble::as_tibble()
#> # A tibble: 10 × 2
#>        x  matrix_col
#>    <int> <list<chr>>
#>  1     1         [2]
#>  2     2         [2]
#>  3     3         [2]
#>  4     4         [2]
#>  5     5         [2]
#>  6     6         [2]
#>  7     7         [2]
#>  8     8         [2]
#>  9     9         [2]
#> 10    10         [2]

# But can specify matrix
convert_array(
  array,
  tibble::tibble(x = integer(), matrix_col = matrix(character(), ncol = 2))
)
#> # A tibble: 10 × 2
#>        x matrix_col[,1] [,2] 
#>    <int> <chr>          <chr>
#>  1     1 a              b    
#>  2     2 c              d    
#>  3     3 e              f    
#>  4     4 g              h    
#>  5     5 i              j    
#>  6     6 k              l    
#>  7     7 m              n    
#>  8     8 o              p    
#>  9     9 q              r    
#> 10    10 s              t
```

<sup>Created on 2024-12-12 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>