Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory issue with subassignment #36

Closed
mikejiang opened this issue Oct 1, 2018 · 6 comments
Closed

Memory issue with subassignment #36

mikejiang opened this issue Oct 1, 2018 · 6 comments

Comments

@mikejiang
Copy link

The new feature introduced by #32 seems to have memory leak issue, as the example shown below the DelayedArray object gets bigger after each iteration of subassignment

library(HDF5Array)

M <- matrix(1:100, nrow=10)
tmp <- tempfile()
h5write(M, tmp, "data")
hm <- HDF5Array(tmp, "data")
object.size(hm)
#> 2144 bytes
set.seed(123)
for(k in 1:10)
{
  i <- sample(10,10)
  j <- sample(10,10)
  hm[i, j][hm[i, j] >0] <- 0
}
  
object.size(hm)
#> 10593240 bytes

Created on 2018-10-01 by the reprex package (v0.2.1)

@PeteHaitch
Copy link
Collaborator

Take a look at showtree(hm) - you build up a large and complicated DelayedArray using this code.
Is this something you actually want to do in practice?

FWIW, I wouldn't call this a memory leak, rather I'd say this subassignment is something you want to avoid doing without realizing the result fairly soon afterward.
Each subassignment has to keep a bunch of stuff in-memory and I think it can't simplify this (although you can of course explicitly realize one of the intermediate results).

@hpages
Copy link
Contributor

hpages commented Oct 2, 2018

As Pete said, this is not memory leak, just growing an object. The more delayed operations a DelayedArray object is going to carry, the bigger the object is going to be in memory.

Delayed ops are represented by DelayedOp objects. There are currently 8 types of DelayedOp (see ?DelayedOp). In general these are small objects except for DelayedSubset and DelayedSubassign objects (which are DelayedOp derivatives for representing subsetting and subassignment, respectively). These objects need to store the index of the subsetting or the "left index" of the subassignment. This index is a list of NULLs or integer vectors, one per dimension in the object, so it can be big.

Furthermore, when various subsetting operations are applied to a DelayedArray object (consecutively like in M[1:2, ][ , -5], or separated by other operations like in log(t(M[1:2, ]))[-5, ]), they can be merged into a single delayed subsetting operation that summarizes them all. This simplification of the tree of delayed operations is performed by simplify(). But there is no such thing for subassignments. These operations will accumulate in the tree of delayed operations carried by the object and cannot in general be merged into a single subassignment operation. Although there are situations where they could be merged, simplify() doesn't detect those situations at the moment (and it's not clear that the cost of detecting them wouldn't outweigh the benefit of the simplification).

To summarize, all delayed operations have 2 costs: (1) the cost of representing them (i.e. the memory footprint of the DelayedOp objects that represent them), and (2) the cost of realizing them. (1) is paid immediately and (2) is paid later at realization time. (1) is generally small but can become significant if the DelayedArray object carries many delayed operations that cannot be simplified. For example, doing M <- 1 / (M^2 + 1) in a loop would also grow the tree of delayed ops carried by M and there would be no way to simplify that tree.

So the advice is: even though a DelayedArray object supports delayed operations, one needs to be careful to not accumulate too many operations on it. Especially if the operations are subassignments, which can be expensive to represent (if the "left index" is big) and cannot be simplified.

Hope this makes sense.

@mikejiang
Copy link
Author

Unfortunately this is the real use case where the matrix needs to be partially updated multiple time within the loop. After the entire process is finished, the memory usage of the DelayedArray object grows to 30G! (I've confirmed that the results are correct though).
I guess the solution will be realize it immediately after each iteration (at cost of computing time). Since I am using Bigmemory as the backend, I wonder if you can start to write a general documentation on how to implement realization. Or you can point me to the sources of HDF5Array that I need to replicate (I guess I need the minimum specifications in order for realize to work)

@PeteHaitch
Copy link
Collaborator

Does Bigmemory support in-place replacement of values? This might be better than delaying the subassignment, for this use case.

@hpages
Copy link
Contributor

hpages commented Oct 2, 2018

It would indeed. Because each realization will copy the entire data set!

Anyway FWIW I've started to put together some notes for how to implement a realization backend here: #37

For your use case, if Bigmemory supports in-place replacement of values and you feel like taking the BigmemRealizationSink route, then you might consider implementing an extract_array() and replace_array() methods for BigmemRealizationSink objects. This would allow you to do all those replacements in-place by using something like this:

sink <- BigmemRealizationSink(dim(M), dimnames(M), type(M))
DelayedArray:::BLOCK_write_to_sink(M, sink)
for (k in ...) {
    ## Make sure that 'i' and 'j' are "short subscripts" so that extract_array() returns
    ## a "small" matrix 'm' (i.e. a matrix that fits in memory) at each iteration.
    i <- ...
    j <- ... 
    m <- extract_array(sink, list(i, j))
    m[m > 0] <- 0
    replace_array(sink, list(i, j), m)
}
close(sink)
M2 <- as(sink, "DelayedArray")

Note that there is no replace_array() generic in DelayedArray so you could either come up with your own generic or just implement this as a regular function.

@hpages
Copy link
Contributor

hpages commented May 20, 2020

@mikejiang Can we close this one too?

@hpages hpages closed this as completed May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants