Fix read_bgen performance issues #9

eric-czech · 2020-08-18T13:45:45Z

This wasn't happening before updating to bgen-reader 4.0.5, but now I can no longer read files without memory usage growing, seemingly without bound. Memory usage when running bgen-reader directly is very low; see limix/bgen-reader-py#30 (comment). I'm reading the same file in both cases, so I think something in @horta's changes makes what we're doing in this repo problematic now.

eric-czech · 2020-08-18T15:31:02Z

Ahh, it looks like this is actually a mistake I made in chunking. I didn't know this, but if you rechunk an array defined with da.from_array, it still (apparently) reads the array chunks at their original shape first. That is a problem here because the reader pulls all the samples into memory before slicing them off. Memory usage is fine if the target chunks are passed directly to da.from_array instead.
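For anyone who wants to check the chunking behavior described above, here is a minimal sketch. It assumes dask and numpy are installed. `FakeReader` and `read_shapes` are hypothetical helpers written for illustration and are not part of bgen-reader or this repo. The reader logs the shape of every block it is asked for, so the two ways of building the array can be compared.

```python
import dask.array as da
import numpy as np


class FakeReader:
    """Hypothetical stand-in for a genotype reader; logs each block it serves."""

    def __init__(self, n_variants, n_samples):
        self.shape = (n_variants, n_samples)
        self.ndim = 2
        self.dtype = np.dtype("float64")
        self._data = np.arange(n_variants * n_samples, dtype=self.dtype).reshape(self.shape)
        self.reads = []  # shape of every block requested from the reader

    def __getitem__(self, idx):
        block = self._data[idx]
        self.reads.append(block.shape)
        return block


def read_shapes(build):
    """Build a dask array over a fresh reader, compute it, and report what was read."""
    reader = FakeReader(n_variants=8, n_samples=100)
    arr = build(reader)
    reader.reads.clear()  # ignore any probing reads made while constructing the array
    result = arr.compute(scheduler="synchronous")
    return result, set(reader.reads)


# Chunk across all samples first, then rechunk to narrower sample blocks.
# name=False skips hashing the reader object to derive a task name.
rechunked, rechunked_reads = read_shapes(
    lambda r: da.from_array(r, chunks=(4, 100), name=False).rechunk((4, 10))
)

# Pass the target chunks straight to from_array.
direct, direct_reads = read_shapes(
    lambda r: da.from_array(r, chunks=(4, 10), name=False)
)

print("rechunked reads:", rechunked_reads)
print("direct reads:   ", direct_reads)
```

Both arrays hold the same values. The difference to look for is in the logged shapes: in the direct case the reader is only ever asked for 10 samples at a time, whereas the rechunked case is expected to show the full-width reads described above.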

Now I just need to figure out why a read that takes ~20 mins with bgen-reader takes ~2 hrs with our wrapper around it.

eric-czech · 2020-08-20T13:55:01Z

Note: #12 improves memory usage, but I'll close this once we can avoid some of the overhead of individual variant reads, as mentioned in limix/bgen-reader-py#30.

eric-czech changed the title ~~Fix read_bgen memory leak~~ Fix read_bgen performance issues Aug 18, 2020

eric-czech mentioned this issue Aug 19, 2020

Preallocate result arrays and apply earlier slicing #12

Merged
