Feature request: bulk distributed access #30
Comments
@CarlKCarlK that's great to know! That looks very useful. I will dig in and try it out soon. |
That should be done by end of this month (August). |
@tomwhite I'm trying to figure out a nice way to provide bulk access. As you probably know, the probabilities matrix of genotype 0 may have a different number of columns than that of genotype 1, since the ploidy is allowed to vary. Not only that, but a genotype is (in bgen terms) defined by its probabilities as well as its phasedness, ploidy, and missingness.

In [6]: genotype[0].compute()
Out[6]:
{'probs': array([[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
...,
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([False, False, False, ..., False, False, False])}
In [7]: genotype[1].compute()
Out[7]:
{'probs': array([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
...,
[ 1., 0., 0.],
[ 1., 0., 0.],
[nan, nan, nan]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([ True, True, True, ..., False, False, True])}

Without trying to be fancy, a quick way to implement bulk access is to index genotype by its "bulk index" instead of its genotype index. Something like this:

In [6]: bulk[0].compute().genotype[0]
Out[6]:
{'probs': array([[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
...,
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([False, False, False, ..., False, False, False])}
In [7]: bulk[0].compute().genotype[1]
Out[7]:
{'probs': array([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
...,
[ 1., 0., 0.],
[ 1., 0., 0.],
[nan, nan, nan]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([ True, True, True, ..., False, False, True])}

Does that appeal to you? |
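For illustration, a minimal numpy sketch of one way variable-width probability matrices like the two above could be packed into a single bulk array by NaN-padding up to the largest number of probability columns; pack_bulk and the toy arrays are illustrative, not part of bgen-reader's API:

import numpy as np

def pack_bulk(prob_list):
    # Pack per-variant (nsamples, ncombinations_i) arrays into one
    # (nsamples, nvariants, max_ncombinations) array, padded with NaN.
    nsamples = prob_list[0].shape[0]
    max_ncomb = max(p.shape[1] for p in prob_list)
    bulk = np.full((nsamples, len(prob_list), max_ncomb), np.nan)
    for i, p in enumerate(prob_list):
        bulk[:, i, : p.shape[1]] = p
    return bulk

# A diploid variant (3 probability columns) next to a haploid one (2 columns).
bulk = pack_bulk([np.full((5, 3), 1 / 3), np.full((5, 2), 0.5)])
print(bulk.shape)  # (5, 2, 3)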
@CarlKCarlK pointed out that he already has bulk access implemented (in the numpy-like interface, accessed via open_bgen). I'm also interested in having that in the dask interface, via read_bgen. |
Just want to ensure @eric-czech is in this conversation too as he and Carl have been in discussion separately. |
Hey @CarlKCarlK, I tested this on a single UKB file and had some results to share. I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test. The file is only 4.6G and here are the times I got on a single core:
* PyBGEN: ~28 minutes (the .bgi file already exists)
* bgen_reader: I stopped it after an hour but it was on track to take ~4 hours
* bgen_reader2: ~34 minutes, not including the metadata construction time
I did a little more profiling on the new beta in bgen_reader2_prof.ipynb, though it only loads 10k of the 45,906 variants to speed things up. When reading those 10k variants after deleting the metafile, I see resource utilization like this:

[profiling plot: bgen_reader2_prof.png]

The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. That seems fairly marginal compared to the scan, and the CPU/memory usage profiles are otherwise very stable. Is this about what you would have expected? Our use case is to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?

A few smaller questions:
* Have you ever used tqdm (https://github.com/tqdm/tqdm)? Assuming the custom progress prints are coming from Python and not C++, that could make things easier. I noticed a small bug where each print seems to be off on the end interval, i.e.
metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906
Found 10000 variants
reading -- time=0:00:34.72, part 990 of 1,000 # --> never makes it to 1,000
reading -- time=0:00:37.07, part 990 of 1,000
* I was doing my scan by grabbing row slices in batches of 1,000 variants, but it looked like the code was actually pulling them in smaller batches of 10: each "part N" message incremented by 10 from 0 up to 1,000. Does it work that way regardless of the file size?
|
If the progress prints are coming from C++, https://github.com/p-ranav/indicators is an option. |
Eric,
Thanks! At the bottom of this note is the inner loop of the Python Bgen2 reader. In summary, it reads one variant at a time with these steps:
* Open that variant
* If the buffer can be reused (usually the case), do nothing, otherwise allocate a new buffer (for ‘complex’ files)
* Read the probability data
* Copy results into output array
* (If samples are also being selected, do that, too)
* Close the variant.
The Open, Read, and Close methods are Danilo Horta's (@horta) C/C++ code.
I haven’t profiled a run (we should), but I assume most of the time is being spent in lib.bgen_genotype_read.
* Carl
p.s. I fixed the logging off-by-one errors and added a TODO to look at tqdm (https://github.com/tqdm/tqdm).
From https://github.com/limix/bgen-reader-py/blob/bgen2memmap/bgen_reader/_bgen2.py
with _log_in_place("reading", self._verbose) as updater:
for out_index, vaddr0 in enumerate(vaddr):
if out_index % vaddr_per_second == 0:
updater("part {0:,} of {1:,}".format(out_index, len(vaddr)))
genotype = lib.bgen_file_open_genotype(self._bgen._bgen_file, vaddr0)
if return_probabilities:
if (
prob_buffer is None
or ncombinations[out_index] != prob_buffer.shape[-1]
):
prob_buffer = np.full( # LATER could offer an option to memmap this buffer
(self.nsamples, ncombinations[out_index]),
np.nan,
order="C",
dtype="float64",
)
lib.bgen_genotype_read(
genotype, ffi.cast("double *", prob_buffer.ctypes.data)
)
val[:, out_index, : ncombinations[out_index]] = (
prob_buffer
if (samples_index is self._sample_range)
else prob_buffer[samples_index, :]
)
if return_missings:
[…]
if return_ploidies:
[…]
lib.bgen_genotype_close(genotype)
|
Update:
Danilo has ideas on making it faster with C/C++ batch readers.
I did the profiling:
84% of the time was in the current read and 12% in the open; everything else was a small percentage.
* Carl
p.s. Notes to myself:
* python -m cProfile -o test.profile test_bgen2.py
* snakeviz test.profile
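For reference, the profile written by the cProfile command in the notes above can also be inspected without snakeviz, using the standard library's pstats; a small sketch (test.profile is the file produced by that command):

import pstats

# Show the 15 most expensive entries by cumulative time; the low-level
# read/open calls discussed above should dominate the listing.
stats = pstats.Stats("test.profile")
stats.sort_stats("cumulative").print_stats(15)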
|
Thanks for the benchmark, @eric-czech! I have found some bottlenecks in bgen-reader. Here is a small comparison now:

% python pybgen_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
0%| | 778/220000 [00:18<1:28:18, 41.37it/s]

% python bgen_reader_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
0%| | 483/220000 [00:10<1:22:19, 44.44it/s]

I stopped each run after roughly 10 seconds just to get an idea. I modified your pybgen script as follows:

import fire
import tqdm
from pybgen import PyBGEN

def run(path):
    ct = 0
    with PyBGEN(path) as bgen:
        n = bgen.nb_variants
        for info, dosage in tqdm.tqdm(bgen, total=n):
            assert dosage.ndim == 1
            ct += dosage.size
    print(f"Number of entries read: {ct}")

fire.Fire()

I will try to release a version 4.0.5 of bgen-reader tonight with that speed-up. It would be great if you could find time to run the benchmarks again on that new version (I will come back here as soon as I release it). It will also come with Carl's API, which is still somewhat experimental, but I think that is fine for now. Note that I have not used bulk reading (nor implemented it yet), although reading the variants in order using … |
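For comparison, a sketch of what the bgen_reader side of this benchmark might look like; the real bgen_reader_exec.py lives in Eric's repo linked earlier, and this version simply mirrors the pybgen script using the read_bgen dask interface shown at the top of the thread:

import fire
import tqdm
from bgen_reader import read_bgen

def run(path):
    ct = 0
    bgen = read_bgen(path)
    genotype = bgen["genotype"]  # one dask delayed object per variant
    for g in tqdm.tqdm(genotype, total=len(genotype)):
        probs = g.compute()["probs"]
        ct += probs.size
    print(f"Number of entries read: {ct}")

fire.Fire()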
@hammer Thanks for the pointer. I didn't know about that library. I coded the low-level bgen parser in C, and at the time I could not find a nice progress bar library for plain C, so I wrote my own: https://github.com/horta/almosthere |
bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/. It should be a lot faster now. I will be working on the bulk idea or on Carl's interface. |
Good to know, thanks @CarlKCarlK.
Beautiful, appreciate the quick turnaround @horta! I'll give that a try again soon. |
Definitely! Here are the times I got with the new code in context with the previous times:
Here is the resource utilization profile I saw for bgen_reader 4.0.5 too:

[resource utilization profile plot]

The low memory usage there is great! I know the graph seems to imply CPU usage was also 100%, but it's actually 100% out of 400% (4 cores on the VM), so it seems there wasn't any real benefit from the threading via dask delayed, assuming that's how it still works. I verified this by running a big random matrix multiplication, which produces a similar graph at 400% CPU utilization across the board. All the related code is in bgen_reader_prof.ipynb.

I'm assuming the GIL is the culprit with the default Dask scheduler, though I'm curious @horta whether you did anything special to see higher concurrency when you switched to Dask in the past. I could try scheduler='processes', but in my experience that usually ends up being pretty close to a wash with scheduler='threads' because of all the serialization for inter-process communication. Let me know if you have any advice.

EDIT: I tried it with the multiprocessing scheduler instead and it was very slow (on track for ~11 hrs), but that's not a fair comparison because I'm asking it to send all the data back to the driver process rather than doing some reduction over it. It would be great if Dask didn't have this inter-process communication bottleneck -- it keeps coming up! |
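For reference, the scheduler choice discussed above is just the standard Dask API; a tiny sketch, where tasks stands in for whatever delayed reads are being computed (nothing bgen-specific):

import dask

# Default threaded scheduler: cheap to use, but a pure-Python read loop is limited by the GIL.
results = dask.compute(*tasks, scheduler="threads")

# Process-based scheduler: sidesteps the GIL, but every result is pickled back to the
# driver process, which is the serialization cost mentioned above.
results = dask.compute(*tasks, scheduler="processes")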
Also, out of curiosity @horta, I was trying to find the commits related to your changes that made such a huge improvement in 4.0.5. Is there one you think would be good to look at to highlight the difference? I was just hoping to get a bit more familiar with how it all works internally. Thanks again for whatever you did! |
This one: b40fc36 (bgen_reader/_genotype.py, line 42, at eceda00). Every time that function is called … There are basically only two overheads that, if minimized, would make everything even faster: …
If you have a single core to use and want to process the whole bgen file as fast as possible, it would suffice to do something like: …
You can do the above very easily by using this package I released last week, which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py. I believe Carl's interface tries to work along those lines, but using memmap from numpy. |
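To make the single-core idea concrete, a minimal sketch of such a single-pass scan, reusing the low-level calls from Carl's _bgen2.py snippet earlier in the thread; lib/ffi are assumed to be those cffi bindings, and vaddrs/ncombinations are assumed to come from the metafile (this is the shape of the loop, not cbgen's public API):

import numpy as np

# `lib` and `ffi` are the cffi bindings used in bgen_reader's _bgen2.py (assumed in scope).
def scan(bgen_file, vaddrs, nsamples, ncombinations, process):
    # Open the bgen file once, then read every variant's probabilities in order.
    prob_buffer = None
    for i, vaddr in enumerate(vaddrs):
        genotype = lib.bgen_file_open_genotype(bgen_file, vaddr)
        # Reuse the buffer unless the number of probability columns changes.
        if prob_buffer is None or prob_buffer.shape[1] != ncombinations[i]:
            prob_buffer = np.full((nsamples, ncombinations[i]), np.nan, dtype="float64")
        lib.bgen_genotype_read(genotype, ffi.cast("double *", prob_buffer.ctypes.data))
        process(i, prob_buffer)  # e.g. write this variant's block to the output format
        lib.bgen_genotype_close(genotype)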
You will find examples of using … |
👍
Awesome! Ok, I will give that a spin -- thanks for the examples! |
Hey @horta, I tried using … As an experiment, I was going to try to lift the file open/close logic out of _bgen_file.py#L70-L71 into my own loop, but that's a bit out of my depth. I wasn't sure what the offset should be in:

def read_genotype(self, offset: int) -> Genotype:
    gt: CData = lib.bgen_file_open_genotype(self._bgen_file, offset)
    ...
    lib.bgen_genotype_close(gt)

Is there a fairly easy way to do that? |
Oh, and FYI, here is a resource profile I got when using cbgen to read a UKB contig:

[resource profile plot]

This is from a cbgen_reader.py wrapper I made to potentially replace the sgkit-bgen bgen_reader.py wrapper, and I thought the sawtooth memory usage was odd. I didn't see that earlier in #30 (comment) when reading the variants in order, but I'm not sure what is being "built up" in this case. The wrapper pre-allocates the result arrays, so the memory usage shouldn't be so incremental. Might there be anything in … |
Hey @eric-czech, the offset for each variant is the result of … There is no easy way to get all those offsets: as the bgen file specification stands, you don't know where the variant block starts for variant … That is one of the reasons index files exist for bgen files. |
Regarding memory, I don't know. Isn't that just garbage collection by Python? (You can try to use … ) |
Is it not possible to:
I have no idea how the "Advance the open file" part would work in cbgen -- is that more difficult than it sounds?
Mm, well, what I mean is that Dask asks us to fetch some block of adjacent variants (say 100 of them), and there are no guarantees on the order in which it will ask for the blocks, but they do typically go in order. I.e., for two workers, process/thread 1 fetches variants 0 to 100 while process/thread 2 fetches variants 100 to 200, then worker 1 does 200 to 300, etc. "Arbitrary blocks of some size" is probably what I should have called it rather than random.
Makes sense. It's probably something in Dask or our wrapper then -- I saw a similar pattern when making a Dask wrapper around pybgen too. |
I've created an issue to perform "advance the open file". I think it won't change anything: opening the variant genotypes in sequence should have the same effect as "advancing the open file". The important bit of the C library for that is https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145. I indeed perform a … |
Just to clarify: after you close an opened variant genotype … |
Ah ok, thanks @horta. The biggest killer when reading into a Dask array, then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size, say, 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant + chunk (so I'm reading the same variant data upwards of 10 times for UKB). Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this: …
Let me know if that doesn't make sense. |
Another way of tackling this would be to convert to an intermediate chunked format (like zarr). This would be done once, then analysis would be performed on the intermediate format, which would have efficient, chunked read patterns. Perhaps this could be something we recommend for large numbers of samples? This is the way I am doing it for VCF, which is the only real option as we don't have offset information for the n-th variant, so we need to convert to something like zarr anyway. |
That's basically what I'm after: a more efficient way to do that conversion. Are you talking about some process for it that doesn't use Dask? I'd like to think of this as a one-time conversion that we eat the cost of, but even now, when I'm using ridiculously wide chunks (~1,000 x ~100,000) to avoid some of the repetitive sample reads, it looks like it will cost roughly $300 for the vCPU+memory hours (storage aside). That isn't so bad, I suppose, but I'd like to get those chunks down to more like 10k x 10k, which would hypothetically cost ~5x as much if the write and read workloads are fairly balanced (IO/network bandwidth aside). |
No, it still uses Dask, but instead of using … I do something similar for VCF, but it should be simpler for BGEN, since you can read variant ranges that align with Zarr chunks, which means you can avoid the concat and rechunk operations that are needed for VCF. See https://github.com/tomwhite/sgkit-vcf/blob/master/sgkit_vcf/vcf_reader.py#L206-L221 |
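A rough sketch of what that could look like for BGEN, assuming Carl's open_bgen interface (with nsamples/nvariants properties and read(slice) returning a (samples, variants, combinations) array, as in his snippet earlier); the function name and the fixed ncombs=3 are illustrative assumptions, not sgkit-bgen's actual code:

import zarr
from bgen_reader import open_bgen

def bgen_to_zarr(bgen_path, zarr_path, variants_per_chunk=1000, ncombs=3):
    # ncombs=3 assumes unphased, diploid, biallelic variants; complex files vary.
    with open_bgen(bgen_path) as bgen:
        out = zarr.open(
            zarr_path,
            mode="w",
            shape=(bgen.nsamples, bgen.nvariants, ncombs),
            chunks=(bgen.nsamples, variants_per_chunk, ncombs),
            dtype="f8",
        )
        for start in range(0, bgen.nvariants, variants_per_chunk):
            stop = min(start + variants_per_chunk, bgen.nvariants)
            # Read one contiguous variant range and write it straight into the aligned Zarr chunk.
            out[:, start:stop, :] = bgen.read(slice(start, stop))
    return out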
Ahhh, I see what you mean. I would like to think there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on. I'll try a similar … |
Exactly. The genotype blocks in BGEN are compressed using zlib or zstd, and so there's no way to avoid decompressing the whole block to only pull out a subset of samples. |
If compressed, the whole structure of probabilities, ploidy, phasedness, and missingness for a single variant, which includes all samples, has to be decompressed first. So I think the answer to your question is no whenever the bgen file uses compression (which I believe will be the case most of the time). If it is not compressed, then it seems possible to read only the probabilities of the samples you are interested in (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html). |
Oops, you already figured out the answer. Sorry, I didn't read before answering! |
Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in sgkit-dev/sgkit-bgen#1. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46.) This is a good choice for random access, but may be prohibitive for bulk access.
Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?
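For what it's worth, a rough sketch of the kind of chunked interface this request describes: one delayed object per block of variants rather than one per variant, stitched into a Dask array. read_block is a hypothetical helper that returns a (samples, variants, combinations) numpy array for a contiguous variant range:

import dask
import dask.array as da

def bulk_genotype(path, nsamples, nvariants, ncombs, variants_per_chunk=1000):
    blocks = []
    for start in range(0, nvariants, variants_per_chunk):
        stop = min(start + variants_per_chunk, nvariants)
        # One delayed read per chunk of variants (read_block is hypothetical).
        block = dask.delayed(read_block)(path, start, stop)
        blocks.append(
            da.from_delayed(block, shape=(nsamples, stop - start, ncombs), dtype="f8")
        )
    # Concatenate along the variants axis into one (samples, variants, combinations) dask array.
    return da.concatenate(blocks, axis=1)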