
Feature request: bulk distributed access #30

Open
tomwhite opened this issue Jul 23, 2020 · 35 comments

Comments

@tomwhite

Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in sgkit-dev/sgkit-bgen#1. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46.) This is a good choice for random access, but may be prohibitive for bulk access.

Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?
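
For concreteness, here is a rough sketch of the kind of thing I mean, built on the per-variant delayed objects above (so not efficient, just illustrating the desired shape of the result); it assumes every variant has the same number of samples and probability columns, which won't hold when ploidy varies:

import dask
import dask.array as da
import numpy as np
from bgen_reader import read_bgen


def read_probs_chunk(path, start, stop):
    # Naive chunk reader built on the existing per-variant delayed objects:
    # compute each variant in the range and stack its probabilities.
    bgen = read_bgen(path)
    genotype = bgen["genotype"]
    return np.stack([genotype[i].compute()["probs"] for i in range(start, stop)])


def probs_as_dask_array(path, n_variants, n_samples, n_probs, chunk_size=1000):
    # One delayed task per chunk of variants rather than one per variant.
    blocks = []
    for start in range(0, n_variants, chunk_size):
        stop = min(start + chunk_size, n_variants)
        chunk = dask.delayed(read_probs_chunk)(path, start, stop)
        blocks.append(
            da.from_delayed(chunk, shape=(stop - start, n_samples, n_probs), dtype="float64")
        )
    return da.concatenate(blocks, axis=0)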

@CarlKCarlK
Collaborator

CarlKCarlK commented Jul 23, 2020 via email

@tomwhite
Author

@CarlKCarlK that's great to know! That looks very useful. I will dig in and try it out soon.

@horta
Collaborator

horta commented Aug 1, 2020

That should be done by the end of this month (August).

@horta
Collaborator

horta commented Aug 10, 2020

@tomwhite I'm trying to figure out a nice way to provide bulk access.

As you probably know, the probabilities matrix of genotype 0 may have a different number of columns than that of genotype 1, since the ploidy is allowed to vary. Not only that, but a genotype is (in bgen terms) defined by its probabilities as well as by phasedness, ploidy, and missingness.

In [6]: genotype[0].compute()
Out[6]:
{'probs': array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        ...,
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([False, False, False, ..., False, False, False])}

In [7]: genotype[1].compute()
Out[7]:
{'probs': array([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        ...,
        [ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [nan, nan, nan]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([ True,  True,  True, ..., False, False,  True])}

Without trying to be fancy, a quick way to implement bulk access is by indexing genotype by its "bulk index" instead of its genotype index. Something like this:

In [6]: bulk[0].compute().genotype[0]
Out[6]:
{'probs': array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        ...,
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([False, False, False, ..., False, False, False])}

In [7]: bulk[0].compute().genotype[1]
Out[7]:
{'probs': array([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        ...,
        [ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [nan, nan, nan]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([ True,  True,  True, ..., False, False,  True])}

Does that appeal to you?

@horta
Collaborator

horta commented Aug 10, 2020

@CarlKCarlK pointed out that he already has bulk access implemented (in the numpy-like interface, accessed via open_bgen). I'm also interested in having that in the Dask interface, via read_bgen.

@hammer

hammer commented Aug 10, 2020

Just want to ensure @eric-czech is in this conversation too as he and Carl have been in discussion separately.

@eric-czech

eric-czech commented Aug 10, 2020

Hey @CarlKCarlK, I tested this on a single UKB file and had some results to share.

I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test.

The file is only 4.6G and here are the times I got on a single core:

  • PyBGEN: ~28 minutes (the .bgi file already exists)
  • bgen_reader: I stopped it after an hour but it was on track to take ~4 hours
  • bgen_reader2: ~34 minutes not including the metadata construction time

I did a little more profiling on the new beta in bgen_reader2_prof.ipynb, though it only loads 10k of the 45,906 variants to speed things up. When reading those 10k variants after deleting the metafile, I see resource utilization like this:

[image: CPU and memory utilization profile for the 10k-variant read]

The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. It seems to take a fairly marginal amount of time compared to the scan, and the CPU/memory usage profiles are otherwise very stable.

Is this about what you would have expected? Our use case would be to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?

A few smaller questions:

  • Have you ever used tqdm? Assuming the custom progress prints are coming from Python and not C++, that could make things easier. I noticed a small bug with each print where the end interval seems to be off, i.e.
metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906
Found 10000 variants
reading -- time=0:00:34.72, part 990 of 1,000 # --> Never makes it to 1,000 
reading -- time=0:00:37.07, part 990 of 1,000
  • I was doing my scan by grabbing row slices in batches of 1000 variants but it looked like the code was actually pulling them in smaller batches of 10. Each "part N" message was incrementing by 10 from 0 up to 1000. Does it work that way regardless of the file size?

@hammer

hammer commented Aug 10, 2020

If the progress prints are coming from C++, https://github.com/p-ranav/indicators is an option.

@CarlKCarlK
Collaborator

CarlKCarlK commented Aug 10, 2020 via email

@CarlKCarlK
Collaborator

CarlKCarlK commented Aug 10, 2020 via email

@horta
Collaborator

horta commented Aug 10, 2020

Thanks for the benchmark, @eric-czech!

I have found some bottlenecks in bgen-reader. Here is a small comparison now:

% python pybgen_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
  0%|          | 778/220000 [00:18<1:28:18, 41.37it/s]
% python bgen_reader_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
  0%|          | 483/220000 [00:10<1:22:19, 44.44it/s]

I stopped each run after roughly 10 seconds, just to get an idea.

I modified your pybgen_exec.py only to use tqdm:

import fire
import tqdm
from pybgen import PyBGEN


def run(path):
    ct = 0
    with PyBGEN(path) as bgen:
        n = bgen.nb_variants
        for info, dosage in tqdm.tqdm(bgen, total=n):
            assert dosage.ndim == 1
            ct += dosage.size
    print(f"Number of entries read: {ct}")


fire.Fire()

I will try to release version 4.0.5 of bgen-reader with that speed-up tonight. It would be great if you could find time to run the benchmarks again on the new version (I will come back here as soon as I release it).

It will come with Carl's API as well, which will still be experimental, but I think that is fine for now.

Note that I have not used bulk reading (nor implemented it yet), although reading the variants in order using bgen_reader.read_bgen is faster than reading them in random order.

@horta
Collaborator

horta commented Aug 10, 2020

@hammer Thanks for the pointer. I didn't know about that library.

I wrote the low-level bgen code in C, and at the time I could not find a nice progress-bar library for plain C, so I wrote my own: https://github.com/horta/almosthere
It is very humble compared to the one you linked: ASCII only.

@CarlKCarlK
Collaborator

CarlKCarlK commented Aug 10, 2020 via email

@horta
Collaborator

horta commented Aug 11, 2020

bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/

It should be a lot faster now. I will be working on the bulk idea or on Carl's interface.

@eric-czech

The Open, Read, and Close methods are @Danilo Horta's C/C++ code.

Good to know, thanks @CarlKCarlK.

It would be great if you could find time to run the benchmarks again on the new version (I will come back here as soon as I release it).
bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/

Beautiful, appreciate the quick turnaround @horta! I'll give that a try again soon.

@eric-czech

eric-czech commented Aug 17, 2020

It should be a lot faster now. I will be working on the bulk idea or on Carl's interface.

Definitely! Here are the times I got with the new code in context with the previous times:

  • PyBGEN: ~28 minutes (the .bgi file already exists)
  • bgen_reader 4.0.4: I stopped it after an hour but it was on track to take ~4 hours
  • bgen_reader 4.0.5: ~21 minutes including metafile construction time
  • bgen_reader2 in pysnptools: ~34 minutes not including the metadata construction time

Here is the resource utilization profile I saw for bgen_reader 4.0.5 too:

[screenshot: resource utilization profile for bgen_reader 4.0.5]

The low memory usage there is great! I know that plot seems to imply CPU usage was also at 100%, but it's actually 100% out of 400% (4 cores on the VM), so it seems there wasn't any real benefit from the threading via dask delayed, assuming that's how it still works. I verified this by running a big random matrix multiplication, which produces a similar graph at 400% CPU utilization across the board. All the related code is in bgen_reader_prof.ipynb.

I'm assuming the GIL is the culprit with the default Dask scheduler, though I'm curious, @horta, whether you did anything special to get higher concurrency when you switched to Dask in the past. I could try scheduler='processes', but my experience is that it usually ends up being pretty close to a wash with scheduler='threads' because of all the serialization for inter-process communication. Let me know if you have any advice.

EDIT: I tried it with the multiprocessing scheduler instead and it was very slow (on track for ~11 hrs), but that's not entirely fair because I'm asking it to send all the data back to the driver process rather than computing some reduction over it. It would be great if Dask didn't have this inter-process communication bottleneck -- it keeps coming up!
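
For reference, a minimal, self-contained sketch of the scheduler comparison I mean, using a random Dask array as a stand-in for the genotype probabilities:

import dask.array as da

# Stand-in for the genotype probabilities array discussed above.
probs = da.random.random((1000, 10000, 3), chunks=(100, 10000, 3))

# Default threaded scheduler: limited by the GIL when the per-chunk work is
# mostly pure Python rather than NumPy/C code that releases the GIL.
mean_threads = probs.mean().compute(scheduler="threads")

# Multiprocessing scheduler: sidesteps the GIL but pays to pickle results back
# to the driver process, which is the "wash" described above.
mean_processes = probs.mean().compute(scheduler="processes")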

@eric-czech

Also, out of curiosity @horta, I was trying to find the commits related to your changes that made such a huge improvement in 4.0.5. Is there one you think would be good to look at to highlight the difference? I was just hoping to get a bit more familiar with how it all works internally. Thanks again for whatever you did!

@horta
Collaborator

horta commented Aug 18, 2020

This one: b40fc36
Before, with a dataset having 500k samples, it would read one variant at a time, i.e. pass a single variant offset to the function read_genotype_partition, called from here:

g = read_genotype_partition(bgen_filepath, vaddrs)

Every time the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, reads the header, and calls ftell). So minimizing those calls minimizes that overhead.

There are basically only two overheads that, if minimized, would make everything even faster:

  • Opening the metafile bgen_metafile
  • Opening the bgen file bgen_file

If you have a single core to use and want to process the whole bgen file as fast as possible, it would suffice to do something like:

with bgen_metafile(metafile_filepath) as mf:
    with bgen_file(bgen_filepath) as bgen:
        # Loop over variants in sequence

You can do the above very easily by using this package I released last week which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py

I believe Carl's interface tries to work along those lines, but using memmap from numpy.
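
To make that pattern concrete, here is a rough sketch of a single-core sequential scan with cbgen; I'm assuming cbgen's bgen_file exposes a read_genotype(offset) method (see the cbgen tests for the actual API), and the per-variant offsets are treated as given, since obtaining them from the metafile is a separate step:

from cbgen import bgen_file


def scan_bgen(bgen_filepath, variant_offsets):
    # variant_offsets: per-variant file offsets, obtained once from the
    # metafile/index (how exactly depends on the metafile API).
    # The bgen file is opened once and the variants are read in sequence,
    # so the open/read-header/ftell overhead is paid only once.
    with bgen_file(bgen_filepath) as bgen:
        for offset in variant_offsets:
            # read_genotype is assumed to be cbgen's per-variant reader
            # taking the variant's file offset.
            genotype = bgen.read_genotype(offset)
            yield genotype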

@horta
Collaborator

horta commented Aug 18, 2020

You will find examples of using cbgen here: https://github.com/limix/cbgen-py/blob/master/cbgen/test/test_bgen.py
Or you can wait until we make new releases of bgen-reader.

@eric-czech

Every time the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, reads the header, and calls ftell). So minimizing those calls minimizes that overhead

👍

You can do the above very easily by using this package I released last week which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py

Awesome! Ok, I will give that a spin -- thanks for the examples!

@eric-czech

Hey @horta I tried using cbgen rather than bgen-reader and everything worked nicely!

As an experiment I was going to try to lift the file open/close logic out of _bgen_file.py#L70-L71 and into my own loop, but that's a bit out of my depth. I wasn't sure what the offset should be in:

def read_genotype(self, offset: int) -> Genotype:
    gt: CData = lib.bgen_file_open_genotype(self._bgen_file, offset)
    ...
    lib.bgen_genotype_close(gt)

Is there a fairly easy way to do that?

@eric-czech

Oh, and FYI, here is a resource profile I saw when using cbgen to read a UKB contig:

[image: cbgen resource utilization profile showing sawtooth memory growth]

This is from using a cbgen_reader.py wrapper I made to potentially replace the sgkit-bgen#bgen_reader.py wrapper, and I thought the sawtooth memory usage was odd. I didn't see that earlier in #30 (comment) when reading the variants in order, but I'm not sure what is being "built up" in this case. The wrapper pre-allocates result arrays, so the memory usage shouldn't be so incremental.

Might there be anything in cbgen or bgen_reader that would cause this gradual increase in memory used when reading blocks of variants out-of-order?

@horta
Collaborator

horta commented Aug 19, 2020

Hey @eric-czech, the offset for each variant is the result of ftell at the start of the variant block in the bgen file: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145

There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block for variant n starts without knowing where variant n-1 started (and knowing the size of variant block n-1).

That is one of the reasons index files exist for bgen files.
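
To illustrate why, here is a rough sketch of what skipping from one variant to the next looks like for a layout-2 bgen file; the field sizes are from my own reading of the spec, so treat this as illustrative rather than authoritative:

import struct


def next_variant_offset(f, offset):
    # Skip one layout-2 variant block starting at `offset` and return the
    # offset of the next one. There is no way to jump straight to variant n
    # without having walked past variants 0..n-1 like this.
    f.seek(offset)
    for _ in range(3):  # variant id, rsid, chromosome: 2-byte length + string
        (length,) = struct.unpack("<H", f.read(2))
        f.seek(length, 1)
    f.seek(4, 1)  # variant position (uint32)
    (n_alleles,) = struct.unpack("<H", f.read(2))
    for _ in range(n_alleles):  # each allele: 4-byte length + string
        (length,) = struct.unpack("<I", f.read(4))
        f.seek(length, 1)
    # C: total length of the (possibly compressed) genotype data block.
    (genotype_block_length,) = struct.unpack("<I", f.read(4))
    return f.tell() + genotype_block_length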

@horta
Collaborator

horta commented Aug 19, 2020

Regarding memory, I don't know. Isn't that just garbage collection by Python? (You can try calling gc.collect() to rule that out.)
Also, if the access is out-of-order, and assuming you are reading in a uniformly random fashion, why would that pattern be so regular? It looks odd to me too, but I have to admit I don't have much experience with how garbage collection tends to behave. The cbgen library does not do anything fancy: it allocates numpy arrays, uses the C bgen library to copy the requested data into them, and that is it.

@eric-czech
Copy link

There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block for variant n starts without knowing where variant n-1 started (and knowing the size of variant block n-1).

Is it not possible to:

  • Use the metafile to get the offset to the first variant in a block of variants to be read (block meaning some number of sequential variants)
  • Do a lib.bgen_file_open_genotype(self._bgen_file, offset) with that offset
  • Read data for the first variant
  • "Advance" the open file to the next offset without closing it
  • Read data for the next variant
  • ... and so on

I have no idea how the "Advance the open file" part would work in cbgen -- is that more difficult than it sounds?

Also, if the access is out-of-order, and assuming you are reading in a uniformly random fashion

Mm, well, what I mean is that Dask asks us to fetch some block of adjacent variants (say 100 of them), and there are no guarantees on the order in which it will ask for the blocks, but they do typically go in order. I.e. for two workers, process/thread 1 fetches variants 0 to 100 while process/thread 2 fetches variants 100 to 200, then worker 1 does 200 to 300, etc. "Arbitrary blocks of some size" is probably what I should have called it rather than random.

The cbgen library does not do anything fancy: it allocates numpy arrays, uses the C bgen library to copy the requested data into them, and that is it.

Makes sense. It's probably something in Dask or our wrapper then -- I saw a similar pattern when making a Dask wrapper around pybgen too.

@horta
Collaborator

horta commented Aug 20, 2020

I've created an issue for "advancing the open file". I think it won't change anything: opening variant genotypes in sequence should have the same effect as "advancing the open file". The important bit of the C library for that: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145

I do perform an fseek each time a variant genotype is opened, but I believe the underlying implementation is smart enough to figure out when we are already at the right file position. (I just couldn't find an explicit explanation of that on Google.)

@horta
Collaborator

horta commented Aug 20, 2020

Just to clarify: after you close an opened variant genotype n, the file stream position remains at the start of variant genotype n+1. This means that subsequently opening variant genotype n+1 and reading it should have zero overhead regarding file stream position (I'm assuming here that an fseek to the current position has zero overhead).

@eric-czech

Just to clarify: after you close an opened variant genotype n, the file stream position remains at the start of variant genotype n+1. This means that subsequently opening variant genotype n+1 and reading it should have zero overhead regarding file stream position

Ah ok, thanks @horta.

The biggest killer in reading into a Dask array then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size, say, 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant in each chunk (so I'm reading the same variant data upwards of 10 times for UKB).

Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:

  • Take a single variant and a subset of adjacent samples to get data for
  • Seek to the start of the data for the variant
  • Further seek to the start of the data for the sample block
  • Scan and read all probabilities until you get to the end of the sample block

Let me know if that doesn't make sense.

@tomwhite
Author

The biggest killer in reading into a Dask array then, AFAICT, is the lack of sample slicing for chunks.

Another way of tackling this would be to convert to an intermediate chunked format (like zarr). This would be done once, then analysis would be performed on the intermediate format, which would have efficient, chunked read patterns. Perhaps this could be something we recommend for large numbers of samples?

This is the way I am doing it for VCF, which is the only real option as we don't have offset information for the n-th variant, so we need to convert to something like zarr anyway.

@eric-czech

Another way of tackling this would be to convert to an intermediate chunked format (like zarr).

That's basically what I'm after, a more efficient way to do that conversion. Are you talking about some process for it that doesn't use Dask?

I'd like to think of this as a one-time conversion whose cost we eat, but even now, when I'm using ridiculously wide chunks (~1,000 x ~100,000) to avoid some of the repetitive sample reads, it looks like it will cost roughly $300 in vCPU+memory hours (storage aside). That isn't so bad, I suppose, but I'd like to get those chunks down to more like 10k x 10k, which would hypothetically cost ~5x as much if the write and read workloads are fairly balanced (IO/network bandwidth aside).

@tomwhite
Author

Are you talking about some process for it that doesn't use Dask?

No, it still uses Dask, but instead of using dask.array.from_array to create a Dask array and then write it to Zarr, use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly. This should avoid repetitive sample reads, while still getting chunking in the samples dimension (by setting the Zarr chunks appropriately).

I do something similar for VCF, but it should be simpler for BGEN since you can read variant ranges that align with Zarr chunks, which means you can avoid the concat and rechunk operations that are needed for VCF. See https://github.com/tomwhite/sgkit-vcf/blob/master/sgkit_vcf/vcf_reader.py#L206-L221
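
A minimal sketch of that idea for BGEN, assuming a hypothetical read_variant_range(path, start, stop) that returns the probabilities for a contiguous range of variants (and all their samples) as a NumPy array; each delayed write covers whole Zarr chunks in the variants dimension, so no concat or rechunk is needed:

import dask
import zarr


def bgen_to_zarr(bgen_path, store, n_variants, n_samples, n_probs,
                 variant_chunk=1000, sample_chunk=10000):
    # Output array chunked in both dimensions.
    z = zarr.open_array(
        store,
        mode="w",
        shape=(n_variants, n_samples, n_probs),
        chunks=(variant_chunk, sample_chunk, n_probs),
        dtype="float64",
    )

    @dask.delayed
    def write_range(start, stop):
        # read_variant_range is hypothetical: it reads variants [start, stop)
        # with all their samples in a single pass over that file region.
        z[start:stop] = read_variant_range(bgen_path, start, stop)

    tasks = [
        write_range(start, min(start + variant_chunk, n_variants))
        for start in range(0, n_variants, variant_chunk)
    ]
    dask.compute(*tasks)
    return z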

@eric-czech

use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly

Ahhh I see what you mean. I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on. I'll try a similar bgen_to_zarr function instead.

@tomwhite
Author

I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on.

Exactly. The genotype blocks in BGEN are compressed using zlib or zstd, and so there's no way to avoid decompressing the whole block to only pull out a subset of samples.

@horta
Collaborator

horta commented Aug 24, 2020

Just to clarify: after you close an opened variant genotype n, the file stream position remains at the start of variant genotype n+1. This means that subsequently opening variant genotype n+1 and reading it should have zero overhead regarding file stream position

Ah ok, thanks @horta.

The biggest killer in reading into a Dask array then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size, say, 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant in each chunk (so I'm reading the same variant data upwards of 10 times for UKB).

Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:

* Take a single variant and a subset of _adjacent_ samples to get data for

* Seek to the start of the data for the variant

* Further seek to the start of the data for the sample block

* Scan and read all probabilities until you get to the end of the sample block

Let me know if that doesn't make sense.

If compressed, the whole structure of probabilities, ploidy, phasedness, and missingness for a single variant (which includes all samples) has to be decompressed first. So I think the answer to your question is no whenever the bgen file uses compression (which I believe will be the case most of the time).

If it is not compressed, then it seems possible to read only the probabilities of the samples you are interested in (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html).

@horta
Collaborator

horta commented Aug 24, 2020

Oops, you already figured out the answer. Sorry, I didn't read it before answering!
