parallel read #12
That sounds promising. We may assign each thread a block of, say, 128 VCF records to read. The number 128 would need to be tuned for best performance.
…On Wed, Aug 5, 2020 at 7:07 PM Benjamin Chu wrote:
As suggested by @Hua-Zhou <https://github.com/Hua-Zhou>, the `nth` function in IterTools <https://juliacollections.github.io/IterTools.jl/latest/#nth(xs,-n)-1> can indeed get the nth item of a VCF reader. I will implement a parallel read routine in the next few days.
Does the original code spend most of its time parsing VCF records or iterating through the file? I imagine iterating through the file should be fast (just seek to the specific line of the text file). If most of the time is spent parsing the VCF records, then parallel reading may still be beneficial.
…On Thu, Aug 6, 2020 at 12:08 PM Benjamin Chu wrote:
Actually, `nth` will not work... The code basically iterates until the desired record instead of behaving like a pointer, so each thread would still have to iterate through the entire file.
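To see why `nth` doesn't help here: a minimal version of it (sketched below, not IterTools' actual implementation) just advances the iterator item by item, so fetching record k costs O(k) work. Seeding each thread at its block boundary would therefore re-scan the whole file prefix for every thread.

```julia
# Minimal nth over any iterator: advance one item at a time.
# Fetching the k-th item is O(k), so starting T threads at
# offsets k, 2k, ..., Tk repeatedly re-reads the file prefix.
function nth_linear(itr, n::Int)
    for (i, x) in enumerate(itr)
        i == n && return x
    end
    error("iterator has fewer than $n items")
end

nth_linear(10:10:100, 3)  # returns 30
```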
I see. That's helpful to know. Then we need to figure out other strategies to speed up the reading process.
…On Thu, Aug 6, 2020 at 6:12 PM Benjamin Chu wrote:
[profiling screenshot] <https://user-images.githubusercontent.com/16760873/89597751-d81b5380-d80f-11ea-944d-15f9003caba9.png>
Yes, iterating through the file takes ~80% of total time, followed by `nrecords`, which takes about 15%. Parsing the records takes only ~5%.
Another possibility is to split the file into x chunks by line, copy the meta information into each, and then read them separately. However, a gz file cannot be split into smaller gz files that are each independently decompressible. Maybe we could first decompress the gz file, then do the splitting and reading.
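The split-with-header-copied idea can be sketched for a *decompressed* VCF. This hypothetical helper works on the file contents as a string: it copies the meta/header lines (those starting with `#`) into each chunk so that every piece is independently parseable.

```julia
# Sketch: split a decompressed VCF text into chunks of `chunksize`
# records, prepending the meta/header lines ('#'-prefixed) to each
# chunk so each piece parses on its own.
function split_vcf(text::AbstractString, chunksize::Int)
    lines = split(text, '\n'; keepempty=false)
    header = filter(l -> startswith(l, "#"), lines)
    records = filter(l -> !startswith(l, "#"), lines)
    chunks = String[]
    for lo in 1:chunksize:length(records)
        hi = min(lo + chunksize - 1, length(records))
        push!(chunks, join(vcat(header, records[lo:hi]), '\n'))
    end
    return chunks
end
```

For example, `split_vcf("##fileformat=VCFv4.2\n#CHROM\tPOS\nr1\nr2\nr3", 2)` yields two chunks, each beginning with both header lines. The cost of decompressing and rewriting the whole file first is the obvious downside.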
Isn't this usually done by indexing a block-gzipped file? The index allows one to jump quickly to near the position of interest, and the block-gzip format allows unzipping at block borders. That is why this compression is so popular in bioinformatics (BAM uses it per the standard, see e.g. here, and it's very common for VCF too - e.g. …
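For reference on how the jumping works: BGZF (the block-gzip format used by bgzip/tabix) concatenates independently decompressible blocks, and indexes store "virtual file offsets" that pack the compressed offset of a block's start and the uncompressed offset within that block into one 64-bit integer. A sketch of that encoding, following the published BGZF layout (upper 48 bits compressed offset, lower 16 bits within-block offset):

```julia
# BGZF virtual offset: upper 48 bits = compressed byte offset of the
# block's start in the file; lower 16 bits = uncompressed byte offset
# within that block (each block decompresses to at most 64 KiB).
voffset(coff, uoff) = (UInt64(coff) << 16) | UInt64(uoff)
coffset(v::UInt64) = v >> 16        # seek here, then inflate the block
uoffset(v::UInt64) = v & 0xFFFF     # skip this far into the inflated data

v = voffset(123456, 789)
(coffset(v), uoffset(v))  # recovers (123456, 789)
```

So a reader seeks to `coffset`, inflates just that block, and skips `uoffset` bytes — no need to decompress the prefix of the file, which is exactly what parallel chunked reading needs.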