parallel read #12
That sounds promising. We may assign each thread a block of, say, 128 VCF records to read. The number 128 would need to be tuned for best performance.
…On Wed, Aug 5, 2020 at 7:07 PM Benjamin Chu wrote:
As suggested by @Hua-Zhou <https://github.com/Hua-Zhou>, the `nth` function in IterTools <https://juliacollections.github.io/IterTools.jl/latest/#nth(xs,-n)-1> can indeed get the nth item of a VCF reader. I will implement a parallel read routine in the next few days.
Does the original code spend most of its time parsing VCF records or iterating through the file? I imagine iterating through the file should be fast (just seek to the specific line of the text file). If most of the time is spent parsing the VCF records, then parallel reading may still be beneficial.
…On Thu, Aug 6, 2020 at 12:08 PM Benjamin Chu wrote:
Actually, `nth` will not work... The code basically iterates until the desired record instead of behaving like a pointer, so each thread would still have to iterate through the entire file.
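To see why `nth` doesn't help here: a minimal version of it (sketched below, not IterTools' actual implementation) just advances the iterator item by item, so fetching record k costs O(k) work. Seeding each thread at its block boundary would therefore re-scan the whole file prefix for every thread.

```julia
# Minimal nth over any iterator: advance one item at a time.
# Fetching the k-th item is O(k), so starting T threads at
# offsets k, 2k, ..., Tk repeatedly re-reads the file prefix.
function nth_linear(itr, n::Int)
    for (i, x) in enumerate(itr)
        i == n && return x
    end
    error("iterator has fewer than $n items")
end

nth_linear(10:10:100, 3)  # returns 30
```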
I see. That's helpful to know. Then we need to figure out other strategies to speed up the reading process.
…On Thu, Aug 6, 2020 at 6:12 PM Benjamin Chu wrote:
[profiling screenshot] <https://user-images.githubusercontent.com/16760873/89597751-d81b5380-d80f-11ea-944d-15f9003caba9.png>
Yes, iterating through the file takes ~80% of total time, followed by `nrecords`, which takes about 15%. Parsing the records takes only ~5%.
Another possibility is to split the file into x chunks by line, copy the meta information into each, and then read them separately. However, a gz file cannot be split into smaller gz files that are each independently decompressible. Maybe we could first decompress the gz file, then do the splitting and reading.
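The split-with-header-copied idea can be sketched for a *decompressed* VCF. This hypothetical helper works on the file contents as a string: it copies the meta/header lines (those starting with `#`) into each chunk so that every piece is independently parseable.

```julia
# Sketch: split a decompressed VCF text into chunks of `chunksize`
# records, prepending the meta/header lines ('#'-prefixed) to each
# chunk so each piece parses on its own.
function split_vcf(text::AbstractString, chunksize::Int)
    lines = split(text, '\n'; keepempty=false)
    header = filter(l -> startswith(l, "#"), lines)
    records = filter(l -> !startswith(l, "#"), lines)
    chunks = String[]
    for lo in 1:chunksize:length(records)
        hi = min(lo + chunksize - 1, length(records))
        push!(chunks, join(vcat(header, records[lo:hi]), '\n'))
    end
    return chunks
end
```

For example, `split_vcf("##fileformat=VCFv4.2\n#CHROM\tPOS\nr1\nr2\nr3", 2)` yields two chunks, each beginning with both header lines. The cost of decompressing and rewriting the whole file first is the obvious downside.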
Isn't this usually done by indexing a block-gzipped file? The index allows one to jump quickly to near the position of interest, and the block-gzip format allows unzipping at block borders. That is why this compression is so popular in bioinformatics (BAM uses it per the standard, see e.g. here, and it's very common for VCF too - e.g. …
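For reference on how the jumping works: BGZF (the block-gzip format used by bgzip/tabix) concatenates independently decompressible blocks, and indexes store "virtual file offsets" that pack the compressed offset of a block's start and the uncompressed offset within that block into one 64-bit integer. A sketch of that encoding, following the published BGZF layout (upper 48 bits compressed offset, lower 16 bits within-block offset):

```julia
# BGZF virtual offset: upper 48 bits = compressed byte offset of the
# block's start in the file; lower 16 bits = uncompressed byte offset
# within that block (each block decompresses to at most 64 KiB).
voffset(coff, uoff) = (UInt64(coff) << 16) | UInt64(uoff)
coffset(v::UInt64) = v >> 16        # seek here, then inflate the block
uoffset(v::UInt64) = v & 0xFFFF     # skip this far into the inflated data

v = voffset(123456, 789)
(coffset(v), uoffset(v))  # recovers (123456, 789)
```

So a reader seeks to `coffset`, inflates just that block, and skips `uoffset` bytes — no need to decompress the prefix of the file, which is exactly what parallel chunked reading needs.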