parallel read #12

Open
biona001 opened this issue Aug 6, 2020 · 7 comments

Comments

@biona001
Member

biona001 commented Aug 6, 2020

As suggested by @Hua-Zhou, the nth function in IterTools can indeed get the nth item of a VCF reader. I will implement a parallel read routine in the next few days.

@Hua-Zhou
Member

Hua-Zhou commented Aug 6, 2020 via email

@biona001
Member Author

biona001 commented Aug 6, 2020

Actually, nth will not work. The code iterates record by record until it reaches the desired one instead of jumping there like a pointer. Each thread would still have to iterate through the file from the beginning to reach its records.
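
To illustrate the point, here is a minimal Python sketch (not the Julia IterTools code discussed above). The names `counting_records` and `nth` are hypothetical helpers written for this example. It counts how many records a sequential iterator has to produce before the requested one is returned.

```python
from itertools import islice

def counting_records(n_records, counter):
    """Yield fake record strings, bumping counter["reads"] each time one is produced."""
    for i in range(n_records):
        counter["reads"] += 1
        yield f"record_{i}"

def nth(iterable, n):
    """Return the n-th item (0-based) by advancing the iterator, or None if it is too short."""
    return next(islice(iterable, n, None), None)

counter = {"reads": 0}
item = nth(counting_records(1000, counter), 499)
# Every record before the requested one was produced and discarded.
print(item, counter["reads"])
```

Fetching the record at position 499 still produces all 500 records up to and including it, so a thread that starts in the middle of the file pays for everything before its starting point.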

@Hua-Zhou
Member

Hua-Zhou commented Aug 6, 2020 via email

@biona001
Member Author

biona001 commented Aug 7, 2020

[Screenshot: Screen Shot 2020-08-06 at 6 08 25 PM]

Yes, iterating through the file takes ~80% of the total time, followed by nrecords, which takes about 15%. Parsing the records takes only ~5%.

@Hua-Zhou
Member

Hua-Zhou commented Aug 7, 2020 via email

@biona001
Member Author

Another possibility is to split the file into x chunks by line, copy the meta information into each chunk, and then read the chunks separately. However, an ordinary gz file cannot simply be cut into smaller gz files that are each decompressible.

Maybe it is possible to first decompress the gz file, then do the splitting and reading.
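
A rough Python sketch of the splitting idea, assuming the file has already been decompressed into text lines and that meta/header lines are exactly those starting with `#`. The function name `split_vcf_lines` and the toy records are made up for illustration.

```python
def split_vcf_lines(lines, n_chunks):
    """Split VCF text lines into at most n_chunks lists, each starting with the full header."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if not ln.startswith("#")]
    # Ceiling division so that no more than n_chunks chunks are produced.
    size = -(-len(records) // n_chunks) if records else 0
    chunks = []
    for start in range(0, len(records), size or 1):
        chunks.append(header + records[start:start + size])
    return chunks

lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
] + [f"1\t{100 + i}\t.\tA\tG\t.\t.\t.\n" for i in range(5)]

chunks = split_vcf_lines(lines, 2)
print([len(c) for c in chunks])
```

Each chunk is a small self-contained VCF, so separate workers could parse them independently. The cost is a full decompression and a copy of the data up front.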

@janxkoci

Isn't this usually done by indexing a block-gzipped file? The index lets you jump quickly to near the position of interest, and the block gzip format allows decompression to start at block boundaries. That is why this compression is so popular in bioinformatics (BAM uses it as standard, see e.g. here, and it's very common for VCF too, e.g. bcftools uses this compression, and bgzip can be used to recompress gzipped VCFs).
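
A simplified Python sketch of why block compression helps. This is an analogy only, not the real BGZF format or a tabix index. Real BGZF stores each block's size inside the gzip header, and the index maps genomic positions to offsets. Here the hypothetical helpers `compress_in_blocks` and `read_block` compress each group of lines as its own gzip member and keep a plain list of byte offsets.

```python
import gzip

def compress_in_blocks(lines, lines_per_block):
    """Compress lines block by block; return the bytes and an (offset, length) entry per block."""
    data = bytearray()
    index = []
    for start in range(0, len(lines), lines_per_block):
        block = "".join(lines[start:start + lines_per_block]).encode()
        member = gzip.compress(block)  # each block is an independent gzip member
        index.append((len(data), len(member)))
        data += member
    return bytes(data), index

def read_block(data, index, i):
    """Decompress only block i by slicing at its recorded offset."""
    offset, length = index[i]
    return gzip.decompress(data[offset:offset + length]).decode().splitlines(keepends=True)

lines = [f"1\t{100 + i}\t.\tA\tG\n" for i in range(7)]
data, index = compress_in_blocks(lines, 2)
print(read_block(data, index, 2))
```

The concatenated members still form one valid gzip stream, but any block can be decompressed on its own, so each thread could start at a different offset without reading what comes before it.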
