Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark data #6

Open
RagnarGrootKoerkamp opened this issue Mar 31, 2022 · 5 comments
Open

Benchmark data #6

RagnarGrootKoerkamp opened this issue Mar 31, 2022 · 5 comments

Comments

@RagnarGrootKoerkamp
Copy link

RagnarGrootKoerkamp commented Mar 31, 2022

Would you be able to share the data (ONT reads and assembly contigs) you used in your paper, or a script used to generate it?

I'd like to test my own code on some non-synthetic read/reference pairs, and yours is the first set of 100k+ long reads I see.

(I'll probably end up making a repository with all the testdata I'm using and a snakemake to reproduce it, so it's easy to reuse for others.)

@jeizenga
Copy link
Owner

All of the data I used was from the year 1 freeze from the HPRC. I believe there are some use restrictions pending publication though, so you should probably look into that before re-hosting the data somewhere else.

@RagnarGrootKoerkamp
Copy link
Author

Ok, thanks.

I'm not familiar yet with all the tools you mention to convert the raw data to (read, corresponding reference) pairs, like Winnowmap and the derivation of alignment parameters. If you have a script for this it would be great if you could share it to save me some time figuring everything out ;)

@jeizenga
Copy link
Owner

The scripts are all included in this repository (although not especially well documented). My benchmarking tool uses FASTA format for all of the inputs. To extract FASTAs from the HPRC data, I used a two-step process where you first get the read sequences:

samtools view reads.bam | ./scrape_sequences.py MIN_LENGTH MIN_MAPQ READ_FASTA_OUT_DIR SUMMARY_TABLE_OUT

And then use the summary table from the first script to extract the corresponding sequence from the reference:

./extract_ref_sequences.py REF_FASTA_OUT_DIR SUMMARY_TABLE MATERNAL_REF_FASTA PATERNAL_REF_FASTA

@RagnarGrootKoerkamp
Copy link
Author

RagnarGrootKoerkamp commented Mar 31, 2022

Allright, that should get me started.
If my understanding is correct, The only data files I need are these:

Then I use Winnowmap to align the reads onto the two(?) references to get a .bam file, which is then processed by your scrape_sequences.py and extract_ref_sequences.py.

Then the only remaining question is what value you use for MIN_MAPQ. (Or is this the 95% sequence identity mentioned in the paper?)

@jeizenga
Copy link
Owner

I think I used a minimum MAPQ of 30, which is a pretty standard threshold for a confidently mapped read (it corresponds to an estimated probability of error of 0.1%).

I actually wasn't the one who did the alignments with Winnowmap, but I'm now remembering that there was extra work to handle the maternal and paternal references. I believe he mapped to both of them independently and then chose the better of the two alignments for each read. I'll ask to be sure though.

The 95% sequence identity was only used to determine appropriate alignment scoring parameters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants