Benchmark data #6

RagnarGrootKoerkamp · 2022-03-31T04:41:26Z

Would you be able to share the data (ONT reads and assembly contigs) you used in your paper, or a script used to generate it?

I'd like to test my own code on some non-synthetic read/reference pairs, and yours is the first set of 100k+ long reads I see.

(I'll probably end up making a repository with all the testdata I'm using and a snakemake to reproduce it, so it's easy to reuse for others.)

jeizenga · 2022-03-31T16:10:33Z

All of the data I used was from the year 1 freeze from the HPRC. I believe there are some use restrictions pending publication though, so you should probably look into that before re-hosting the data somewhere else.

RagnarGrootKoerkamp · 2022-03-31T16:36:14Z

Ok, thanks.

I'm not familiar yet with all the tools you mention to convert the raw data to (read, corresponding reference) pairs, like Winnowmap and the derivation of alignment parameters. If you have a script for this it would be great if you could share it to save me some time figuring everything out ;)

jeizenga · 2022-03-31T17:41:23Z

The scripts are all included in this repository (although not especially well documented). My benchmarking tool uses FASTA format for all of the inputs. To extract FASTAs from the HPRC data, I used a two-step process where you first get the read sequences:

samtools view reads.bam | ./scrape_sequences.py MIN_LENGTH MIN_MAPQ READ_FASTA_OUT_DIR SUMMARY_TABLE_OUT

And then use the summary table from the first script to extract the corresponding sequence from the reference:

./extract_ref_sequences.py REF_FASTA_OUT_DIR SUMMARY_TABLE MATERNAL_REF_FASTA PATERNAL_REF_FASTA

RagnarGrootKoerkamp · 2022-03-31T18:35:25Z

Allright, that should get me started.
If my understanding is correct, The only data files I need are these:

Maternal & Paternal reference
HG002_ONT_... here

Then I use Winnowmap to align the reads onto the two(?) references to get a .bam file, which is then processed by your scrape_sequences.py and extract_ref_sequences.py.

Then the only remaining question is what value you use for MIN_MAPQ. (Or is this the 95% sequence identity mentioned in the paper?)

jeizenga · 2022-03-31T19:27:01Z

I think I used a minimum MAPQ of 30, which is a pretty standard threshold for a confidently mapped read (it corresponds to an estimated probability of error of 0.1%).

I actually wasn't the one who did the alignments with Winnowmap, but I'm now remembering that there was extra work to handle the maternal and paternal references. I believe he mapped to both of them independently and then chose the better of the two alignments for each read. I'll ask to be sure though.

The 95% sequence identity was only used to determine appropriate alignment scoring parameters.

AndreaGuarracino mentioned this issue Apr 19, 2022

Data sources smarco/BiWFA-paper#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark data #6

Benchmark data #6

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading

jeizenga commented Mar 31, 2022

Benchmark data #6

Benchmark data #6

Comments

RagnarGrootKoerkamp commented Mar 31, 2022 • edited Loading

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022 • edited Loading

jeizenga commented Mar 31, 2022

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading

RagnarGrootKoerkamp commented Mar 31, 2022 •

edited

Loading