Benchmark data #6
Comments
All of the data I used came from the HPRC year 1 freeze. I believe there are some use restrictions pending publication though, so you should probably look into that before re-hosting the data somewhere else.
Ok, thanks. I'm not familiar yet with all the tools you mention to convert the raw data to |
The scripts are all included in this repository (although not especially well documented). My benchmarking tool uses FASTA format for all of the inputs. To extract FASTAs from the HPRC data, I used a two-step process where you first get the read sequences:
And then use the summary table from the first script to extract the corresponding sequence from the reference:
All right, that should get me started.
Then I use Winnowmap to align the reads onto the two(?) references to get a Then the only remaining question is what value you use for |
I think I used a minimum MAPQ of 30, which is a pretty standard threshold for a confidently mapped read (it corresponds to an estimated error probability of 0.1%). I actually wasn't the one who did the alignments with Winnowmap, but I'm now remembering that there was extra work to handle the maternal and paternal references. I believe the person who ran them mapped to both references independently and then chose the better of the two alignments for each read. I'll ask to be sure though. The 95% sequence identity was only used to determine appropriate alignment scoring parameters.
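For reference, MAPQ is Phred-scaled, so the estimated probability that a mapping is wrong is 10^(-MAPQ/10), which gives 0.001 at MAPQ 30. Below is a minimal Python sketch of the threshold and the best-of-two selection. The `Aln` record, its field names, and the choice of "better" as the higher alignment score (ties broken by MAPQ) are all illustrative assumptions, not taken from the actual pipeline, and the order of filtering versus selecting is likewise a guess.

```python
from typing import NamedTuple, Optional


class Aln(NamedTuple):
    """Hypothetical per-read alignment summary (not the pipeline's real format)."""
    read: str
    ref: str    # "maternal" or "paternal" (illustrative labels)
    mapq: int   # Phred-scaled mapping quality
    score: int  # alignment score, e.g. a SAM AS tag (assumed selection criterion)


def mapq_to_error_prob(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ into the estimated probability of a wrong mapping."""
    return 10 ** (-mapq / 10)


def best_confident(mat: Optional[Aln], pat: Optional[Aln],
                   min_mapq: int = 30) -> Optional[Aln]:
    """Keep alignments with MAPQ >= min_mapq, then return the higher-scoring one.

    Returns None when neither alignment passes the threshold.
    """
    candidates = [a for a in (mat, pat) if a is not None and a.mapq >= min_mapq]
    if not candidates:
        return None
    # Assumption: "better" means higher alignment score, with MAPQ as tie-breaker.
    return max(candidates, key=lambda a: (a.score, a.mapq))


if __name__ == "__main__":
    print(mapq_to_error_prob(30))
    mat = Aln("read1", "maternal", mapq=60, score=900)
    pat = Aln("read1", "paternal", mapq=35, score=950)
    print(best_confident(mat, pat))
```

If the real selection used a different criterion (for example edit distance or identity), only the `key` function would need to change.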
Would you be able to share the data (ONT reads and assembly contigs) you used in your paper, or a script used to generate it?
I'd like to test my own code on some non-synthetic read/reference pairs, and yours is the first set of 100k+ long reads I've seen.
(I'll probably end up making a repository with all the test data I'm using and a Snakemake workflow to reproduce it, so it's easy for others to reuse.)