
Workflow for large number of genomes (Population) #52

Open
djakubosky opened this issue Sep 26, 2019 · 4 comments

@djakubosky

Hi, I was curious about the suggested workflow for WHAMG on ~1K genomes. My assumption is that this is what you would want to do:

  1. Run WHAMG individually on all genomes -> many vcfs
  2. Filtering? - (not sure if you'd suggest filtering at this stage of the process on each individual VCF)
  3. Run mergeSVcallers on these VCFs to create a set of positions
  4. Genotype the putative SVs at these positions with something (e.g. SVTyper)
  5. Merge genotyped variants into one VCF

Does this sound reasonable? If this is the proposed approach, it might be helpful to add a little more detail to the wiki!

Thanks for a nice tool!
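The five steps above could be sketched as a small command generator; to be clear, the flags, sample names, and file paths below are illustrative guesses rather than the tools' confirmed interfaces (check `whamg`, `mergeSVcallers`, and `svtyper` usage text before running anything):

```python
# Sketch of the proposed 5-step workflow as a command generator.
# All tool flags here are assumptions -- verify against each tool's --help.

def build_commands(samples, ref="ref.fa"):
    """Return shell commands for steps 1, 3 and 4 of the workflow."""
    cmds = []
    # Step 1: run whamg individually on every genome -> one VCF per sample
    for s in samples:
        cmds.append(f"whamg -x 4 -a {ref} -f {s}.bam > {s}.wham.vcf")
    # Step 3: merge the per-sample VCFs into one set of candidate positions
    vcfs = ",".join(f"{s}.wham.vcf" for s in samples)
    tags = ",".join("WHAM" for _ in samples)
    cmds.append(f"mergeSVcallers -a {ref} -f {vcfs} -t {tags} > sites.vcf")
    # Step 4: genotype the candidate sites in every sample (e.g. with SVTyper)
    for s in samples:
        cmds.append(f"svtyper -i sites.vcf -B {s}.bam > {s}.gt.vcf")
    return cmds

for cmd in build_commands(["sampleA", "sampleB"]):
    print(cmd)
```

Step 2 (per-VCF filtering) and step 5 (the final merge of genotyped VCFs) are left out above, since it is not yet clear whether filtering belongs before or after merging.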

@zeeev zeeev added the question label Sep 27, 2019
@zeeev
Owner

zeeev commented Sep 27, 2019

Hi @djakubosky,

Thanks for reaching out; these are really good questions, especially for large cohorts. I've made some bullet points that might help guide you. If you find this helpful or come up with your own tricks, I'd like to add them to the wiki.

  • If you have related individuals, batch them together. WhamG becomes more sensitive when you run a trio or quad together. However, there are performance considerations, mostly RAM: WhamG stores the discordant read-pair mappings in memory.
  • WhamG accepts an exclusion list; you'll want to use this. If I remember correctly it's just a list of sequences (not coordinates). Chromosome Y, alt haplotypes, and bait sequences are smart to ignore for WhamG, and it will decrease your runtimes.
  • After you generate the calls, take a look at the distributions of the reported INFO fields. For example, I like to look at the support, or depth, of the calls. I designed WhamG to be sensitive, so you'll have false positives. In particular, calls in segmental duplication regions are highly suspicious.
  • Merging helps. If you can run at least 2-3 different callers you gain both specificity and sensitivity. Lumpy/WhamG/Manta seem to be a good combo, although WhamG and Lumpy tend to have a fair amount of overlap (see the 1KG phase 3 paper). Merging different callers can be a real pain. There are several tools out there; I wrote one, but I don't know if it's best in class, that was a while ago.
  • For the 1KG project, I found that genotyping the calls helped a lot; I used SVTYPER. You can do a little Mendelian or Hardy–Weinberg filtering. I will caution that SVTYPER does have a high FN rate (I don't have anything formal to show you).
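Looking at the support distributions and then cutting on a threshold could be as simple as the sketch below. Note that `SU` is a placeholder INFO tag for read support; whamg's actual INFO keys may differ, so inspect your VCF header for the right one:

```python
# Minimal sketch of post-call filtering on an INFO support field.
# "SU" is a hypothetical tag name -- substitute the real one from your header.

def info_field(record, key):
    """Extract an INFO value (as float) from a tab-delimited VCF line."""
    info = record.rstrip("\n").split("\t")[7]  # INFO is VCF column 8
    for kv in info.split(";"):
        if kv.startswith(key + "="):
            return float(kv.split("=", 1)[1])
    return None

def filter_by_support(records, key="SU", min_support=5):
    """Keep header lines and records at or above the support threshold."""
    kept = []
    for rec in records:
        if rec.startswith("#"):
            kept.append(rec)          # always keep header lines
            continue
        su = info_field(rec, key)
        if su is not None and su >= min_support:
            kept.append(rec)
    return kept

vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tN\t<DEL>\t.\t.\tSVTYPE=DEL;SU=12",
    "chr1\t500\t.\tN\t<DUP>\t.\t.\tSVTYPE=DUP;SU=2",
]
print(filter_by_support(vcf))  # the low-support DUP record is dropped
```

In practice you would plot the support/depth distribution first and pick the threshold from the data rather than hard-coding one.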

@djakubosky
Author

Hi @zeeev,
Thanks for all the info on this; these answers are very helpful. I think I follow almost everything you are saying here. I still have one question: when I generate separate VCFs for each individual (as I won't be able to run them jointly with so many samples), will mergeSVcallers allow me to merge them within and between samples to arrive at a single consensus set of variant positions? And is there some basic filtering to do before running the merge? Ideally, of course, it would be good to genotype fewer sites, as SVTYPER is kinda slow sometimes.

One more thing: I will be able to assess the quality of variants somewhat indirectly using their reproducibility in pairs of twins in my cohort. I'm happy to share these results with you to give you something slightly more "formal" to illustrate these FN/FP rates; see manuscript here if interested.

@zeeev
Owner

zeeev commented Oct 1, 2019

@djakubosky,

Yes, mergeSVcallers can be used to merge within samples, and then between them.
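A two-stage merge along those lines might look like the sketch below; the `mergeSVcallers` flags and the `-t` caller tags are assumptions based on the discussion above, not a confirmed interface, so check the tool's usage text:

```python
# Sketch of a two-stage merge: first within each sample (e.g. when a sample
# has VCFs from several callers), then across samples into one consensus set.
# Flags are illustrative; -t is assumed to label the source of each input VCF.

def two_stage_merge(per_sample_vcfs, ref="ref.fa", tag="WHAM"):
    """per_sample_vcfs maps sample name -> list of that sample's VCFs."""
    cmds = []
    merged = []
    # Stage 1: merge within each sample
    for sample, vcfs in per_sample_vcfs.items():
        out = f"{sample}.merged.vcf"
        cmds.append(f"mergeSVcallers -a {ref} -f {','.join(vcfs)} "
                    f"-t {','.join(tag for _ in vcfs)} > {out}")
        merged.append(out)
    # Stage 2: merge between samples into one consensus site list
    cmds.append(f"mergeSVcallers -a {ref} -f {','.join(merged)} "
                f"-t {','.join(tag for _ in merged)} > consensus.vcf")
    return cmds

for c in two_stage_merge({"s1": ["s1.wham.vcf", "s1.lumpy.vcf"],
                          "s2": ["s2.wham.vcf"]}):
    print(c)
```

With real multi-caller input you would pass per-caller tags (e.g. WHAM,LUMPY) in stage 1 rather than a single placeholder.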

You mentioned that you have families in your cohort? I'd joint call closely related individuals (up to 3-4 individuals).

--Zev

P.S. nice paper!

@djakubosky
Author

djakubosky commented Oct 1, 2019 via email
