
Workflow for large number of genomes (Population) #52

Open
djakubosky opened this issue Sep 26, 2019 · 4 comments

@djakubosky

Hi, I was curious about the suggested workflow for WHAMG on ~1K genomes. My assumption is that this is what you would want to do:

  1. Run WHAMG individually on all genomes -> many vcfs
  2. Filtering? - (not sure if you'd suggest filtering at this stage of the process on each individual VCF)
  3. Run mergeSVcallers on these VCFs to create a set of positions
  4. Genotype the putative SVs at these positions with something (e.g. SVTyper)
  5. Merge genotyped variants into one VCF

Does this sound reasonable? If this is the proposed approach, it might be helpful to add a little more detail to the wiki!

Thanks for a nice tool!
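The five steps above could be sketched as a small command generator; to be clear, the flags, sample names, and file paths below are illustrative guesses rather than the tools' confirmed interfaces (check `whamg`, `mergeSVcallers`, and `svtyper` usage text before running anything):

```python
# Sketch of the proposed 5-step workflow as a command generator.
# All tool flags here are assumptions -- verify against each tool's --help.

def build_commands(samples, ref="ref.fa"):
    """Return shell commands for steps 1, 3 and 4 of the workflow."""
    cmds = []
    # Step 1: run whamg individually on every genome -> one VCF per sample
    for s in samples:
        cmds.append(f"whamg -x 4 -a {ref} -f {s}.bam > {s}.wham.vcf")
    # Step 3: merge the per-sample VCFs into one set of candidate positions
    vcfs = ",".join(f"{s}.wham.vcf" for s in samples)
    tags = ",".join("WHAM" for _ in samples)
    cmds.append(f"mergeSVcallers -a {ref} -f {vcfs} -t {tags} > sites.vcf")
    # Step 4: genotype the candidate sites in every sample (e.g. with SVTyper)
    for s in samples:
        cmds.append(f"svtyper -i sites.vcf -B {s}.bam > {s}.gt.vcf")
    return cmds

for cmd in build_commands(["sampleA", "sampleB"]):
    print(cmd)
```

Step 2 (per-VCF filtering) and step 5 (the final merge of genotyped VCFs) are left out above, since it is not yet clear whether filtering belongs before or after merging.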

@zeeev zeeev added the question label Sep 27, 2019
@zeeev
Owner

zeeev commented Sep 27, 2019

Hi @djakubosky,

Thanks for reaching out; these are really good questions, especially for large cohorts. I've made some bullet points that might help guide you. If you find this helpful or come up with your own tricks, I'd like to add them to the wiki.

  • If you have related individuals, batch them together. WhamG becomes more sensitive when you run a trio or quad together. However, there are performance considerations, mostly RAM: WhamG stores the discordant read-pair mappings in memory.
  • WhamG accepts an exclusion list; you'll want to use this. If I remember correctly it's just a list of sequences (not coordinates). Chromosome Y, alt haplotypes, and bait sequences are smart to ignore for WhamG, and it will decrease your runtimes.
  • After you generate the calls, take a look at the distributions of the reported INFO fields. For example, I like to look at the support, or depth, of the calls. I designed WhamG to be sensitive, so you'll have false positives. In particular, calls in segmental duplication regions are highly suspicious.
  • Merging helps. If you can run at least 2-3 different callers you gain both specificity and sensitivity. Lumpy/WhamG/Manta seem to be a good combo, although WhamG and Lumpy tend to have a fair amount of overlap (see the 1KG phase 3 paper). Merging different callers can be a real pain. There are several tools out there; I wrote one, but I don't know if it's best in class, that was a while ago.
  • For the 1KG project, I found that genotyping the calls helped a lot; I used SVTYPER. You can do a little Mendelian or Hardy–Weinberg filtering. I will caution that SVTYPER does have a high FN rate (I don't have anything formal to show you).
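Looking at the support distributions and then cutting on a threshold could be as simple as the sketch below. Note that `SU` is a placeholder INFO tag for read support; whamg's actual INFO keys may differ, so inspect your VCF header for the right one:

```python
# Minimal sketch of post-call filtering on an INFO support field.
# "SU" is a hypothetical tag name -- substitute the real one from your header.

def info_field(record, key):
    """Extract an INFO value (as float) from a tab-delimited VCF line."""
    info = record.rstrip("\n").split("\t")[7]  # INFO is VCF column 8
    for kv in info.split(";"):
        if kv.startswith(key + "="):
            return float(kv.split("=", 1)[1])
    return None

def filter_by_support(records, key="SU", min_support=5):
    """Keep header lines and records at or above the support threshold."""
    kept = []
    for rec in records:
        if rec.startswith("#"):
            kept.append(rec)          # always keep header lines
            continue
        su = info_field(rec, key)
        if su is not None and su >= min_support:
            kept.append(rec)
    return kept

vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tN\t<DEL>\t.\t.\tSVTYPE=DEL;SU=12",
    "chr1\t500\t.\tN\t<DUP>\t.\t.\tSVTYPE=DUP;SU=2",
]
print(filter_by_support(vcf))  # the low-support DUP record is dropped
```

In practice you would plot the support/depth distribution first and pick the threshold from the data rather than hard-coding one.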

@djakubosky
Author

Hi @zeeev,
Thanks for all the info on this; these answers are very helpful. I think I follow almost everything you are saying here. I still have one question: when I generate separate VCFs for each individual (as I won't be able to run them jointly with so many samples), will mergeSVcallers allow me to merge them within and between samples to arrive at a single consensus set of variant positions? And is there some basic filtering to do before running the merge? Ideally, of course, it would be good to genotype fewer sites, as SVTYPER is kinda slow sometimes.

One more thing: I will be able to assess the quality of variants somewhat indirectly using their reproducibility in pairs of twins in my cohort. I'm happy to share these results with you to give you something slightly more "formal" to illustrate these FN/FP rates; see manuscript here if interested.

@zeeev
Owner

zeeev commented Oct 1, 2019

@djakubosky,

Yes, mergeSVcallers can be used to merge within samples, and then between them.
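A two-stage merge along those lines might look like the sketch below; the `mergeSVcallers` flags and the `-t` caller tags are assumptions based on the discussion above, not a confirmed interface, so check the tool's usage text:

```python
# Sketch of a two-stage merge: first within each sample (e.g. when a sample
# has VCFs from several callers), then across samples into one consensus set.
# Flags are illustrative; -t is assumed to label the source of each input VCF.

def two_stage_merge(per_sample_vcfs, ref="ref.fa", tag="WHAM"):
    """per_sample_vcfs maps sample name -> list of that sample's VCFs."""
    cmds = []
    merged = []
    # Stage 1: merge within each sample
    for sample, vcfs in per_sample_vcfs.items():
        out = f"{sample}.merged.vcf"
        cmds.append(f"mergeSVcallers -a {ref} -f {','.join(vcfs)} "
                    f"-t {','.join(tag for _ in vcfs)} > {out}")
        merged.append(out)
    # Stage 2: merge between samples into one consensus site list
    cmds.append(f"mergeSVcallers -a {ref} -f {','.join(merged)} "
                f"-t {','.join(tag for _ in merged)} > consensus.vcf")
    return cmds

for c in two_stage_merge({"s1": ["s1.wham.vcf", "s1.lumpy.vcf"],
                          "s2": ["s2.wham.vcf"]}):
    print(c)
```

With real multi-caller input you would pass per-caller tags (e.g. WHAM,LUMPY) in stage 1 rather than a single placeholder.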

You mentioned that you have families in your cohort? I'd joint call closely related individuals (up to 3-4 individuals).

--Zev

P.S. nice paper!

@djakubosky
Author

djakubosky commented Oct 1, 2019 via email
