Add ploidy option #5

Open
hepcat72 opened this issue Apr 8, 2019 · 0 comments


hepcat72 commented Apr 8, 2019

Add --ploidy to vcfSampleCompare. This will affect 2 things:

  • --genotype (--nogenotype)
  • --separation-gap (-a)

Genotype calls can be haploid (ploidy=1), diploid (ploidy=2), and so on. Genotype calls appear in VCF files as a series of slash-delimited digits (0-3 or possibly more). Each digit refers either to the reference state (0) or to one of the comma-delimited variant states. For example, when ploidy=1, a genotype call of '0' indicates "same as reference" and a call of '1' indicates that the state is the first of the comma-delimited ALT values. When ploidy=2, a call of 0/0 means both alleles are the same as the reference, 1/1 means both alleles are the first variant, and 0/1 is a heterozygous state.
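To make the encoding concrete, here is a minimal Python sketch of how a genotype call maps onto the REF and ALT values. The function names `parse_genotype` and `resolve_alleles` are hypothetical and are not part of vcfSampleCompare. The sketch also accepts `|` (the VCF separator for phased calls) and `.` (a missing allele), which the description above does not mention.

```python
import re


def parse_genotype(gt):
    """Split a VCF GT string such as '0/1' into a list of allele indices.

    A '.' (missing allele) is returned as None.
    """
    return [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]


def resolve_alleles(gt, ref, alt):
    """Translate allele indices into states: 0 is REF, 1..n index into ALT."""
    states = [ref] + alt.split(",")
    return [None if i is None else states[i] for i in parse_genotype(gt)]


print(resolve_alleles("0", "A", "C,T"))    # haploid, same as reference
print(resolve_alleles("0/1", "A", "C,T"))  # diploid, heterozygous
print(resolve_alleles("1/1", "A", "C,T"))  # diploid, both alleles are the first variant
```

The number of indices returned by `parse_genotype` is the ploidy implied by the call.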

Setting ploidy should affect the usage of genotype calls in the following way:

  • Raise an error if the ploidy inferred from the genotype calls in the data differs from the ploidy supplied to the script on the command line.
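The check could look roughly like the following Python sketch. `PloidyMismatchError` and `check_ploidy` are hypothetical names used for illustration only, and the sketch assumes that ploidy is inferred by counting the alleles in the genotype string.

```python
import re


class PloidyMismatchError(ValueError):
    """Raised when a genotype call does not match the requested ploidy."""


def check_ploidy(gt, expected_ploidy):
    """Return the ploidy implied by a GT string, or raise on a mismatch."""
    observed = len(re.split(r"[/|]", gt))
    if observed != expected_ploidy:
        raise PloidyMismatchError(
            f"genotype '{gt}' implies ploidy {observed}, "
            f"but --ploidy {expected_ploidy} was supplied"
        )
    return observed


check_ploidy("0/1", 2)  # passes silently
try:
    check_ploidy("0", 2)
except PloidyMismatchError as err:
    print(err)
```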

Setting ploidy should affect allelic frequency in the following way:

  • Separation gap calculation should change
    • ploidy 1 would cause the calculation/comparison to be abs(AO1/DP1 - AO2/DP2) >= gap, where 1 and 2 denote the two samples being compared
    • ploidy 2 would cause the calculation/comparison to be
      • Closest distance to 0.0, 0.5, or 1.0 for each sample
      • abs(AO1/DP1 - AO2/DP2) >= gap/2 and (distance_to_closest(AO1/DP1,(0,0.5,1)) + distance_to_closest(AO2/DP2,(0,0.5,1))) <= 1/(gap/2), with the separation set to 0 if both samples are closest to the same value
    • ploidy 3: same as ploidy 2, but with "3" in place of "2" and the values (0, 0.33, 0.67, 1)
    • ploidy n...
    • Instead of distance_to_closest(), I could use the distance to the genotype call
  • The higher the ploidy, the greater the chance for noise, so the gap should be reported as if ploidy is 1, the sort should be as if ploidy is 1, and the filter should be based on the actual ploidy (which should cause less to be filtered the higher the ploidy).
  • Sort first on genotype call, then on separation gap
  • Perhaps I can determine significance given all the data
  • Sub-sort on the allelic frequency difference (genotype difference being the primary sort). If the ploidy is wrong, ignore the genotype in the sort.
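As a rough illustration of the ploidy-aware separation gap, here is a Python sketch under stated assumptions. All function names are hypothetical. It generalizes the gap/2 threshold for ploidy 2 to gap/ploidy, treats two samples whose frequencies are closest to the same expected value as not separated, and leaves out the summed distance_to_closest() condition because its exact form is still open. It also assumes DP is greater than 0 for both samples.

```python
def expected_frequencies(ploidy):
    """Expected allelic frequencies, e.g. 0, 0.5, 1 for ploidy 2."""
    return [k / ploidy for k in range(ploidy + 1)]


def closest_level(freq, ploidy):
    """Expected frequency nearest to an observed AO/DP value."""
    return min(expected_frequencies(ploidy), key=lambda lvl: abs(freq - lvl))


def passes_gap(ao1, dp1, ao2, dp2, gap, ploidy=1):
    """Decide whether two samples are separated by at least the gap.

    Assumption: the threshold scales as gap/ploidy, and for ploidy > 1
    two samples nearest to the same expected frequency count as
    unseparated (separation set to 0).
    """
    f1, f2 = ao1 / dp1, ao2 / dp2
    if ploidy > 1 and closest_level(f1, ploidy) == closest_level(f2, ploidy):
        return False
    return abs(f1 - f2) >= gap / ploidy


# Frequencies 0.1 and 0.5 with a gap of 0.6:
print(passes_gap(1, 10, 5, 10, gap=0.6, ploidy=1))  # filtered at ploidy 1
print(passes_gap(1, 10, 5, 10, gap=0.6, ploidy=2))  # kept at ploidy 2
```

This matches the intent that a higher ploidy filters less: the same pair of samples fails the haploid threshold but passes the diploid one.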