Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genotypes in probabilities? #16

Open
Hoeze opened this issue May 22, 2019 · 8 comments
Open

Genotypes in probabilities? #16

Hoeze opened this issue May 22, 2019 · 8 comments

Comments

@Hoeze
Copy link

Hoeze commented May 22, 2019

Hi, how do I retrieve the genotype column ID for genotype[x].compute()["probs"]?

@horta
Copy link
Collaborator

horta commented May 22, 2019

Hi @Hoeze , you can use x to index bgen["variants"] and retrieve the information you want about the x-th variant: bgen["variants"].iloc[x].

@Hoeze
Copy link
Author

Hoeze commented May 23, 2019

Thanks for your fast response @horta.
I tried your suggestion like in the following example:

In [22]: from bgen_reader import *                                                                                                                                                                                                          

In [23]: bgen = read_bgen("complex.bgen")                                                                                                                                                                                                   
Reading samples|===========================================================================================================================================================================================================================|
Mapping variants: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 51590.46it/s]

In [24]: geno = bgen["genotype"][8].compute()                                                                                                                                                                                               

In [25]: v = bgen["variants"].compute().iloc[8]                                                                                                                                                                                             

In [26]: print(v)                                                                                                                                                                                                                           
id                                                
rsid                                            M9
chrom                                           01
pos                                              9
nalleles                                         8
allele_ids    A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
vaddr                                          783
Name: 8, dtype: object

In [27]: print(geno["probs"])                                                                                                                                                                                                               
[[ 1.  0.  0.  0.  0.  0.  0.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  0.  0.  1.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  1.  0.  0.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

In [28]: print(geno["probs"].shape)                                                                                                                                                                                                         
(4, 36)

I already found the allele_ids, but how do I know which column in geno["probs"] contains which genotype?
For example, column 1 could contain A/A, A/G, etc.

In the end, my target is to extract in a certain genomic range the most likely sequence for each individual.

@horta
Copy link
Collaborator

horta commented May 23, 2019

That is the tricky part because BGEN allows for very general genotype. It depends wether you have Unphased genotype, Phased genotypes, number of alleles, ploiyd. The quick start gives a quick idea on how to perform the association between probability and genotype (in particular, read the comments in the code). But the ultimate answer is in the section Per-sample order of stored probabilities of bgen specification.

@Hoeze
Copy link
Author

Hoeze commented May 23, 2019

Hm, I see. How do you solve this problem in your projects?
Do you have a method which calculates it?
Otherwise, I could try to write one...

@horta
Copy link
Collaborator

horta commented May 23, 2019

Have a look at allele_expectation: https://bgen-reader.readthedocs.io/en/latest/expectation.html

@horta
Copy link
Collaborator

horta commented May 23, 2019

But make sure you have unphased genotype: "This function supports unphased genotypes only."

@Hoeze
Copy link
Author

Hoeze commented May 23, 2019

Ah, thank you for the hint, I found what I searched for:

def get_genotypes(ploidy, nalleles):

Would it be possible to have this function publicly exported in genotype?

@horta
Copy link
Collaborator

horta commented May 24, 2019

Sure. Will do it for the 3.1.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants