-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ignores missing entries in .Rtab files #158
base: master
Are you sure you want to change the base?
Conversation
Regarding unit testing: After the change, all tests passed. Should we / I write new tests or rather change the |
Thanks for the PR; maybe making a new Rtab file with missing values and making sure those are handled correctly in an ad-hoc unit test? Happy to prepare it if you can prepare a simple example file. |
Sorry for the delay! I added a simple unit test now. |
When looking at the code I realised that missing values are always replaced by |
Thanks, I'll review the changes shortly and merge. I checked again and we do ignore the samples with missing values (see here: Line 427 in 1359b61
I think we may want to assume that an empty "cell" also means that the gene has missing information on its status, other than the |
Ah wait I think we might have had a misunderstanding (my fault really!). How we handle missing variants in VCF files is a bit convoluted because we have to handle corner cases due to burden testing. We do assign a missing value if a variant is missing from the VCF file ( Line 479 in 1359b61
So I think we should not merge this, unless it is not working as intended as it is (i.e. the Sorry for not thinking this through earlier! |
I see. I can't say that I understand all the edge cases and how they are handled in the code. However, there is also Line 486 in 1359b61
which removes samples from the dict if the second haplotype is also '.' (which should always be the case in bacterial VCFs). The change in cfed703 should do the same thing for .Rtab files, but I don't know how this relates to burden testing etc.
|
Missing data is difficult, and in my opinion is something that's more of a user consideration. But it is expected in many cases. I think the desired pyseer behaviour would be:
Although I'm open to suggestions! Using If not, we should do a simple fill. plink has a similar option for missing data e.g. use the reference, use the major allele, use the minor allele. In our case I think we just want use major allele or use reference (0). We could either just pick one of these, or we could add as a command line option. |
No description provided.