Reference not detected if lowercase characters #64

Open
BiKC opened this issue Jun 9, 2020 · 4 comments

@BiKC

BiKC commented Jun 9, 2020

In v1.2.0, if the reference base was "a" and the GTC said "A", this was not a problem. In v1.2.1, however, that is no longer the case, and it produces records like:
10 779284 GSA-rs2486591 g A,G . PASS

V1.2.0 would have been:
10 779284 GSA-rs2486591 G A . PASS
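A quick way to find affected records is to scan the VCF for REF values that contain lowercase bases. The following is only a minimal sketch: the function name `lowercase_ref_records` is hypothetical and not part of GTCtoVCF, and it assumes whitespace-separated columns in the standard CHROM, POS, ID, REF order.

```python
def lowercase_ref_records(vcf_lines):
    """Yield (chrom, pos, id, ref) for records whose REF has lowercase bases.

    Hypothetical helper for illustration; not part of GTCtoVCF.
    """
    for line in vcf_lines:
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, pos, rec_id, ref = fields[:4]
        # A REF that changes when uppercased contains lowercase characters.
        if ref != ref.upper():
            yield chrom, pos, rec_id, ref


# The two records quoted in this issue (v1.2.1 output, then v1.2.0 output).
records = [
    "10 779284 GSA-rs2486591 g A,G . PASS",
    "10 779284 GSA-rs2486591 G A . PASS",
]
for hit in lowercase_ref_records(records):
    print(hit)
```

Only the first record is reported, since its REF is the lowercase "g".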

@jjzieve
Contributor

jjzieve commented Jun 9, 2020

Ok, thanks for bringing this to our attention. I will look into it.

@jjzieve
Contributor

jjzieve commented Jun 17, 2020

I'm having trouble reproducing this. Your reference genome.fa has lowercase characters, is that correct?
e.g.

>1
atcg...
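To answer that question for a given reference, one could count the lowercase bases per contig. This is a rough sketch using only the standard library; `count_lowercase_bases` is a hypothetical name, and the tiny in-memory FASTA is made-up example data rather than the genome from this issue.

```python
import io


def count_lowercase_bases(fasta_handle):
    """Return {contig_name: number_of_lowercase_bases} for a FASTA stream.

    Hypothetical helper for illustration; not part of GTCtoVCF.
    """
    counts = {}
    contig = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first token after '>'.
            tokens = line[1:].split()
            contig = tokens[0] if tokens else ""
            counts[contig] = 0
        elif contig is not None:
            counts[contig] += sum(1 for base in line if base.islower())
    return counts


# Made-up example: contig "1" is partly lowercase, "MT" is all uppercase.
example = io.StringIO(">1\natcgATCG\n>MT\nGATTACA\n")
print(count_lowercase_bases(example))
```

For a real file, pass an open file handle (for example `open("genome.fa")`) instead of the `StringIO` object.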

Also, which product are you running? GSA version 3?

BiKC changed the title from "Reference not detected if small letters" to "Reference not detected if lowercase characters" Jun 24, 2020
@BiKC
Author

BiKC commented Jun 24, 2020

Indeed version 3, GRCh37. We do use a custom genome file, however, that only contains the major chromosomes and the mitochondrion.

The problem we have is that the genome has both uppercase and lowercase characters, and (I think) when one of the lowercase letters is compared to the GTC files, which have uppercase characters, the two are not treated as the same base. This causes the problem in the VCF file, which in turn causes problems further downstream in our pipeline.

agcaaaaagggcctctctgaacagattctcatgctgcctgctatgtcagg agtaagcaccttctttgtctctgactcaggagtctcaggtcatgctacca tcatttatgaagttgtgattgctgaacatgttagattgcaaacgagtaaa caggtcagaccctttacTAAGTTGATACCACTTAATTGCATTCTGAATTC CTTGTTCTGCAACACTTCAAATGACAGAGGTTTCAGCCTCCAGCTAGATA TGGACTCTTAAAAAATGTCCTAATCAGAATTCTGTAGACTCTTTTACaca gaattctgggtacaaacatcctctgtactcagaactttgaatgtacgtgt atattgtctcctggtactggtgctgaggatgaggattccagaggcttact attcttttcctgatgtcctttaggtctgtttgttaaagcttttattgttt tcctcctggatgctttctggtctcctgttttgtacgtggtcttatgcaat
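The suspected mechanism can be illustrated with a small sketch. This is not GTCtoVCF's actual code; both function names are hypothetical, and the only assumption is that the converter decides REF versus ALT by comparing each called allele to the reference base as a string.

```python
def ref_alt_case_sensitive(reference_base, called_alleles):
    """Naive split: any allele not identical to the reference string is an ALT."""
    alts = []
    for allele in called_alleles:
        if allele != reference_base and allele not in alts:
            alts.append(allele)
    return reference_base, alts


def ref_alt_normalized(reference_base, called_alleles):
    """Same split, but both sides are uppercased before comparing."""
    ref = reference_base.upper()
    alts = []
    for allele in called_alleles:
        allele = allele.upper()
        if allele != ref and allele not in alts:
            alts.append(allele)
    return ref, alts


# Soft-masked reference base "g", with alleles A and G on the array.
print(ref_alt_case_sensitive("g", ["A", "G"]))  # ('g', ['A', 'G'])
print(ref_alt_normalized("g", ["A", "G"]))      # ('G', ['A'])
```

The case-sensitive result corresponds to the `g A,G` record reported for v1.2.1, and the normalized result corresponds to the `G A` record expected from v1.2.0.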

This is a screenshot of the output we had in version 1.2.0 (left) and 1.2.1 (right) with the exact same steps.
MicrosoftTeams-image (1)

Here is another example of the issue:
MicrosoftTeams-image (2)

Here is an example of the error we get further downstream with an imputation tool called Beagle:
MicrosoftTeams-image (3)

@jjzieve
Contributor

jjzieve commented Jul 9, 2020

You're correct that the bug exists if the reference genome is lowercase. However, I still wasn't able to reproduce it ever working when reverting to 1.2.0 or even older versions. Is it possible that a different FASTA file was used? Can you confirm whether the branch I just pushed fixes the issue? https://github.com/Illumina/GTCtoVCF/tree/bug/fix-lowercase-ref-genome
