An issue about complex STRs #19

Open
fjmuzengyiheng opened this issue Jan 7, 2023 · 12 comments

Comments

@fjmuzengyiheng

fjmuzengyiheng commented Jan 7, 2023

Hi @readmanchiu, sorry to bother you again.
I am using this first-rate tool for STR counting in my neurogenetic patients.

Here is one issue I want to report:

There is a disease named "CANVAS (Cerebellar ataxia, neuropathy, and vestibular areflexia syndrome)", which is caused by an expansion of an (AAGGG)n repeat in the RFC1 gene. (https://omim.org/entry/102579?search=RFC1&highlight=rfc1)

The tricky part is:

  1. The reference sequence is (AAAAG)n for this locus (hg38, 4:39,348,424-39,348,483).
  2. There is an (AAGGG)n expansion in my data (ONT), which I confirmed by manual inspection in IGV (attached below).
  3. When I use the bed file below, Straglr outputs nothing.

[bed file]
4 39348424 39348483 AAAAG

I do not want to give up this tool, given its outstanding performance. Can Straglr deal with such "complex STRs" (loci where the motif has changed)? I would be pleased if Straglr could handle this situation; that would make it SUPER perfect.
Thank you!

[attached file1: IGV visualization for my patient's ONT data]

[attached file2: 3510bp insertion in the first line]

AGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGCGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGCAATACAGAAGAAGAAGTAATACAGAAGGAAGGAAGGAAGGGAAGGGAAGGAAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAA
GGAAGGAAGGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGCAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAAGGAAGGAAGGGAAGGGAAGGGAAGGGAGGAAGGGAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGCGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG

@readmanchiu
Collaborator

Hi @fjmuzengyiheng,

Thanks for waiting; a new version is now available.
Please give it a try and see whether it has any problem genotyping this event.
Please let me know.

@fjmuzengyiheng
Author

fjmuzengyiheng commented Feb 6, 2023

> Hi @fjmuzengyiheng,
>
> Thanks for waiting; a new version is now available. Please give it a try and see whether it has any problem genotyping this event. Please let me know.

Thank you, that is kind of you. I will try this version as soon as possible. Thank you again!

@fjmuzengyiheng
Author

Hi, @readmanchiu
I've tested the new version (v1.4.0) of Straglr for genotyping this locus.

When I provided this bed file:
4 39348424 39348483 AAAAG
Straglr output:
[image: screenshot of Straglr output]

When I provided this bed file:
4 39348424 39348483 AAGGG
Straglr output:
[image: screenshot of Straglr output]

The genotyping of this locus is still not quite right.
Would you mind if I sent you my patient's BAM file so you can test this locus? Thank you so much.

@readmanchiu
Collaborator

Have you tried "AAGGG"? It seems that this, rather than AAAAG, is the predominant motif in your sequence.
I've worked with a heterozygous case in which at least the normal allele carries the reference motif (and the expansion carries the different one).
Homozygous cases, where only the non-reference allele is present, are trickier.
But please send me the BAM file by e-mail (or tell me in the e-mail how I can access it); I'm more than happy to tackle it.
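
One quick way to check which motif is predominant in a sequence is to count every 5-mer window and pool rotations of the same repeat unit (AAGGG, AGGGA, GGAAG and so on describe the same tandem repeat). The sketch below is only an illustration under that assumption. It is not Straglr's own motif detection, and the function names and the toy sequence are made up for the example.

```python
from collections import Counter


def canonical_rotation(motif: str) -> str:
    """Return the lexicographically smallest rotation of a motif."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_counts(seq: str, k: int = 5) -> Counter:
    """Count every k-mer window in seq, pooling rotations of the same motif."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[canonical_rotation(seq[i:i + k])] += 1
    return counts


# Toy sequence (NOT patient data): 20 copies of AAGGG followed by 3 copies of AAAAG.
toy = "AAGGG" * 20 + "AAAAG" * 3

top_motif, top_count = motif_counts(toy).most_common(1)[0]
print(top_motif, top_count)
```

Running the same count on the inserted sequence from a read gives a rough idea of which motif to put in the bed file.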

@readmanchiu
Collaborator

Issue followed up through private communication

@ljohansson

ljohansson commented Jul 9, 2024

I was wondering how this topic continued. I am struggling with exactly the same gene, and my CRAM file looks very similar to the one in this topic. However, there are two alleles: one has ~470 inserted bases with the sequence AAAAG, sometimes interrupted by AAAG. The second has ~1800 inserted bases with an AAGGG pattern. However, in the TSV (two example lines below) both show the AAAAG pattern.

#chrom start end repeat_unit genotype read copy_number size read_start strand allele
chr4 39348424 39348485 AAAAG 533.4(4);10.8(17) bc090253-ae5f-42a1-9eea-c2cd914df67b 528.0 2640 5189 - 533.4
chr4 39348424 39348485 AAAAG 533.4(4);10.8(17) 5a0099fe-1aed-46b0-b7cf-c9543ac7e98e 11.2 56 4253 - 10.8

I am using the Straglr implementation from the molgenis fork, https://github.com/molgenis/straglr, which is based on the philres fork, so it may have diverged somewhat from your repository. However, because this topic has already been discussed here, I am asking the question here as well. In the input catalogue we entered AARRG as the motif, with R being the IUPAC code for A or G, so I would expect it to pick up both patterns.
How is the pattern in column 4 of the TSV determined? Is it based on the sequence given as input or on the actual sequences? And does Straglr support two different motifs in a heterozygous situation?
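
To make the AARRG expectation concrete, the sketch below expands an IUPAC motif into the plain motifs it stands for. The IUPAC table is standard, but whether Straglr or the fork expands catalogue motifs this way is an assumption, and `expand_iupac` is a hypothetical helper written only for this illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_iupac(motif: str) -> list[str]:
    """List every plain A/C/G/T motif covered by an IUPAC motif."""
    choices = (IUPAC[base] for base in motif.upper())
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_iupac("AARRG")
print(expanded)
print("AAAAG" in expanded, "AAGGG" in expanded)
```

Under this reading AARRG covers four motifs, including both AAAAG and AAGGG.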

Thank you!

@readmanchiu
Collaborator

Hi @ljohansson,
I encourage you to try our latest version (v1.5.0).
There is a column called "actual_repeat" in the TSV that shows the repeat actually detected.
For example, here is a heterozygous RFC1 locus, where one allele has the reference size and motif (AAAAG) and the second is a multi-kb expanded AAGGG allele:

#chrom	start	end	target_repeat	locus	coverage	genotype	read_name	actual_repeat	copy_number	size	read_start	strand	allele	read_status
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	2aad9f25-e3d6-4674-b1a4-09a713d2f569	AAGGG	666.6	3333	15098	-	3234.2	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	977a4ec7-e32a-4ae1-ba34-924dbfaee1a6	AAGGG	654.0	3270	252	-	3234.2	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	b895678a-9756-4869-8a1b-69cd9f27e0a8	GGAAT	634.4	3172	4440	-	3234.2	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	23306847-fbf6-46f9-ae1d-aa8234dfab2a	GGAAG	632.4	3162	14615	+	3234.2	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	cb07287d-7c29-47bc-925e-03a894bb0c38	AAGGG	586.8	2934	16779	+	3234.2	partial
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	73ebd51f-e4f0-494c-beef-f8f65b574e53	GGAAG	234.2	1171	6560	-	3234.2	partial
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	cbd803a8-930e-4124-b19a-2944f672502a	GAAGG	176.8	884	19877	-	3234.2	partial
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	7e73b5eb-c666-4361-b060-a7aff15f9728	AAAGA	14.6	73	3916	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	f8757228-839d-42c6-81a8-267bcc84da0c	AAAGA	12.2	61	3988	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	9075dc56-c51b-41d5-ad59-e58c0ed052c3	AAAGA	12.0	60	5829	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	9c7c4d99-a460-477e-9c89-d37a0ce36db0	AAAGA	11.8	59	1075	+	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	cd3433cb-6176-4e9d-a616-58c02c1d5e69	AAAGA	11.6	58	14957	+	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	eccf6e56-b0d3-40c3-b39f-96480f8023d6	AAAGA	11.6	58	16701	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	b522eecd-0451-4fb1-8c43-45c9c80c0a12	AAAGA	11.6	58	103	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	1919773e-32e4-49c5-84e5-c687a4befdc6	AAAGA	11.4	57	2638	-	57.8	full
chr4	39348425	39348483	AAAGA	chr4:39348425-39348483	25	3234.2(7);57.8(16)	b6067f7a-6582-4381-bbf3-b666d3f649a1	AAAGA	11.4	57	13629	-	57.8	full
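
A per-read table like the one above can be summarised per allele with a few lines of Python. This is a minimal sketch, not part of Straglr: it uses only four of the columns, shortens the read names, and pools rotations of the same repeat unit (an assumption made here so that AAGGG, GGAAG and GAAGG count as one motif).

```python
import csv
import io
from collections import Counter, defaultdict


def canonical_rotation(motif: str) -> str:
    """Return the lexicographically smallest rotation of a motif."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


# Rows abridged from the table above: read names are shortened and only the
# columns used here are kept.
TSV = "\n".join([
    "read_name\tactual_repeat\tcopy_number\tallele",
    "2aad9f25\tAAGGG\t666.6\t3234.2",
    "977a4ec7\tAAGGG\t654.0\t3234.2",
    "b895678a\tGGAAT\t634.4\t3234.2",
    "23306847\tGGAAG\t632.4\t3234.2",
    "cb07287d\tAAGGG\t586.8\t3234.2",
    "73ebd51f\tGGAAG\t234.2\t3234.2",
    "cbd803a8\tGAAGG\t176.8\t3234.2",
    "7e73b5eb\tAAAGA\t14.6\t57.8",
    "f8757228\tAAAGA\t12.2\t57.8",
])

# allele label -> Counter of rotation-pooled motifs seen in its reads
motifs_by_allele = defaultdict(Counter)
for row in csv.DictReader(io.StringIO(TSV), delimiter="\t"):
    motifs_by_allele[row["allele"]][canonical_rotation(row["actual_repeat"])] += 1

for allele, counts in motifs_by_allele.items():
    print(allele, dict(counts))
```

For the expanded allele this pooling gives six AAGGG-type reads and one read with a different motif, which agrees with the `AAGGG(6);GGAAT(1)` tally in the VCF record in the same comment.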

and the corresponding VCF:

(base) [rchiu@hpce706 BTL-2139]$ more rfc1.vcf 
##fileformat=VCFv4.2
##fileDate=20240709
##source=StraglrV1.5.0
##reference=/projects/btl/rchiu/hg38.fa
##contig=<ID=chr4,length=190214555>
##INFO=<ID=LOCUS,Number=1,Type=String,Description="Locus ID">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of repeat">
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit in the reference orientation">
##INFO=<ID=REF,Number=1,Type=Float,Description="Reference copy number">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=AL,Number=.,Type=String,Description="Allelic lengths">
##FORMAT=<ID=ALR,Number=.,Type=String,Description="Allelic length ranges">
##FORMAT=<ID=AC,Number=.,Type=String,Description="Allelic copies">
##FORMAT=<ID=ACR,Number=.,Type=String,Description="Allelic copy ranges">
##FORMAT=<ID=AD,Number=.,Type=String,Description="Allelic depths">
##FORMAT=<ID=ALT_MOTIF,Number=.,Type=String,Description="Alternate motif(s)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	.
chr4	39348425	.	AAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAA	AAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGG
AAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGA
AGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAA
GGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGA
AGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGA
AGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAG
GAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAA
GGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAA
GGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAG
GGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGTGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAG
GGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG
AAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAG
GGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG
AAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGA
AGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGA
AGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAA
GGGAAG	.	PASS	END=39348483;RU=AAAGA;REF=11.6	GT:DP:AL:ALR:AC:ACR:AD:ALT_MOTIF	0/1:25:57.8/3234.2:52-73/884-3333:11.6/512.2:10.4-14.6/176.8-666.6:16/7:./AAGGG(6);GGAAT(1)
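
The per-allele values in the sample column can be pulled apart as sketched below. The FORMAT and sample strings are copied from the record above. The assumption that each per-allele field is "/"-separated in the same order as GT is inferred from this one record, not from a specification, and `parse_sample` is a hypothetical helper.

```python
FORMAT = "GT:DP:AL:ALR:AC:ACR:AD:ALT_MOTIF"
SAMPLE = ("0/1:25:57.8/3234.2:52-73/884-3333:11.6/512.2:"
          "10.4-14.6/176.8-666.6:16/7:./AAGGG(6);GGAAT(1)")


def parse_sample(fmt: str, sample: str):
    """Split a Straglr-style sample column into one dict per allele."""
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = []
    # ASSUMPTION: per-allele values are '/'-separated, in GT order.
    for length, copies, depth, motif in zip(
        fields["AL"].split("/"),
        fields["AC"].split("/"),
        fields["AD"].split("/"),
        fields["ALT_MOTIF"].split("/"),
    ):
        alleles.append({
            "length_bp": float(length),
            "copies": float(copies),
            "depth": int(depth),
            # '.' means no alternate motif was reported for this allele
            "alt_motifs": [] if motif == "." else motif.split(";"),
        })
    return fields["GT"], alleles


genotype, alleles = parse_sample(FORMAT, SAMPLE)
print(genotype)
for allele in alleles:
    print(allele)
```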

readmanchiu reopened this Jul 10, 2024
@ljohansson

Hi @readmanchiu. Thank you. This seems like what I am looking for. What would happen in case of two alternative alleles, as in the example? Would both entries end up in the same line in the vcf?

@readmanchiu
Collaborator

They would both be shown in the ALT column, separated by a comma.

@ljohansson

ljohansson commented Jul 11, 2024

Version 1.5.0 worked like a charm. As far as I can tell, all the information I need is in the VCF. However, I have one more question. In my current RFC1 use case there were two alleles, with AAAAG and AAGGG repeats, respectively. I used AA*** in the bed file and captured both sequences. The literature also describes an alternative motif, ACAAG. What would be the best entry in the bed file to capture all three sequences? It seems that A**** is too generic. Could A***G work? Or do you have a better suggestion?
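
The trade-off between the candidate patterns can be checked by brute force. The sketch below assumes that `*` stands for exactly one base of any kind (how Straglr itself interprets `*` is not confirmed in this thread) and counts, for each pattern, which of the three known RFC1 motifs it matches and how many of the 1024 possible pentamers it accepts. `wildcard_to_regex` and `coverage` are hypothetical helpers.

```python
import re
from itertools import product

KNOWN_MOTIFS = ["AAAAG", "AAGGG", "ACAAG"]
ALL_PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """ASSUMPTION: '*' matches exactly one base (A, C, G or T)."""
    body = "".join("[ACGT]" if ch == "*" else ch for ch in pattern)
    return re.compile("^" + body + "$")


def coverage(pattern: str):
    """Return (known motifs matched, number of pentamers matched)."""
    rx = wildcard_to_regex(pattern)
    hits = [m for m in KNOWN_MOTIFS if rx.match(m)]
    total = sum(1 for m in ALL_PENTAMERS if rx.match(m))
    return hits, total


for pattern in ["AA***", "A***G", "A****"]:
    print(pattern, *coverage(pattern))
```

Under this assumption A***G matches all three motifs while accepting 64 pentamers, the same number as AA***, whereas A**** accepts 256.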

@readmanchiu
Collaborator

@ljohansson I am preparing a new release in which you can put a '-' in the target motif field of the bed file to indicate that the detected motif may differ from the target, so a match between the target and detected motifs is no longer required.
This would avoid all the fuss caused by the RFC1 expansions.
I was going to make the release soon, but I want to resolve your new issue about the discrepancy in reported allele lengths first.

@ljohansson

@readmanchiu. Thank you. I am looking forward to the new release.
