Copy number output in the MC Tag in Info column #20

Open
Lionward opened this issue Dec 1, 2023 · 1 comment

Lionward commented Dec 1, 2023

Hi, I was testing your tool on a sample and wanted to check the copy number in a specific region.
If I understood correctly, the MC tag in the info field corresponds to the copy number of the motif in the region:
[image: grafik]

In this region, as in many other regions, the MC tag indicates that both alleles have the same copy number, 75. However, as is clear from the sequence, the copy number should be around 29 if we count all occurrences of A in this sequence.
75 is, however, the length of the region, so I was wondering whether there is an explanation for this output.

Regards

Lionward changed the title from "Copy number output in the MC column" to "Copy number output in the MC Tag in Info column" on Dec 1, 2023
egor-dolzhenko (Collaborator) commented

Thank you for the question! We actually had many discussions about how to properly handle cases like this.

  • The current version of TRGT takes a very simplistic approach and assumes that the entire region must be matched to the specified motif. Because of this, it reported a motif count of 75 with a low repeat purity score of 0.386667 (the AP field). Note that 75 × 0.386667 ≈ 29. This approach is reasonable for relatively pure repeats, such as the known pathogenic repeats, but it can definitely be misleading for repeats whose region includes flanking sequence that does not match the specified motif at all, as in your example.

  • We are re-working the motif-counting algorithm in TRGT to report more sensible counts. For example, the new algorithm should recognize that only the stretch of As in the middle of your sequence is the A homopolymer and report its size in the MC field. Does this sound reasonable? If so, would you be interested in testing the pre-release version of TRGT in which this algorithm is implemented? The binaries are attached. It would be very useful to hear your feedback.

trvz-v0.6.1-hmm-switch-linux_x86_64.gz
trgt-v0.6.1-hmm-switch-linux_x86_64.gz
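
The arithmetic in the reply above can be checked with a short Python sketch. It assumes, based only on the numbers quoted in this thread, that AP is the fraction of the region that matches the motif, so that MC × AP approximates the number of matching bases. The sequence and the helper names (`matched_bases`, `longest_motif_run`) are hypothetical; the sequence is a toy stand-in for the one in the screenshot, and the run finder only illustrates the idea of reporting the pure stretch of the motif. It is not TRGT's algorithm.

```python
import re

def matched_bases(motif_count, purity):
    """Approximate number of motif-matching bases, assuming AP = matching / total."""
    return round(motif_count * purity)

def longest_motif_run(seq, motif):
    """Return (start, end, copies) of the longest perfect tandem run of `motif`."""
    best = (0, 0, 0)
    for m in re.finditer(f"(?:{re.escape(motif)})+", seq):
        copies = (m.end() - m.start()) // len(motif)
        if copies > best[2]:
            best = (m.start(), m.end(), copies)
    return best

# Hypothetical 75 bp region: 23 bp flank, 29 As, 23 bp flank (made-up data).
seq = "CTG" * 7 + "CT" + "A" * 29 + "GTC" * 7 + "GT"

print(len(seq))                          # region length reported as MC today
print(round(seq.count("A") / len(seq), 6))  # purity under the assumption above
print(matched_bases(75, 0.386667))       # MC * AP from the thread
print(longest_motif_run(seq, "A"))       # the stretch a revised count might report
```

On this toy input, the whole-region count is 75 while the pure A stretch has 29 copies, which is the discrepancy described in the question.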
