The calculation method of AP field #52

Open
xbz17 opened this issue Dec 14, 2024 · 3 comments

Comments

@xbz17

xbz17 commented Dec 14, 2024

Hello,
Thanks for your wonderful tool, TRGT. I'm currently using version 1.1.1 with the TR catalog from Platinum Tandem Repeats, and I have a question about the AP field of the output VCF.
One of the TR results looks like this:
image

sequence:
GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCTCCCCTCATCACCTCCCCAGCCAC

and the plot:
Pasted image 20241214122956

The AP of this TR is "0.145455,0.145455", which is very low compared with most other TRs. The motif ACCC occurs as two adjacent copies in each of two places, and these two places are separated by a stretch that is long relative to the motif length.
I would like to know how the AP is calculated here. 0.145455 looks like the result of 4*4 (four copies of the 4 bp motif ACCC) / 110 (length of the whole sequence). Should the user also be warned in the output that there is a big break in the repetition of this TR? There are other STR results where every part of a long sequence that matches the motif is retrieved even though those parts are not actually "tandem", and these result in low AP values as well.
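
The arithmetic in the question above can be checked with a short sketch. The function names are hypothetical, and the code only reproduces the ratio guessed in the question (motif-matching bases divided by total allele length). It is not confirmed to be how TRGT computes AP.

```python
def count_copies(seq: str, motif: str) -> int:
    """Count non-overlapping exact copies of `motif` in `seq`."""
    return seq.count(motif)


def motif_fraction(num_copies: int, motif: str, allele_length: int) -> float:
    """Fraction of the allele covered by exact motif copies (a guess, not TRGT's formula)."""
    return num_copies * len(motif) / allele_length


# A fragment taken from the sequence in the question: two adjacent ACCC copies.
fragment = "GGACCCACCCTGG"
print(count_copies(fragment, "ACCC"))

# Numbers quoted in the question: 4 copies of a 4 bp motif in a 110 bp allele.
print(f"{motif_fraction(4, 'ACCC', 110):.6f}")
```

With the numbers quoted in the question, the ratio formats to the same six decimal places as the reported AP value.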

@egor-dolzhenko
Collaborator

Thank you for the question. The AP / purity field is meant to indicate how close an allele sequence is to being a perfect repeat composed of the specified motif(s). (The actual algorithm is based on computing the edit distance between the given sequence and the corresponding perfect repeat.) It sounds like your understanding is correct: When the purity is low, the allele sequence contains a small number of perfect motif copies relative to its length. This can occur in several ways: the allele can contain a few perfect motif copies with the rest of the sequence not matching the motif at all; or there could be many imperfect motif copies scattered throughout the repeat sequence. The information about the location of these matches can be found in the MS field (described here). I agree that it would be convenient to add additional output fields that summarize different repeat configurations (especially for low purity repeats like in your example). This is something that we are continuing to work on. Did I answer your question?
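
To make the edit-distance idea concrete, here is a minimal sketch. It is an illustration under stated assumptions, not TRGT's implementation: the "perfect repeat" is taken to be the motif tiled to the same length as the allele, and the score is normalized as 1 - distance / length. Both choices, and all function names, are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def perfect_repeat(motif: str, length: int) -> str:
    """Tile `motif` and cut it to `length` bases (assumed reference repeat)."""
    copies = -(-length // len(motif))  # ceiling division
    return (motif * copies)[:length]


def purity_sketch(seq: str, motif: str) -> float:
    """Assumed normalization: 1 - edit distance / allele length."""
    if not seq:
        return 1.0
    return 1.0 - edit_distance(seq, perfect_repeat(motif, len(seq))) / len(seq)


print(purity_sketch("CAGCAGCAG", "CAG"))  # perfect repeat
print(purity_sketch("CAGCATCAG", "CAG"))  # one substitution in 9 bases
```

Under this sketch, a sequence that mostly does not match the motif is far from the tiled repeat and scores low, which is consistent with the behaviour described above. The tiling always starts at the first base of the motif, so an allele that begins mid-motif would be penalized here.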

@xbz17
Author

xbz17 commented Dec 20, 2024

Dear Egor,
Thanks for your explanation. It is true that TRs with low AP values fall into the two situations you mentioned above: scattered motif copies with a lot of other sequence between them, or relatively perfect motif runs with long flanking sequences, like this:
image
But I have observed that the latter situation is often caused by the given TR catalog. Because of the positions in the second and third columns of the input BED file, TRGT outputs unnecessary flanking sequence in the sequence part of the VCF, and the AP calculation takes the whole length of that sequence into account. I think we could search for the motif first and then determine the total length L used in the AP calculation from the leftmost and rightmost positions in the MS field. For example, the AP value of this TR would then be (16-1)/16.
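
The suggestion above could be sketched as follows. This is a hypothetical illustration: it assumes MS values look like `0(22-30)_0(72-80)`, that is, a motif index followed by a 0-based, half-open span, with spans joined by underscores and not overlapping. The example value and the function names are made up for this sketch.

```python
import re

# Assumed MS span syntax: motif_index(start-end)
_SPAN = re.compile(r"(\d+)\((\d+)-(\d+)\)")


def parse_ms(ms: str):
    """Return a list of (motif_index, start, end) tuples; '.' means no spans."""
    if ms in ("", "."):
        return []
    return [(int(m), int(s), int(e)) for m, s, e in _SPAN.findall(ms)]


def trimmed_bounds(spans):
    """Leftmost start and rightmost end over all motif spans, or None."""
    if not spans:
        return None
    return min(s for _, s, _ in spans), max(e for _, _, e in spans)


def motif_covered_fraction(spans):
    """Share of the trimmed region that lies inside a motif span."""
    bounds = trimmed_bounds(spans)
    if bounds is None or bounds[1] == bounds[0]:
        return None
    start, end = bounds
    covered = sum(e - s for _, s, e in spans)
    return covered / (end - start)


spans = parse_ms("0(22-30)_0(72-80)")  # hypothetical MS value
print(trimmed_bounds(spans))
print(motif_covered_fraction(spans))
```

The bounds give the length L proposed above. A purity score could then be recomputed on the sequence between those bounds, so that flanking sequence introduced by the catalog coordinates no longer counts.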

@egor-dolzhenko
Collaborator

Thanks for the suggestion! Yes, currently the AP field is based on the entire allele sequence, even if most of it does not match the specified repeat motif(s). We were thinking about developing a small add-on tool for TRGT that would report per-motif stats, including purity (which will be very similar to your description). Do you think this would be a good way to address your suggestion?
