strange TSS distance for very similar genomic identity sequences #50

Open
jianshu93 opened this issue Jan 4, 2022 · 3 comments

Comments

@jianshu93

Dear TSS team,

I am testing TSS on real-world bacterial genomes. There are 8 of them, and all are somewhat fragmented (e.g., 1-30 fragments). This is a good test of whether TSS is robust to global mutation, because our sequencing experiments do not yield a complete circular genome for each bacterial isolate. All eight are Shewanella baltica genomes, and the average nucleotide identity (ANI, https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000760?crawler=true) among them is very high, 95%-99%, meaning they are very similar. I want to see how TSS performs on closely related sequences with global mutation, compared with a k-mer-based method. The Spearman rank correlation coefficient between the k-mer-based distance and ANI is very high (1, actually), but the correlation between the TSS distance and ANI is very poor. I have attached the figure of TSS vs. ANI and all the concatenated Shewanella baltica genomes (each in one piece) for you to test.
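For readers who want to reproduce this kind of comparison, here is a minimal sketch of a Spearman rank correlation between ANI and a distance measure, using only the Python standard library. The numbers are hypothetical placeholders, not the values from the attached files, and the names `ani`, `kmer_dist`, and `sketch_dist` are illustrative assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-pair values (one entry per genome pair), for illustration only.
ani = [99.0, 98.5, 97.2, 96.0, 95.1]
kmer_dist = [0.010, 0.016, 0.030, 0.041, 0.052]
sketch_dist = [0.20, 0.35, 0.18, 0.40, 0.31]

print("k-mer distance vs ANI: ", spearman(ani, kmer_dist))
print("sketch distance vs ANI:", spearman(ani, sketch_dist))
```

Because ANI is a similarity and the other quantities are distances, perfect rank agreement shows up as a coefficient of -1 here; comparing against `100 - ANI` instead flips the sign. Rank correlation is unchanged by any monotonic rescaling of either variable, so a low value points to order violations rather than to mere non-linearity.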

The command I use for running TSS is:

./sketch -i S_Baltica_new -m TSS -f fasta -o S_Baltica_new_TSS_triangle

From the figure, TSS seems to vary a lot even where the ANI and k-mer-based values are quite consistent. Do you have an explanation for this? Does TSS lose resolution for closely related sequences (where k-mer-based methods work very well) and only work for divergent ones? If so, how can I benefit from both?

Thanks,

Jianshu

S_Baltica_new.zip

ANI_TSS_S_Baltica.pdf.zip

@ajoudaki
Collaborator

ajoudaki commented Jan 14, 2022

Dear Jianshu,

I can make three general comments about your results.

  1. TSS outperforms other sketching methods for distantly related genomes rather than for very similar pairs, so that advantage does not apply at very high ANI.
  2. The TSS distance is not a linear function of ANI; therefore, differences in TSS distance do not reflect a similar drop or increase in ANI.
  3. You can increase the sketch dimension to --embed_dim=30 to get more accurate sketches. I've attached the results I get with this modification: S_Baltica_new_TSS_triangle.zip

That being said, if you can share the script to reproduce the figure, I might be able to add a non-linear transformation that will suit your problem.

@jianshu93
Author

Hello Amir,

See the attached files. I plotted your TSS distances against ANI, but the relationship is still not monotonic (I was not expecting it to be linear). It is still bad for similar genomes. Maybe an even larger embedding dimension would help?
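One way to make "not monotonic" concrete is to list the genome pairs whose order by distance contradicts their order by ANI. The sketch below does this with the standard library only; the values and the labels `p1` to `p5` are hypothetical placeholders, not data from the attached archive.

```python
from itertools import combinations


def discordant_pairs(ani, dist, labels):
    """Return label pairs whose distance ordering contradicts their ANI ordering."""
    bad = []
    for i, j in combinations(range(len(ani)), 2):
        # A higher ANI should come with a smaller distance. If both differences
        # have the same sign, the two entries are ordered inconsistently.
        if (ani[i] - ani[j]) * (dist[i] - dist[j]) > 0:
            bad.append((labels[i], labels[j]))
    return bad


# Hypothetical values, one entry per genome pair, for illustration only.
labels = ["p1", "p2", "p3", "p4", "p5"]
ani = [99.0, 98.5, 97.2, 96.0, 95.1]
sketch_dist = [0.20, 0.35, 0.18, 0.40, 0.31]

for a, b in discordant_pairs(ani, sketch_dist, labels):
    print(f"order violated between {a} and {b}")
```

An empty result means the distance is a monotonically decreasing function of ANI on the sampled pairs, in which case a non-linear rescaling could in principle recover ANI. Any listed violation cannot be repaired by such a rescaling.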

Archive.zip

Thanks,

Jianshu

@jianshu93
Author

Hello team,

Is there any update on the question I raised above, about the case where closely related and distantly related genomes are present at the same time?

Thanks,

Jianshu
