apparent date of filtered sequences #159

Open
mdbaron42 opened this issue Jul 23, 2021 · 3 comments

@mdbaron42

mdbaron42 commented Jul 23, 2021

Thank you for TreeTime. I am using it to identify and remove excessively divergent (and probably erroneous) sequences from a large dataset of viral sequences. When TT rejects a sequence it prints out a message like this:

112.27 TreeTime: the following tips have been marked as outliers. Their date
constraints were not used. Please remove them from the tree. Their dates
have been reset:

112.27 MG708138/Nigeria/Kanam/2016, input date: 2016.5, apparent date: 2131.04

which is fine, except that I can't work out where this "apparent date" is coming from. I have been through the code, and you appear to be printing the terminal.numdate, but I can't see where that is being set to an estimated date (it seems to start out as a numeric version of the input sample date). I added a few lines of code to write a file with the dates and RTT as used for the root-to-tip regression plot, and the full slope and intercept, and if I use this to calculate an estimated date for the given RTT distance (having set clock-filter=0 to get what I imagine is close to the initial strict clock), the estimated year does not fit with the "apparent date", at least in my hands. I'm obviously missing something.

I need to know, because the program is flagging a few sequences that do not seem to be particularly deviant in the original tree, and are printed out by TT as having very small deviations between input and apparent date, e.g.

112.27 MH880866/Nepal/Dhading/2010, input date: 2010.792, apparent date: 2016.29

112.27 MH304876/India/Gharsana/2018, input date: 2018.192, apparent date: 2023.33

112.27 KX860060/India/Patna/2010/52, input date: 2010.5, apparent date: 2014.84

(in the context of my data set, these are small differences, and the sequences themselves sit nicely in the middle of a group of geographically and temporally similar sequences, so I am struggling a bit to work out why TreeTime is flagging them).
Grateful for any help
Michael

PS Should add that I run TreeTime with the least squares root option, and --branch-mode = "input"
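
For readers following along, the calculation described above (turning a root-to-tip distance into a date with the regression slope and intercept) can be sketched as follows. This is only an illustration of that arithmetic, not TreeTime code: the function name `date_from_rtt` and all numeric values are hypothetical.

```python
def date_from_rtt(rtt_distance, slope, intercept):
    """Invert the clock regression rtt = slope * date + intercept.

    slope is the clock rate (substitutions per site per year) and
    intercept is the regression intercept, both taken from the fit.
    """
    if slope <= 0:
        raise ValueError("slope must be positive to convert a distance to a date")
    return (rtt_distance - intercept) / slope


# Hypothetical values, chosen only to illustrate the arithmetic.
example_slope = 1.0e-3
example_intercept = -1.95
example_rtt = 0.0665

print(round(date_from_rtt(example_rtt, example_slope, example_intercept), 2))
```

As the later replies in this thread explain, a date obtained this way is not expected to match the "apparent date" that TreeTime prints.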

@rneher
Member

rneher commented Jul 27, 2021

Hi Michael,

it is a little hard to diagnose without seeing an RTT plot. You could try to remove the big outlier and rerun with --clock-filter 0 to switch off the outlier detection. If you have sufficient temporal signal, --reroot least-squares is recommended (which is also the default).

@mdbaron42
Author

Hi Richard,

as I said, I am already running with --reroot least-squares (i.e., I am using the default). I am sorry, I have possibly not been clear in my question, which was how you calculate the value of "apparent date" that the program prints out.

I have ended up running my analyses with --clock-filter 0, and getting the program to write a file of all the RTT distances and residuals so that I can then calculate the filter myself to see what is going on. I find that, for example, a filter of 3 * IQD identifies the same suspect sequences as TreeTime does (when run with CF=3), so I appear to be understanding how the filter works :-). However, if I calculate an apparent date from the RTT distance and the clock model slope and intercept, I get very different values to the "apparent date" that TreeTime prints out.
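
The manual filter described above can be sketched roughly like this: compute each tip's residual from the regression line, take the interquartile distance (IQD) of the residuals, and flag tips whose absolute residual exceeds `n_iqd` times the IQD. This is a minimal sketch of that description only. The function name `flag_outliers`, the choice of quartile method, and the example data are assumptions, and TreeTime's own implementation may differ in detail.

```python
from statistics import quantiles


def flag_outliers(tips, slope, intercept, n_iqd=3.0):
    """Return the names of tips whose residual exceeds n_iqd * IQD.

    tips is a list of (name, date, rtt_distance) tuples; slope and
    intercept describe the clock regression rtt = slope * date + intercept.
    """
    residuals = [rtt - (slope * date + intercept) for _, date, rtt in tips]
    # Quartiles with linear interpolation between data points
    # (an assumed convention, not necessarily the one TreeTime uses).
    q1, _, q3 = quantiles(residuals, n=4, method="inclusive")
    iqd = q3 - q1
    return [name for (name, _, _), r in zip(tips, residuals)
            if abs(r) > n_iqd * iqd]


# Made-up tips: five close to the line and one far above it.
example_tips = [
    ("a", 2010.0, 0.0101),
    ("b", 2012.0, 0.0119),
    ("c", 2014.0, 0.0141),
    ("d", 2016.0, 0.0159),
    ("e", 2018.0, 0.0181),
    ("outlier", 2016.5, 0.0500),
]

print(flag_outliers(example_tips, slope=1.0e-3, intercept=-2.0))
```

With these made-up numbers only the tip named `outlier` is flagged. Agreement between such a calculation and TreeTime's list of rejected tips says nothing about how the printed "apparent date" is derived, which is the open question here.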

I am attaching an RTT plot, created with the clock filter set to 3. I have drawn a ring around a couple of highlighted points. The printout from TreeTime for these two points is:
131.98 MH880866/Nepal/Dhading/2010, input date: 2010.792, apparent date: 2016.29
131.98 KX860060/India/Patna/2010/52, input date: 2010.5, apparent date: 2014.84

However, the apparent dates from the final clock model in this run are 2046 and 2045 respectively. OK, I understand that that is the model calculated without those points, so I run the program with CF=0, and so get the clock model for all the points; I can then see these two points indeed have residuals that are >3*IQD, but that clock model calculates the apparent years as ~2041 for both (note I am not relying on the clock model as printed on the RTT figure, but have added code to the program to print out the RTT plot data and the un-rounded slope and intercept).

Basically, I am not getting the same numbers as TreeTime when I think I should be, and I am worried that I have misunderstood something and that I am doing something wrong. I am writing a paper where I will say we used TreeTime to filter out excessively divergent sequences, and I need to be sure I have understood it! ;-)

Michael
root_to_tip_regression_note.pdf

@rneher
Member

rneher commented May 15, 2022

Hi Michael, terribly sorry for the late response. The reason here will be that the date output by TreeTime as "apparent date" includes information from the closest relative in the tree. So it is not just extrapolating the root-to-tip distance to a time, but looks at the closest dated common ancestor and extrapolates from there. This will often be much closer than if you extrapolated straight from the root.

best,
richard
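
To make the distinction in this reply concrete, here is a small sketch contrasting the two extrapolations. It is one reading of the explanation above, not TreeTime's actual code: the function name `date_from_ancestor`, the assumption that the date is the ancestor's date plus branch length divided by the clock rate, and every number are hypothetical.

```python
def date_from_ancestor(ancestor_date, branch_length, rate):
    """Extrapolate a tip date from its closest dated ancestor.

    Assumes the tip date is the ancestor's date plus the time needed to
    accumulate branch_length substitutions per site at the given rate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ancestor_date + branch_length / rate


# Hypothetical clock model and tip, for illustration only.
rate = 1.0e-3             # substitutions per site per year
intercept = -2.0          # regression intercept
ancestor_date = 2009.0    # date inferred for the closest ancestor
ancestor_rtt = 0.040      # the ancestor's clade sits far above the regression line
terminal_branch = 0.006   # branch from that ancestor to the tip

# Extrapolating straight from the root uses the full root-to-tip distance.
from_root = (ancestor_rtt + terminal_branch - intercept) / rate
# Extrapolating from the ancestor uses only the terminal branch.
from_ancestor = date_from_ancestor(ancestor_date, terminal_branch, rate)

print(round(from_root, 2), round(from_ancestor, 2))
```

In this made-up case the root-based estimate lands around 2046 while the ancestor-based one lands around 2015, which is the kind of gap discussed in this thread. The two only diverge when the ancestor's inferred date differs from what its own root-to-tip distance would imply.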
