You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi,
I had a question about the DSC_estimated_error_rates.pdf calculation. For one example sample, for one specific trinucleotide context, I tried to reconstruct the calculation:
ACG>T = 150 -> count from results.SSC-mismatches-Pyrimidine.triprofiles.tsv
ACG>T = 15 -> count from results.SSC-mismatches-Purine.triprofiles.tsv
ACG trinucleotide count = 7674995 -> count from results.trint_counts_and_ratio2genome.tsv tri_bg column, which is the number of times ACG/CGT base pairs are observed, which is also equivalent to the number of times ACG is observed in single-strand consensus sequences.
ACG error rate I see in DSC_estimated_error_rates.pdf is ~1.5e-10.
I infer this was calculated as: (150/[7674995/2])*(15/[7674995/2]),
which matches this code line in nanoseq_results_plotter.R:
y = (tri_obs_pyr / (tri_bg[tris] / 2)) * (tri_obs_pur / (tri_bg[tris] / 2));
However, why was the ACG trinucleotide count of 7674995 divided by 2 for both the pyrimidine and purine parts of the error rate calculation?
I would have thought that the calculation would be:
[probability of having an ACG>T change in one strand = 150 / 7674995] X [probability of having a CGT>A change in the other strand = 15 / 7674995]
= (150/7674995)*(15/7674995) = ~3.8e-11
This is lower by a factor of 4 from the DSC_estimated_error_rates.pdf estimate.
To visually illustrate:
The difference in my suggestion is I am considering the context count from results.trint_counts_and_ratio2genome.tsv tri_bg column as the denominator, rather than half of that, since the number in the tri_bg column is equivalent to the number of times the ACG trinucleotide context is observed in both strands. And I'm assuming the counts in SSC-mismatches-Pyrimidine.triprofiles.tsv contains the number of times ACG>T is observed in both strands (i.e. sum of ACG>T ssDNA calls in both strands), and likewise for CGT>A in SSC-mismatches-Purine.triprofiles.tsv.
It's a bit tricky to think through, so perhaps you can catch an error in my logic.
Thanks!
The text was updated successfully, but these errors were encountered:
Hi,
I had a question about the DSC_estimated_error_rates.pdf calculation. For one example sample, for one specific trinucleotide context, I tried to reconstruct the calculation:
ACG>T = 150 -> count from results.SSC-mismatches-Pyrimidine.triprofiles.tsv
ACG>T = 15 -> count from results.SSC-mismatches-Purine.triprofiles.tsv
ACG trinucleotide count = 7674995 -> count from results.trint_counts_and_ratio2genome.tsv tri_bg column, which is the number of times ACG/CGT base pairs are observed, which is also equivalent to the number of times ACG is observed in single-strand consensus sequences.
ACG error rate I see in DSC_estimated_error_rates.pdf is ~1.5e-10.
I infer this was calculated as: (150/[7674995/2])*(15/[7674995/2]),
which matches this code line in nanoseq_results_plotter.R:
y = (tri_obs_pyr / (tri_bg[tris] / 2)) * (tri_obs_pur / (tri_bg[tris] / 2));
However, why was the ACG trinucleotide count of 7674995 divided by 2 for both the pyrimidine and purine parts of the error rate calculation?
I would have thought that the calculation would be:
[probability of having an ACG>T change in one strand = 150 / 7674995] X [probability of having a CGT>A change in the other strand = 15 / 7674995]
= (150/7674995)*(15/7674995) = ~3.8e-11
This is lower by a factor of 4 from the DSC_estimated_error_rates.pdf estimate.
To visually illustrate:
The difference in my suggestion is I am considering the context count from results.trint_counts_and_ratio2genome.tsv tri_bg column as the denominator, rather than half of that, since the number in the tri_bg column is equivalent to the number of times the ACG trinucleotide context is observed in both strands. And I'm assuming the counts in SSC-mismatches-Pyrimidine.triprofiles.tsv contains the number of times ACG>T is observed in both strands (i.e. sum of ACG>T ssDNA calls in both strands), and likewise for CGT>A in SSC-mismatches-Purine.triprofiles.tsv.
It's a bit tricky to think through, so perhaps you can catch an error in my logic.
Thanks!
The text was updated successfully, but these errors were encountered: