In March 2021, Angela Early discovered that the markers in the data used in Taylor et al. 2020, "Identity-by-descent relatedness estimates with uncertainty characterize departure from isolation-by-distance between Plasmodium falciparum populations on the Colombian-Pacific coast", were misordered within chromosomes 5 and 7-14; see Check_snpname_order_hypothesis.R for confirmation. In addition, I noticed that the confidence intervals presented in Taylor et al. 2020 (and in Taylor et al. Genetics 2019) were computed assuming data were available on all 250 SNPs: the data simulated during the parametric bootstrap had no missing SNPs, even when the real data did.
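As a minimal sketch of the missing-data issue (this is not the code in Generate_mles_CIs.R; the simulator and all object names below are placeholders for illustration), a bootstrap replicate can be given the same missingness pattern as the real data by masking the simulated values wherever the observed data are NA:

```r
# Illustration only: toy data and a placeholder simulator, not the model used in the analysis.
set.seed(1)
n_snps <- 250

# Pretend observed data with 30 missing SNPs
observed <- sample(c(0, 1), n_snps, replace = TRUE)
observed[sample(n_snps, 30)] <- NA
missing_mask <- is.na(observed)

# Placeholder for whatever generates one parametric bootstrap replicate
simulate_replicate <- function(n) sample(c(0, 1), n, replace = TRUE)

# Without the correction: replicate has data on all 250 SNPs
replicate_complete <- simulate_replicate(n_snps)

# With the correction: replicate is missing the same SNPs as the real data
replicate_masked <- replicate_complete
replicate_masked[missing_mask] <- NA

n_complete <- sum(!is.na(replicate_complete))
n_masked <- sum(!is.na(replicate_masked))
```

Estimates refitted to replicate_masked are based on the same number of SNPs as the real estimate, so the resulting interval is not artificially narrow.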
I thus backed up the misordered data and the key files needed to regenerate the results in Taylor et al. 2020 (see RData/PlosGen2020) and re-ran the analyses of Taylor et al. 2020 using reordered markers (SNPData.RData and hmmInput.txt, generated by DataFormat.R), while accounting for missing data in the confidence intervals. The results of the marker-reordered analysis are qualitatively the same as those presented in Taylor et al. 2020 (stored in RData/PlosGen2020), with the following differences (see Compare_components_missordered_ordered.R).
- CC count: In the marker-reordered analysis we count 45 clonal components (CCs), whereas previously there were 46.
- Identical CCs: Among the original 46, 44 are identical to 44 of the marker-reordered analysis; however, some of their labels are offset by one: CC24 is now CC23, and CCs 26-46 are now CCs 25-45.
- Different CCs: Among the original 46, two differ: CC23 disappeared (its two samples are no longer considered clonal), and a sample ("3069D0R") was added to what was CC25, now CC24.
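The relabelling can be written out as a lookup from old to new labels. This is a sketch built only from the differences listed here; it assumes that CCs 1-22 keep their labels, and the object names are made up for illustration:

```r
# Old labels from the original (misordered) analysis
old_labels <- paste0("CC", 1:46)
new_labels <- setNames(rep(NA_character_, length(old_labels)), old_labels)

# Assumed unchanged: CC1 to CC22
new_labels[paste0("CC", 1:22)] <- paste0("CC", 1:22)

# CC23 has no counterpart (its two samples are no longer considered clonal): stays NA

# Identical components whose labels shift down by one
new_labels["CC24"] <- "CC23"
new_labels[paste0("CC", 26:46)] <- paste0("CC", 25:45)

# Same component plus the added sample 3069D0R
new_labels["CC25"] <- "CC24"

n_new <- sum(!is.na(new_labels))
```

Here n_new counts the old components that have a counterpart in the marker-reordered analysis.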
To reproduce the marker-reordered analysis using processed data, complete the following steps.
- Set the working directory to ColombianBarcode/Code
- Run Generate_mles_CIs.R (~12 hr, outputs mles_CIs.RData)
- Run Format_mle_df.R (outputs All_results.RData)
- Run in no particular order
  - Generate_transportdistances_city.R (outputs All_W_results_c.RData)
  - Generate_proportions.R (outputs proportions_time.RData and proportions_geo.RData)
  - Generate_proportions_sensitivity.R (outputs proportions_sensitivities.RData)
- Run in no particular order
  - Plot_proportions_distances.R (outputs Proportions_and_W_distance.pdf)
  - Plot_proportions_sensitivity.R (outputs Proportions_sensitivity.pdf)
  - Plot_graph_components.R (outputs All_CCs.pdf, Graphs_timespace.pdf and Graphs.pdf)
  - Knit to PDF Data_summary.Rmd (outputs Data_summary.pdf)
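The steps above could be chained with a small driver like the one below. This is a hypothetical helper, not a file in this repository, and it assumes each script can run in a fresh R session started from ColombianBarcode/Code, with scripts passing results to one another only through the .RData files they write. By default it does a dry run that only lists the scripts in order.

```r
# Hypothetical driver (not part of the repository).
# Each list element is one step; scripts within a step can run in any order.
steps <- list(
  "Generate_mles_CIs.R",  # ~12 hr
  "Format_mle_df.R",
  c("Generate_transportdistances_city.R",
    "Generate_proportions.R",
    "Generate_proportions_sensitivity.R"),
  c("Plot_proportions_distances.R",
    "Plot_proportions_sensitivity.R",
    "Plot_graph_components.R")
)

run_steps <- function(steps, dry_run = TRUE) {
  ran <- character(0)
  for (scripts in steps) {
    for (script in scripts) {
      if (dry_run) {
        message("Would run: Rscript ", script)
      } else {
        # Run each script in its own R process and stop at the first failure
        status <- system2("Rscript", script)
        if (status != 0) stop("Script failed: ", script)
      }
      ran <- c(ran, script)
    }
  }
  ran
}

# Dry run: returns the scripts in the order they would be executed
planned <- run_steps(steps)
```

To execute for real, call run_steps(steps, dry_run = FALSE) from ColombianBarcode/Code, then knit Data_summary.Rmd, for example with rmarkdown::render("Data_summary.Rmd", output_format = "pdf_document") if the rmarkdown package is installed.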
The extended analysis is based on data with correctly ordered markers (see above) and accounts for missing data in confidence intervals. From within Code/Extended_analysis/, complete the following steps to reproduce the extended analysis (see Code/Extended_analysis/Analysis_summary.pdf for more details) using the processed data files metadata_extended.RData (generated by Format_metadata_extended.R) and snpdata_extended.RData (generated by Format_snpdata_extended.R).
- Run Generate_mles_CIs_extended.R
- Run Checking_mles.R to check mles against those of Taylor et al. 2020 and to check for any inconsistencies in lower confidence interval limits near zero
- Run Filter_mles_CIs_extended.R to filter clonal estimates with spurious certainty
- Run Format_mles_CIs_extended.R to add metadata to relatedness estimates
- Run Generate_sids_remv.R to generate a list of samples with one or more NA comparisons. These are samples that need removing before generating graphs
- Run Generate_components.R
- Run Generate_relatedness_to_CCs.R to compute average relatedness between old CCs and new samples, between old CCs and CCs based on new samples only, and between CCs based on all data with and without singletons
- Run Compare_components.R to see how the extended data set clonally clusters with clonal components reported in the marker-reordered analysis of the samples that feature in Taylor et al. 2020 and to check how many components are cliques. Run this script while checking against Clonal_components.pdf, which is generated by Generate_components.R.
- Run Plot_relatedness_graph.R and Plot_relatedness_to_CCs.R
- Run Plot_extended_components.R and Plot_cc_7.R to see evidence of clonal propagation
- Run Generate_LonLats.R, Generate_fraction_highly_related.R, Plot_fraction_highly_related.R and Generate_regression_trends.R for connectivity analysis
- Run Analysis_summary.Rmd to summarise data on which analyses were done
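To illustrate the sample-removal step (samples with one or more NA comparisons are removed before graphs are generated), here is a minimal sketch on toy data. It is not the code in Generate_sids_remv.R, and the data frame and its column names (individual1, individual2, rhat) are assumptions made for the example:

```r
# Toy pairwise relatedness estimates; column names are hypothetical
mles <- data.frame(
  individual1 = c("s1", "s1", "s1", "s2", "s2", "s3"),
  individual2 = c("s2", "s3", "s4", "s3", "s4", "s4"),
  rhat = c(0.99, 0.10, NA, 0.25, 0.50, 0.05),
  stringsAsFactors = FALSE
)

# Samples that feature in at least one comparison with an NA estimate
na_rows <- is.na(mles$rhat)
sids_remv <- unique(c(mles$individual1[na_rows], mles$individual2[na_rows]))

# Keep only comparisons in which neither sample is flagged for removal
keep <- !(mles$individual1 %in% sids_remv | mles$individual2 %in% sids_remv)
mles_complete <- mles[keep, ]
```

In this toy example the single NA comparison (s1 vs s4) flags both s1 and s4, so only the s2 vs s3 comparison remains.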
Other files include summarise_mles.R and generate_counts_table.R (used in Analysis_summary.Rmd); scripts to do with data checking (Check_snpname_order_hypothesis.R); scripts in Archive, which are obsolete but kept for reference; and Generate_and_plot_extended_components_using_snp_cutoff.R, used to assess the value of confidence intervals among samples included in the extended component analysis.
- ensure ignored files are synced across work and personal laptops and finalise .gitignore