Raw data and analysis scripts for the ClonoCluster paper. Includes additional figures and analyses that did not appear in the publication.
For the distributed open-source software itself, including worked examples, see the ClonoCluster repository.
In addition to installing ClonoCluster, this workflow requires the following R packages from CRAN:
- magrittr
- data.table
- ggplot2
- entropy
- WebGestaltR
Due to size limitations, the raw data for this study cannot be directly hosted on GitHub. I have compressed it and uploaded it with the release of this package. You will need to download it from the GitHub release and install pixz to uncompress it.
pixz -d ClonoCluster_raw_data.tpxz
Then untar the package.
tar -xvf ClonoCluster_raw_data.tar
Now copy the data folder into the repository.
cp -r Data_genes ./ClonoCluster_paper/Data/
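The three commands above can be combined into one guarded script. This is a minimal sketch rather than part of the repository: the function names `tar_name_for` and `unpack_raw_data` are hypothetical, and it assumes the downloaded archive sits in the current directory next to the ClonoCluster_paper checkout.

```shell
#!/bin/sh
# Sketch only: wraps the pixz, tar, and cp steps with basic checks.
# ARCHIVE and DEST mirror the commands above; the function names are hypothetical.
set -eu

ARCHIVE="ClonoCluster_raw_data.tpxz"
DEST="./ClonoCluster_paper/Data"

# Map an archive name to the tar file used in the untar step,
# e.g. ClonoCluster_raw_data.tpxz -> ClonoCluster_raw_data.tar
tar_name_for() {
  printf '%s\n' "${1%.tpxz}.tar"
}

# Decompress, untar, and copy Data_genes into the repository's Data folder.
# Arguments: $1 = archive path, $2 = destination Data directory.
unpack_raw_data() {
  if ! command -v pixz >/dev/null 2>&1; then
    echo "pixz is not installed or not on PATH" >&2
    return 1
  fi
  if [ ! -f "$1" ]; then
    echo "archive not found: $1" >&2
    return 1
  fi
  if [ ! -d "$2" ]; then
    echo "destination directory not found: $2" >&2
    return 1
  fi
  pixz -d "$1"
  tar -xvf "$(tar_name_for "$1")"
  # -r is needed because Data_genes is a directory.
  cp -r Data_genes "$2"/
}

# Run only when the archive is present, so sourcing this file has no side effects.
if [ -f "$ARCHIVE" ]; then
  unpack_raw_data "$ARCHIVE" "$DEST"
fi
```

Checking for pixz and the destination folder first means a missing prerequisite is reported before any file is decompressed or moved.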
There are 9 replicates from six sources in the raw data. Their short names are as follows:
- YG1-3: WM989 melanoma cells treated with low and high doses of vemurafenib (one low dose and two high dose), link.
- Kleind2ws: murine hematopoietic stem cells differentiating in vitro, harvested at day 2, link.
- CJ: human induced pluripotent stem cell line directed toward cardiomyocyte fate, link.
- WM9891-2: WM983b melanoma cells treated with vemurafenib (2 replicates), link.
- MDA1-2: MDA breast cancer cells treated with paclitaxel, link.
- Data/ - raw data folder
  - Data_barcodes/ - barcode assignments for each dataset, raw data
  - Data_genes/ - normalized/scaled count matrices for each dataset in gene by cell format, raw data, included as a compressed file with the GitHub release of this package
- Paper/ - scripts to process raw data and generate figures
  - Run_full_analysis.R - script that runs the complete analysis and generates all processed data and figures by running the scripts in the extractionScripts/ and plotScripts/ directories. Fully annotated and may be run line by line or sourced from the package root directory.
    setwd("ClonoCluster_paper/")
    source("Run_full_analysis.R")
  - extractionScripts/ - scripts to generate processed data
    - Constants.R - script with paths and variables for all analyses
    - Long_alluvia.R - generate cluster assignments for a range of alphas, and plot long Sankeys
    - Marker_analysis.R - identify cluster markers for hybrid clusters, reorganization markers, and perform gene set overrepresentation analysis
  - extractedData/ - processed data, generated from the raw data by the scripts in extractionScripts/
    - *_cluster_assignments.txt - hybrid clustering assignments, generated by Long_alluvia.R
    - *_auc_linclust.txt - ROC-derived AUC values for all possible cluster markers for each clustering level, from Marker_analysis.R
    - *_auc_barcodes.txt - AUC to identify reorganization markers between a paired transcriptome cluster and a hybrid cluster, from Marker_analysis.R
    - *_rearrangment_ORA.txt - output from WebGestaltR overrepresentation analysis of reorganization markers, from Marker_analysis.R
  - plotScripts/ - scripts to generate figures from the processed data
    - cluster_size_plots.R - generate line graphs of alpha vs number of clusters for all samples, as in Figure 1D and Figure S3
    - Confusion_stats.R - generate boxplots of alpha vs Cohen's kappa, as in Figure S2B
    - Fig4_clusters.R - generate warped UMAPs, as in Figure 4B-C
    - Fig4.R - generate warp factor UMAPs and PC distributions from simulated data, as in Figure 4A and Figure S6
    - geneset_hm.R - generate heatmap from overrepresentation analysis results, as in Figure 3C
    - grid_graph.R - generate simulated representation of network graphs, as in Figure S1B
    - Heme_comp.R - plots of entropy of hematopoietic cell types from Weinreb et al., as in Figure S7
    - Labeled_umaps.R - generate UMAPs and Sankey plots showing combined warp factor and hybrid clustering, as in Figure 5
    - Marker_box_and_venn.R - generate boxplots of top cluster marker strength for all samples, as in Figure 2A and Figure S5B, and Venn diagrams of top cluster marker overlap, as in Figure S4
    - Marker_sankey.R - Sankeys of marker positivity, as in Figure 2B-D
    - Marker_turnover_hm.R - heatmaps of top cluster marker AUC at each alpha level, as in Figure S5A
    - Model_edge_weights.R - generate plots of network graph edge weight vs alpha, as in Figure S1A
    - Reorg_sankey.R - Sankey plots showing a schematic of how reorganization marker AUCs are determined, as in Figure 3B
    - Short_alluvia.R - Sankey plots of just 3 clustering levels (transcriptome, high alpha, and low alpha), as in Figure 1E
  - plots/ - output from plot scripts, divided into subfolders
    - AUC_sankey/ - Marker_sankey.R output, Sankeys of marker positivity across alpha levels
    - Reorg_sankey/ - Reorg_sankey.R output, representative Sankeys of reorganization analysis
    - cluster_size_plots/ - cluster_size_plots.R output, plots of cluster number vs alpha level for all samples
    - Confusion/ - Confusion_stats.R output, Cohen's kappa vs alpha level plots for all samples
    - entropy/ - Heme_comp.R output, single plot of entropy for each cell type with different clustering methods
    - Gene_umaps/ - Labeled_umaps.R output, UMAPs showing high alpha cluster of interest and contributing transcriptome clusters
    - Long_alluvia/ - Long_alluvia.R output, Sankey for all samples for all alpha iterations
    - Short_alluvia/ - Short_alluvia.R output, Sankey for all samples for transcriptome, high alpha, and low alpha levels
    - top_markers_box/ - Marker_box_and_venn.R output, boxplots of overall cluster marker AUC
    - Turnover_hm/ - Marker_turnover_hm.R output, heatmaps of the AUC of the union of top cluster markers at all alpha levels
    - Venn/ - Marker_box_and_venn.R output, Venn diagrams of top cluster markers for all samples
    - warped/ - Fig4_clusters.R output, demonstrations of UMAP warp factor on the datasets
    - reorg_hm/ - geneset_hm.R output, heatmaps of gene set overrepresentation analysis results
    - simulations/ - grid_graph.R and Fig4.R output, simulated data results
  - finalFigures/ - PDFs of figures for the final manuscript
    - F1.pdf - C from Long_alluvia.R, D from cluster_size_plots.R, and E from Short_alluvia.R output
    - F2.pdf - A from Marker_box_and_venn.R, B-D from Marker_sankey.R output
    - F3.pdf - B from Reorg_sankey.R, C from geneset_hm.R output
    - F4.pdf - A from Fig4.R, B-C from Fig4_clusters.R output
    - F5.pdf - from Labeled_umaps.R output
    - Figure_S1.pdf - A from Model_edge_weights.R and B from grid_graph.R output
    - Figure_S2.pdf - A from Long_alluvia.R and B from Confusion_stats.R output
    - Figure_S3.pdf - from cluster_size_plots.R output
    - Figure_S4.pdf - from Marker_box_and_venn.R output
    - Figure_S5.pdf - A from Marker_turnover_hm.R, B from Marker_box_and_venn.R output
    - Figure_S6.pdf - from Fig4.R output
    - Figure_S7.pdf - from Heme_comp.R output
    - Dataset_table.pdf - PDF of Table 1, datasets used in the study
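Before starting a long run, it can help to confirm that the folders described above are in place. The following is a minimal sketch: `check_layout` is a hypothetical helper, not part of the repository, and the four paths it tests are taken from the listing above and from the commands in this README. Adjust them if your checkout is organized differently.

```shell
#!/bin/sh
# Sketch only: confirm the directories named in the listing above exist.
# check_layout is a hypothetical helper, not part of the repository.

# Arguments: $1 = path to the ClonoCluster_paper checkout.
# Prints each missing directory to stderr and returns non-zero if any is absent.
check_layout() {
  status=0
  for dir in Data/Data_barcodes Data/Data_genes Paper/extractionScripts Paper/plotScripts; do
    if [ ! -d "$1/$dir" ]; then
      echo "missing: $1/$dir" >&2
      status=1
    fi
  done
  return "$status"
}

# Example: check the checkout in the current directory, if there is one.
if [ -d ClonoCluster_paper ]; then
  check_layout ClonoCluster_paper || echo "layout incomplete; see messages above" >&2
fi
```

A missing Data/Data_genes folder usually means the raw data archive has not yet been downloaded and copied into the repository.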
The entire analysis may be rerun from the package directory with source("Run_full_analysis.R"). You may also run portions of the analysis following the walkthrough below:
- Set working directory.
setwd("ClonoCluster_paper")
- Establish needed variables with sample names and alpha analysis values.
source("Paper/extractionScripts/Constants.R")
- Get clusters across range of alphas from 0 to 1 and save intermediate output in extracted data folder. Plot long Sankey as in Figure 1C and Figure S2A.
source("Paper/extractionScripts/Long_alluvia.R")
- Perform marker identification for cluster and reorganization markers as well as gene set overrepresentation analysis, saves to extracted data folder.
source("Paper/extractionScripts/Marker_analysis.R")
- Generate plots of cluster sizes as in Figure 1D and Figure S3.
source("Paper/plotScripts/cluster_size_plots.R")
- Generate Cohen's kappa plots as in Figure S2B.
source("Paper/plotScripts/Confusion_stats.R")
- Generate short Sankeys as in Figure 1E.
source("Paper/plotScripts/Short_alluvia.R")
- Generate top cluster marker fidelity boxplots (Figure 2A and Figure S5B) and Venn diagrams of all marker overlap (Figure S4).
source("Paper/plotScripts/Marker_box_and_venn.R")
- Generate heatmaps of marker fidelity for top cluster markers as in Figure S5A.
source("Paper/plotScripts/Marker_turnover_hm.R")
- Generate alluvia of interesting markers as in Figure 2B-D.
source("Paper/plotScripts/Marker_sankey.R")
- Generate sample Sankeys for the reorganization analysis (Figure 3B).
source("Paper/plotScripts/Reorg_sankey.R")
- Generate reorganization marker overrepresentation analysis heat maps (Figure 3C).
source("Paper/plotScripts/geneset_hm.R")
- Generate warped UMAPs from simulated data and real data (Figure 4).
source("Paper/plotScripts/Fig4.R")
source("Paper/plotScripts/Fig4_clusters.R")
- Generate UMAP and Sankeys as in Figure 5, showing the combined hybrid clustering and warp factor.
source("Paper/plotScripts/Labeled_umaps.R")
- Generate supplemental analyses showing how ClonoCluster works (Figure S1).
source("Paper/plotScripts/Model_edge_weights.R") # show curves for how model influences edge weight at beta = 0.1
source("Paper/plotScripts/grid_graph.R") # simulation of network graphs with alpha
- Generate entropy analysis for hematopoietic sample.
source("Paper/plotScripts/Heme_comp.R")
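To rerun a handful of steps non-interactively, the source() calls above can be chained into a single `Rscript` invocation so that every script shares one R session. This is a minimal sketch to be run from inside ClonoCluster_paper. `build_r_expr` and `run_steps` are hypothetical helpers, and the sketch assumes (this is not confirmed by the scripts themselves) that Constants.R must be sourced first and that plot scripts can read earlier results from the extracted data folder.

```shell
#!/bin/sh
# Sketch only: build_r_expr and run_steps are hypothetical helpers.

# Build one R expression that sources Constants.R and then each given script.
# Arguments: script paths relative to the repository root.
build_r_expr() {
  expr='source("Paper/extractionScripts/Constants.R")'
  for script in "$@"; do
    expr="$expr; source(\"$script\")"
  done
  printf '%s\n' "$expr"
}

# Run the chained expression in a single R session.
run_steps() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH" >&2
    return 1
  fi
  Rscript -e "$(build_r_expr "$@")"
}

# Example (run from inside ClonoCluster_paper):
#   run_steps Paper/plotScripts/Short_alluvia.R Paper/plotScripts/Confusion_stats.R
```

Keeping the steps in one session matters because variables set by Constants.R would not carry over between separate `Rscript` processes.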
For questions, open an issue on this GitHub repository, contact @leeprichman or @arjunrajlaboratory, or email me or Dr. Arjun Raj.