This repository contains all code used in the paper “Uncovering the hidden structure of dynamic T cell composition in peripheral blood during cancer immunotherapy: a topic modeling approach”.
Below are the three core R packages we used for the LDA analysis:
library(topicmodels)
library(slam)
library(tidytext)
We use topicmodels for model inference, slam for preparing the required input data, and tidytext for extracting the model output. These three packages are what you need to reproduce the LDA analysis described in the paper on other datasets.
The scripts also use other R packages, such as Seurat for clustering and UMAP visualization and ComplexHeatmap for heatmaps. You can substitute other clustering methods (for example, FlowSOM) or other visualization tools.
In text mining, LDA takes a document-by-term count matrix as input: each row represents a document, each column represents a term, and each entry is the number of times that term occurs in that document (a word is a single occurrence of a term). Motivated by the similarities between text mining and single-cell analysis, we treat cells as words, cell types as terms, patient samples as documents, and biological processes as topics.
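As a minimal sketch of how the three packages fit together, the example below fits LDA to a small simulated count matrix. The matrix values, the sample and cell type names, and the choice of `K = 2` are assumptions made for illustration only; they are not taken from the paper.

```r
library(topicmodels)
library(slam)
library(tidytext)

set.seed(1)

# Toy sample-by-cell-type count matrix (simulated, for illustration only).
# Rows play the role of documents (samples), columns the role of terms (cell types).
counts <- matrix(
  rpois(6 * 5, lambda = 50),
  nrow = 6,
  dimnames = list(paste0("sample", 1:6), paste0("celltype", 1:5))
)

# slam: convert the dense matrix to the sparse triplet format accepted by LDA()
stm <- as.simple_triplet_matrix(counts)

# topicmodels: fit an LDA model with K topics (K = 2 is an arbitrary choice here)
K <- 2
fit <- LDA(stm, k = K, method = "VEM", control = list(seed = 1))

# tidytext: extract the fitted parameters as tidy tables
beta  <- tidy(fit, matrix = "beta")   # one row per topic and cell type
gamma <- tidy(fit, matrix = "gamma")  # one row per sample and topic

head(beta)
head(gamma)
```

Here `beta` describes how strongly each cell type is represented in each topic, and `gamma` gives the topic proportions of each sample, so the `gamma` values of one sample sum to 1.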
Before applying the LDA model to single-cell datasets, we need to prepare the cell type-by-sample count matrix as the input of LDA.
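The sketch below shows one way to build such a count matrix from a per-cell table of sample and cell type labels. The data frame `cells`, its column names, and all labels are hypothetical. Note that `LDA()` in topicmodels expects documents in rows, so the matrix is built with samples as rows and cell types as columns.

```r
library(slam)

# Hypothetical per-cell table: one row per cell, with its sample and cell type label
cells <- data.frame(
  sample    = c("P1_week0", "P1_week0", "P1_week0", "P1_week3",
                "P1_week3", "P2_week0", "P2_week0"),
  cell_type = c("CD4 naive", "CD8 effector", "CD4 naive", "CD8 effector",
                "CD8 effector", "CD4 naive", "Treg")
)

# Cross-tabulate: rows = samples (documents), columns = cell types (terms)
tab <- table(cells$sample, cells$cell_type)

# Convert the table to a plain integer matrix that keeps the row and column names
count_mat <- matrix(
  as.integer(tab),
  nrow = nrow(tab),
  dimnames = list(rownames(tab), colnames(tab))
)

# Sparse representation that can be passed to topicmodels::LDA()
stm <- as.simple_triplet_matrix(count_mat)

count_mat
```

Each entry of `count_mat` is the number of cells of a given type in a given sample, which mirrors the term counts per document in text mining.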
One common approach to obtaining the cell type count matrix is to pool all cells together and cluster them. Before clustering the pooled cells, you may want to check whether there is a batch effect between samples.
X50_single_cell_analysis.R contains all the code we used to cluster the 17M+ cells.
The Louvain method in the Seurat package was used to cluster cells and prepare the cell type-by-sample matrix.
Cell types were manually annotated based on their marker expression.
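To illustrate the annotation step, the sketch below summarizes marker expression per cluster and then maps cluster IDs to cell type names. It is not the code from X50_single_cell_analysis.R: the marker names, the simulated intensities, and the cluster-to-label map are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated marker intensities for 300 cells in 3 clusters (hypothetical values)
expr <- data.frame(
  cluster = rep(c("0", "1", "2"), each = 100),
  CD4     = c(rnorm(100, 3), rnorm(100, 0), rnorm(100, 3)),
  CD8     = c(rnorm(100, 0), rnorm(100, 3), rnorm(100, 0)),
  FOXP3   = c(rnorm(100, 0), rnorm(100, 0), rnorm(100, 3))
)

# Median expression of each marker within each cluster, used to guide annotation
med <- aggregate(cbind(CD4, CD8, FOXP3) ~ cluster, data = expr, FUN = median)
print(med)

# Manual annotation: a named vector mapping cluster ID to cell type (hypothetical labels)
labels <- c("0" = "CD4 T", "1" = "CD8 T", "2" = "Treg")
expr$cell_type <- unname(labels[as.character(expr$cluster)])

table(expr$cell_type)
```

Once every cell carries a `cell_type` label, cross-tabulating it against the sample ID gives the count matrix used as LDA input.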
For more detailed usage of the LDA method, please see the tutorial, which shows two examples of applying LDA to single-cell datasets:
- scRNA-seq data from liver cancer patients (data from a Nature paper).
- Longitudinal flow cytometry data from melanoma patients (this paper).
Additional data used to generate figures are provided in:
Peng, Xiyu (2023), “flow cytometry dataset of melanoma PBMC”, Mendeley Data, V1, doi: 10.17632/d7nkgfhc8z.1
If you find the method useful, please cite:
Peng, X., Lee, J., Adamow, M., Maher, C., Postow, M. A., Callahan, M. K., Panageas, K. S., & Shen, R. (2023). A topic modeling approach reveals the dynamic T cell composition of peripheral blood during cancer immunotherapy. Cell Reports Methods, 3(8), 100546. https://doi.org/10.1016/j.crmeth.2023.100546
If you have any problems, please contact:
Xiyu Peng ([email protected], [email protected])