Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update differential_gene_expression.ipynb #256

Merged
merged 1 commit into from
Dec 27, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 18 additions & 20 deletions jupyter-book/conditions/differential_gene_expression.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@
"source": [
"A recent study highlighted the issue of pseudoreplication where inferential statistics is applied to biological replicates which are not statistically independent. Failing to account for the inherent correlation of replicates (cells from the same individual) inflates the false discovery rate (FDR){cite}`de:Squair2021, Zimmerman2021, Junttila2022`. Therefore, batch effect correction or the aggregation of cell-type-specific expression values within an individual through either a sum, mean or random effect per individual, that is pseudobulk generation, should be applied prior to DGE analysis to account for within-sample correlations {cite}`Zimmerman2021`. Generally, both, pseudobulk methods with sum aggregation such as edgeR, DESeq2, or Limma{cite}`Ritchie2015` and mixed models such as MAST with random effect setting were found to be superior compared to naive methods, such as the popular Wilcoxon rank-sum test or Seurat’s {cite}`de:Hao2021` latent models, which do not account for them{cite}`Junttila2022`.\n",
"\n",
"In a matters arising to the Zimmerman paper, Murphy et al. critically examined the Zimmerman benchmarking strategy and improved it {cite}`Murphy_2022`. They come to the conclusion that pseudobulk methods perform best but whether sum or mean aggregation works better requires further investigation."
"In matters arising from the Zimmerman paper, Murphy et al. critically examined the Zimmerman benchmarking strategy and improved it {cite}`Murphy_2022`. They came to the conclusion that pseudobulk methods perform best but whether sum or mean aggregation works better requires further investigation."
]
},
{
Expand All @@ -83,7 +83,7 @@
"id": "abca2aa3"
},
"source": [
"Hence, in this notebook, we demonstrate how to use two tools for DE analysis: edgeR with a quasi likelihood test and MAST with random effects. Since both edgeR and MAST are implemented in R, we use the anndata2ri package to be able to simultaneously work with AnnData objects in Python and SingeCellExperiment objects in R.\n",
"Hence, in this notebook, we demonstrate how to use two tools for DE analysis: edgeR with a quasi-likelihood test and MAST with random effects. Since both edgeR and MAST are implemented in R, we use the anndata2ri package to be able to simultaneously work with AnnData objects in Python and SingeCellExperiment objects in R.\n",
"\n",
"For each of the two methods, we show two use cases: how to run the analysis on full data and perform testing on multiple cell types and on one specific cell type."
]
Expand Down Expand Up @@ -537,11 +537,11 @@
"id": "cfbb967a"
},
"source": [
"edgeR is a differential gene expression testing tool implemented in R which was initially designed for bulk gene expression data. It implements a wide range of statistical methodology based on the negative binomial distribution, empirical Bayes estimation, exact tests, generalized linear models (GLMs) and quasi-likelihood tests.\n",
"edgeR is a differential gene expression testing tool implemented in R which was initially designed for bulk gene expression data. It implements a wide range of statistical methodologies based on the negative binomial distribution, empirical Bayes estimation, exact tests, generalized linear models (GLMs) and quasi-likelihood tests.\n",
"\n",
"For more details please read the original publication{cite}`de:Robinson2010`.\n",
"\n",
"Here, we will be using the quasi likelihood test because it accounts for the uncertainty of the dispersion estimates. In contrast, the exact test assumes that the estimated dispersion is the true value, which can result in some inaccuracy. Additionally, the quasi likelihood GLMs are more flexible when it comes to experimental design."
"Here, we will be using the quasi-likelihood test because it accounts for the uncertainty of the dispersion estimates. In contrast, the exact test assumes that the estimated dispersion is the true value, which can result in some inaccuracy. Additionally, the quasi-likelihood GLMs are more flexible when it comes to experimental design."
]
},
{
Expand Down Expand Up @@ -635,7 +635,7 @@
},
"source": [
"Now, let's define the function we need to aggregate single cells into pseudo-replicates:\n",
"- `aggregate_and_filter` is a function that creates an AnnData object with one pseudo-replicate for each donor for a specified subpopulation from the original single-cell AnnData object. Here we also filter out donors that have fewer than 30 cells for the specified population.\n",
"- `aggregate_and_filter` is a function that creates an AnnData object with one pseudo-replicate for each donor for a specified subpopulation from the original single-cell AnnData object. Here, we also filter out donors that have fewer than 30 cells for the specified population.\n",
"- by changing the `replicates_per_patient` parameter, several (n) pseudo-replicates can be created for each sample; cells are then split into n subsets of roughly equal sizes."
]
},
Expand Down Expand Up @@ -741,7 +741,7 @@
" print(\"\")\n",
" # normalize\n",
" y <- calcNormFactors(y)\n",
" # create a vector that is concatentation of condition and cell type that we will later use with contrasts\n",
" # create a vector that is a concatenation of condition and cell type that we will later use with contrasts\n",
" group <- paste0(colData(adata_)$label, \".\", colData(adata_)$cell_type)\n",
" replicate <- colData(adata_)$replicate\n",
" # create a design matrix: here we have multiple donors so also consider that in the design matrix\n",
Expand All @@ -759,7 +759,7 @@
"id": "c9599906",
"metadata": {},
"source": [
"Now we defined all the funtions we need, so we can proceed with creating pseudobulks. We might want to look at available metadata later and therefore keep it in the AnnData object."
"Now we have defined all the functions we need, so we can proceed with creating pseudobulks. We might want to look at available metadata later and therefore keep it in the AnnData object."
]
},
{
Expand Down Expand Up @@ -875,7 +875,7 @@
"id": "e6303365"
},
"source": [
"The validity of differential gene expression results highly depend on the capturement of the major axis of variations in the statistical model. Intermediate data exploration steps such as principal component analysis (PCA) or multidimensional scaling (MDS) on pseudobulk samples allow for the identification of the sources of variation and thus can guide the construction of corresponding design and contrast matrices that model the data{cite}`Law2020`. \n",
"The validity of differential gene expression results highly depends on the capture of the major axis of variations in the statistical model. Intermediate data exploration steps such as principal component analysis (PCA) or multidimensional scaling (MDS) on pseudobulk samples allow for the identification of the sources of variation and thus can guide the construction of corresponding design and contrast matrices that model the data{cite}`Law2020`. \n",
"\n",
"Failing to account for multiple sources of biological variability for experiments which include biological replicates will inflate the FDR{cite}`Thurman2021`,{cite}`Lähnemann2020`. While increasing the number of cells per individual increases the precision, it has a limited effect on the power for the detection of differences across individuals. Therefore, the best way to increase statistical power is to increase the number of independent experimental samples{cite}`Zimmerman2021`. \n",
"\n",
Expand All @@ -889,7 +889,7 @@
"id": "5d9dbe8b"
},
"source": [
"We perform very basic EDA on the created pseudo-replicates to check if some patients/pseudobulks are outliers that we need to exclude not to bias the DE results. We save the raw counts in the `'counts'` layer, then normalize the counts and calculate the PCA coordinates for the normalized pseudobulk counts."
"We perform very basic EDA on the created pseudo-replicates to check if some patients/pseudobulks are outliers that we need to exclude so as not to bias the DE results. We save the raw counts in the `'counts'` layer, then normalize the counts and calculate the PCA coordinates for the normalized pseudobulk counts."
]
},
{
Expand Down Expand Up @@ -965,9 +965,9 @@
"id": "b4c14502",
"metadata": {},
"source": [
"We observe separation of cell types on the PCA plots as well as the separation into stimulated and unstimulated cells. Other covariates (batch) do not seem to be clearly correlated with the PCA components so we do not include any of them into our design matrix.\n",
"We observe separation of cell types on the PCA plots as well as the separation into stimulated and unstimulated cells. Other covariates (batch) do not seem to be clearly correlated with the PCA components so we do not include any of them in our design matrix.\n",
"\n",
"As mentioned above, edgeR takes raw couts as input, so we put counts back into the `.X` field before we proceed."
"As mentioned above, edgeR takes raw counts as input, so we put counts back into the `.X` field before we proceed."
]
},
{
Expand Down Expand Up @@ -1001,7 +1001,7 @@
"id": "9a98047a",
"metadata": {},
"source": [
"We run the pipeline on CD14+ Monocytes subset of the data, as it was shown in the paper that the highest number of DE genes was indetified in this subpopulation."
"We run the pipeline on CD14+ Monocytes subset of the data, as it was shown in the paper that the highest number of DE genes was identified in this subpopulation."
]
},
{
Expand Down Expand Up @@ -1101,7 +1101,7 @@
"id": "8d8bc75a"
},
"source": [
"Since we did not enter our analysis with a prior assumption that a specific gene will be up- or downregulated, we need visualizations to make sense of the DGE results. MDS plots allow for a high level overview. Commonly, we expect a separation between samples from different conditions as can be seen in the following plot for our results."
"Since we did not enter our analysis with a prior assumption that a specific gene will be up or downregulated, we need visualizations to make sense of the DGE results. MDS plots allow for a high level overview. Commonly, we expect a separation between samples from different conditions as can be seen in the following plot for our results."
]
},
{
Expand Down Expand Up @@ -1145,7 +1145,7 @@
"\n",
"Genes with low abundance generally exhibit larger BCV as read count measurements are more uncertain for low abundance genes. On the other hand, genes with high average expression are quantified more reliably, thus they generally have lower variability and hence lower BCV. Use this plot to detect outlier genes or pinpoint other experimental factors that may need to be reflected in the design matrix. For example, a group of genes with high average expression and high BCV appearing in the top right corner of the plot can flag experimental stress, contamination etc, particularly if they belong to similar gene families.\n",
"\n",
"BCV is the square root of dispersion. The distance between tagwise and common bcv trend indicate if tagwise (gene-wise) dispersion estimates are highly variable (i.e. heterogenous gene expression). If tag-wise dispersion values are very heterogenous, less moderation is applied to capture heterogenity. The distance between any two points on those curves reflects how much the dispersion for a gene was shrunken towards the common trend. That is, it captures the amount of dispersion moderation applied. \n",
"BCV is the square root of dispersion. The distance between tagwise and common bcv trend indicates if tagwise (gene-wise) dispersion estimates are highly variable (i.e. heterogenous gene expression). If tag-wise dispersion values are very heterogeneous, less moderation is applied to capture heterogeneity. The distance between any two points on those curves reflects how much the dispersion for a gene was shrunken towards the common trend. That is, it captures the amount of dispersion moderation applied. \n",
"\n",
"In the BCV plot below, we see some low abundance genes with high bcv, but no high abundance genes with high bcv, which indicates that there should not be any further experimental considerations modelled into the design matrix."
]
Expand Down Expand Up @@ -1674,7 +1674,7 @@
"### Notes on edgeR\n",
"- Requires raw counts as input\n",
"- Requires pseudobulks from a single-cell experiment\n",
"- If there are several donors in the single-cell experiment and the user wants to account for the patient varianility, we recommend creating 2 or 3 pseudo-replicates for each patient and including patient information into the design matrix"
"- If there are several donors in the single-cell experiment and the user wants to account for the patient variability, we recommend creating 2 or 3 pseudo-replicates for each patient and including patient information into the design matrix"
]
},
{
Expand All @@ -1694,8 +1694,6 @@
"id": "a37c411f"
},
"source": [
"MAST models single-cell gene expression with a two-part generalized linear model. One part models the discrete expression rate of each gene across cells, whereas the other part models the conditional continuous expression level.\n",
"\n",
"The MAST framework models single-cell gene expression using a two-part generalized linear model. One component of MAST models the discrete expression rate of each gene across cells, while the other component models the conditional continuous expression level (conditional on the gene being expressed).\n",
"For more details please read the original publication{cite}`Finak2015`."
]
Expand Down Expand Up @@ -1884,7 +1882,7 @@
"As is the case whenever multiple statistical tests are performed, the obtained p-values for DGE tests over conditions must be corrected for multiple testing using, for example, a Benjamini-Hochberg correction{cite}`de:Lücken2019`,{cite}`Benjamini1995`.\n",
"\n",
"Similarly to edgeR analysis, we define a separate function that we use for the analysis:\n",
"- `find_de_MAST_RE` takes a SingleCellExperiment object as input and runs MAST with RE pipeline. The output of the function is table (pandas DataFrame in Python) which contains results of the analysis, e.g. log-fold change, p-value and FDR-corrected value for each gene."
"- `find_de_MAST_RE` takes a SingleCellExperiment object as input and runs MAST with RE pipeline. The output of the function is a table (pandas DataFrame in Python) which contains results of the analysis, e.g. log-fold change, p-value and FDR-corrected value for each gene."
]
},
{
Expand Down Expand Up @@ -2160,7 +2158,7 @@
"id": "1e737b9d"
},
"source": [
"We can visualize the results with a heatmap and a volcano plot. Each row of a heatmap corresponds to a gene and each column to a single-cell. The brighter the color is the higher is the expression of that gene in a particular cell. Since we only plot DE genes, we would like to see clear differences in expression between the two conditions. Volcano plots are often used to visualize results of statistical testing, and they show the change in expression on the x-axis (log-fold change) and statistical significance on the y-axis (FDR-corrected p-values). We color code the genes that have FDR-corrected p-value under 0.01 and log-fold change of over 1.5"
"We can visualize the results with a heatmap and a volcano plot. Each row of a heatmap corresponds to a gene and each column to a single-cell. The brighter the color the higher the expression of that gene in a particular cell. Since we only plot DE genes, we would like to see clear differences in expression between the two conditions. Volcano plots are often used to visualize the results of statistical testing, and they show the change in expression on the x-axis (log-fold change) and statistical significance on the y-axis (FDR-corrected p-values). We color code the genes that have FDR-corrected p-value under 0.01 and log-fold change of over 1.5"
]
},
{
Expand Down Expand Up @@ -2306,7 +2304,7 @@
"id": "7c35bfc0"
},
"source": [
"We observe that MAST identified 436 DEG with our given cut-offs for adjusted p-values and logfold change, while edgeR indentified 303 genes."
"We observe that MAST identified 436 DEG with our given cut-offs for adjusted p-values and logfold change, while edgeR identified 303 genes."
]
},
{
Expand Down
Loading