deploy: a2f3eb8

theislab · Sep 13, 2023 · 45b773b
1 parent 608e050
commit 45b773b
Showing 80 changed files with 2,068 additions and 2,081 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: e2709f8fcb02cc6e5ab48cf6d4634e14
+config: eac05c3289b1c5d246409fccd952babd
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/_images/integration_103_1.png b/_images/integration_102_1.png
diff --git a/_images/integration_119_0.png b/_images/integration_118_0.png
diff --git a/_images/integration_123_1.png b/_images/integration_122_1.png
diff --git a/_images/integration_30_1.png b/_images/integration_29_1.png
diff --git a/_images/integration_36_1.png b/_images/integration_35_1.png
diff --git a/_images/integration_65_1.png b/_images/integration_64_1.png
diff --git a/_images/integration_73_1.png b/_images/integration_72_1.png
diff --git a/_images/integration_83_1.png b/_images/integration_82_1.png
diff --git a/_sources/cellular_structure/annotation.ipynb b/_sources/cellular_structure/annotation.ipynb
@@ -144,7 +144,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Load data:"
+    "## Load data"
    ]
   },
   {
@@ -329,7 +329,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To start we store our raw counts in .layers['counts'], so that we will still have access to them later if needed. We then set our adata.X to the SCRAN-normalized, log-transformed counts."
+    "To start we store our raw counts in `.layers['counts']`, so that we will still have access to them later if needed. We then set our `adata.X` to the scran-normalized, log-transformed counts."
    ]
   },
   {
@@ -1179,7 +1179,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Classifiers based on a wider set of genes. "
+    "### Classifiers based on a wider set of genes"
    ]
   },
   {
@@ -1794,7 +1794,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Annotation by mapping to a reference."
+    "### Annotation by mapping to a reference"
    ]
   },
   {
@@ -2349,15 +2349,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As you can see it has only 10 dimensions (in .X) which together represent the latent space embedding of the reference cells. Our query embedding that we calculated for our own data also has 10 dimensions. The 10 dimensions of the reference and query are the same and can be combined!<br>\n",
-    "Moreover, it has cell type labels in .obs['cell_type']. We will use these labels to annotate our own data."
+    "As you can see it has only 10 dimensions (in `.X`) which together represent the latent space embedding of the reference cells. Our query embedding that we calculated for our own data also has 10 dimensions. The 10 dimensions of the reference and query are the same and can be combined!<br>\n",
+    "Moreover, it has cell type labels in `.obs['cell_type']`. We will use these labels to annotate our own data."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To perform the label transfer, we will first concatenate the reference and query data using the 10-dimensional embedding. To get there, we will create the same type of AnnData object from our query data as we have from the reference (with the embedding under .X) and concatenate the two. With that, we can jointly analyze reference and query including doing transfer from one to the other."
+    "To perform the label transfer, we will first concatenate the reference and query data using the 10-dimensional embedding. To get there, we will create the same type of AnnData object from our query data as we have from the reference (with the embedding under `.X`) and concatenate the two. With that, we can jointly analyze reference and query including doing transfer from one to the other."
    ]
   },
   {
@@ -2528,7 +2528,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's perform the knn-based label transfer. "
+    "Let's perform the KNN-based label transfer. "
    ]
   },
   {
@@ -2972,7 +2972,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The uncertainty not only helps us identify regions where the algorithm is uncertain about which cell type a cell belongs to (e.g. because it falls in between two annotated phenotypes), but can also highlight unseen cell types or new cell states. For example, your reference might consist of heathly cells while your query could be from a diseased sample. The uncertainty score can then highlight disease-specific cell states, as they migh not have neighbors from the reference that consistently come from a single cell type. Especially when your reference is based on a large set of datasets, the uncertainty score is useful to flag parts of the query data that could be interesting to look into. Reference-based label transfer thus not only helps you annotate your data, but can also speed up exploration and interpretation of your data. However, like any metric, these uncertainty scores are often not perfect and in some cases fail to highlight new cell types or states. For a more extensive discussion of uncertainty metrics, see e.g. {cite}`anno:Engelmann2019`."
+    "The uncertainty not only helps us identify regions where the algorithm is uncertain about which cell type a cell belongs to (e.g. because it falls in between two annotated phenotypes), but can also highlight unseen cell types or new cell states. For example, your reference might consist of healthy cells while your query could be from a diseased sample. The uncertainty score can then highlight disease-specific cell states, as they might not have neighbors from the reference that consistently come from a single cell type. Especially when your reference is based on a large set of datasets, the uncertainty score is useful to flag parts of the query data that could be interesting to look into. Reference-based label transfer thus not only helps you annotate your data, but can also speed up exploration and interpretation of your data. However, like any metric, these uncertainty scores are often not perfect and in some cases fail to highlight new cell types or states. For a more extensive discussion of uncertainty metrics, see e.g. {cite}`anno:Engelmann2019`."
    ]
   },
   {

diff --git a/_sources/cellular_structure/clustering.ipynb b/_sources/cellular_structure/clustering.ipynb
diff --git a/_sources/cellular_structure/integration.ipynb b/_sources/cellular_structure/integration.ipynb
@@ -71,7 +71,7 @@
    "source": [
     "### Batch removal complexity\n",
     "\n",
-    "The removal of batch effects in scRNA-seq data has previously been divided into two subtasks: batch correction and data integration {cite}`Luecken2019-og`. These subtasks differ in the complexity of the batch effect that must be removed. Batch correction methods deal with batch effects between samples in the same experiment where cell identity compositions are consistent, and the effect is often quasi-linear. In contrast, data integration methods deal with complex, often nested, batch effects between datasets that may be generated with different protocols and where cell identities may not be shared across batches. While we use this distinction here we should not that these terms are often used interchangeably in general use. Given the differences in complexity, it is not surprising that different methods have been benchmarked as being optimal for these two subtasks."
+    "The removal of batch effects in scRNA-seq data has previously been divided into two subtasks: batch correction and data integration {cite}`Luecken2019-og`. These subtasks differ in the complexity of the batch effect that must be removed. Batch correction methods deal with batch effects between samples in the same experiment where cell identity compositions are consistent, and the effect is often quasi-linear. In contrast, data integration methods deal with complex, often nested, batch effects between datasets that may be generated with different protocols and where cell identities may not be shared across batches. While we use this distinction here we should note that these terms are often used interchangeably in general use. Given the differences in complexity, it is not surprising that different methods have been benchmarked as being optimal for these two subtasks."
    ]
   },
   {
@@ -572,14 +572,6 @@
     "```"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "8ee0a84f-eb5e-4e0e-b30d-799c0db6d809",
-   "metadata": {},
-   "source": [
-    "## Unintegrated data"
-   ]
-  },
   {
    "cell_type": "markdown",
    "id": "e0528714-7701-4548-b1a7-c5d27bdde00f",
@@ -2234,7 +2226,7 @@
    "id": "a1a51a52",
    "metadata": {},
    "source": [
-    "The prepared AnnDAta is now available in R as a SingleCellExperiment object thanks to **anndata2ri**. Note that this is transposed compared to an AnnData object so our observations (cells) are now the columns and our variables (genes) are now the rows."
+    "The prepared AnnData is now available in R as a SingleCellExperiment object thanks to **anndata2ri**. Note that this is transposed compared to an AnnData object so our observations (cells) are now the columns and our variables (genes) are now the rows."
    ]
   },
   {

diff --git a/_sources/conditions/compositional.ipynb b/_sources/conditions/compositional.ipynb
@@ -662,7 +662,7 @@
     "\n",
     "If we ignore the compositionality of the data, and use univariate methods like Wilcoxon rank-sum tests or scDC, a method which performs differential cell-type composition analysis by bootstrap resampling{cite}`comp:Cao2019`, we may falsely perceive cell-type population shifts as statistically sound effects, although they were induced by inherent negative correlations of the cell-type proportions.\n",
     "\n",
-    "Furthermore, the subsampled data does not only give us one valid solution to our question. If both cell types B and C decreased by 1,000 cells in the diseased case, we would obtain the same representative samples of 600 cells as above. To get a unique result, we can fix a reference point for the data, which is assumed to be unchanged throughout all samples{cite}`Brill2019`. This can be a single cell type, an aggregation over multiple cell types such as the geometric mean, or a set of orthogonal bases{cite}Egozcue2003.\n",
+    "Furthermore, the subsampled data does not only give us one valid solution to our question. If both cell types B and C decreased by 1,000 cells in the diseased case, we would obtain the same representative samples of 600 cells as above. To get a unique result, we can fix a reference point for the data, which is assumed to be unchanged throughout all samples{cite}`Brill2019`. This can be a single cell type, an aggregation over multiple cell types such as the geometric mean, or a set of orthogonal bases {cite}`Egozcue2003`.\n",
     "\n",
     "While single-cell datasets of sufficient size and replicate number have only been around for a few years, the same statistical property has also been discussed in the context of microbial analysis{cite}`Gloor2017`. There, some popular approaches include ANCOM-BC {cite}`Lin2020` and ALDEx2 {cite}`Fernandes2014`. However, these approaches often struggle with single-cell datasets due to the small number of experimental replicates."
    ]
@@ -1907,7 +1907,7 @@
     }
    },
    "source": [
-    "The model setup and execution in tascCODA works analogous to scCODA, and also the free parameters for the reference and the formula are the same. Additionally, we can adjust the tree aggregation and model selection via the parameters `phi` and `lambda_1` in the `pen_args` argument (see {cite}Ostner2021 for more information). Here, we use an unbiased setting `phi=0` and a model selection that is slightly less strict than the default with `lambda_1=1.7`. We use cluster 18 as our reference, since it is almost identical to the set of Endocrine cells."
+    "The model setup and execution in tascCODA works analogous to scCODA, and also the free parameters for the reference and the formula are the same. Additionally, we can adjust the tree aggregation and model selection via the parameters `phi` and `lambda_1` in the `pen_args` argument (see {cite}`Ostner2021` for more information). Here, we use an unbiased setting `phi=0` and a model selection that is slightly less strict than the default with `lambda_1=1.7`. We use cluster 18 as our reference, since it is almost identical to the set of Endocrine cells."
    ]
   },
   {
@@ -3047,7 +3047,7 @@
     "A set of methods exist to detect compositional changes occuring in subpopulations of cells smaller than the cell type clusters, usually defined starting from a k-nearest neighbor (KNN) graph computed from similarities in the same low dimensional space used for clustering. \n",
     "\n",
     "- DA-seq computes, for each cell, a score based on the relative prevalence of cells from both biological states in the cell’s neighborhood, using a range of k values{cite}`Zhao2021`. The scores are used as input for a logistic classifier to predict the biological condition of each cell. \n",
-    "- Milo assigns cells to partially overlapping neighborhoods on the KNN graph, then differential abundance (DA) testing is performed modelling cell counts witgh a generalized linear model (GLM) {cite}`Dann2022`. \n",
+    "- Milo assigns cells to partially overlapping neighborhoods on the KNN graph, then differential abundance (DA) testing is performed modelling cell counts with a generalized linear model (GLM) {cite}`Dann2022`. \n",
     "- MELD calculates a relative likelihood estimate of observing each cell in every condition using graph-based density estimate{cite}`Burkhardt2021`. \n",
     "\n",
     "These methods have unique strenghts and weaknesses. Because it relies on logistic classification, DA-seq is designed for pairwise comparisons between two biological conditions, but can't be applied to test for differences associated with a continuous covariate (such as age or timepoints). DA-seq and Milo use the variance in the abundance statistic between replicate samples of the same condition to estimate the significance of the differential abundance, while MELD doesn't use this information. While considering consistency across replicates reduces the number of false positives driven by one or a few samples, all KNN-based methods are sensitive to a loss of information if the conditions of interest and confounders, defined by technical or experimental sources of variation, are strongly correlated. The impact of confounders can be mitigated using batch integration methods before KNN graph construction and/or incorporating the confounding covariates in the model for DA testing, as we discuss further in the example below. Another limitation of KNN-based methods to bare in mind is that cells in a neighborhood may not necessarily represent a specific, unique biological subpopulation, because a cellular state may span over multiple neighborhoods. Reducing k for the KNN graph or constructing a graph on cells from a particular lineage of interest can help mitigate this issue and ensure the predicted effects are robust to the choice of parameters and to the data subset used{cite}`Dann2022`. \n",
@@ -3391,7 +3391,7 @@
     }
    },
    "source": [
-    "At this point we need to check the median number of cells in each neighbourhood, to make sure the neighbourhoods contain enough cells to detect differences between samples"
+    "At this point we need to check the median number of cells in each neighbourhood, to make sure the neighbourhoods contain enough cells to detect differences between samples."
    ]
   },
   {
@@ -3621,7 +3621,7 @@
     }
    },
    "source": [
-    "This stores a neighbourhood-level AnnData object, where `nhood_adata.X` stores the number of cells from each sample in each neighbourhood"
+    "This stores a neighbourhood-level AnnData object, where `nhood_adata.X` stores the number of cells from each sample in each neighbourhood."
    ]
   },
   {
@@ -3708,7 +3708,7 @@
     }
    },
    "source": [
-    "### Run differential abundance test on neighbourhoods"
+    "### Run differential abundance test on neighbourhoods"
    ]
   },
   {
@@ -3859,7 +3859,7 @@
    "source": [
     "For each neighbourhood, we calculate a set of statistics. The most important ones to understand are:\n",
     "- **log-Fold Change (logFC):** this represents the effect size of the difference in cell abundance and corresponds to the coefficient associated with the condition of interest in the GLM. If logFC > 0 the neighbourhood is enriched in cells from the condition of interest, if logFC < 0 the neighbourhood is depleted in cells from the condition of interest. \n",
-    "- **Uncorrected p-value (PValue):** this is the p-value for the QLF test before multiple testing correction\n",
+    "- **Uncorrected p-value (PValue):** this is the p-value for the QLF test before multiple testing correction.\n",
     "- **SpatialFDR:** this is the p-value adjusted for multiple testing to limit the false discovery rate. This is calculated adapting the weighted Benjamini-Hochberg (BH) correction introduced by Lun et al {cite}`lun2017`, which accounts for the fact that because neighbourhoods are partially overlapping (i.e. one cell can belong to multiple neighbourhoods) the DA tests on different neighbourhoods are not completely independent. In practice, the BH correction is weighted by the reciprocal of the distance to the k-th nearest neighbor to the index cell (stored in `kth_distance`), which is used as a proxy for the amount of overlap with other neighbourhoods. You might notice that the SpatialFDR values are always lower or equal to the FDR values, calculated with a conventional BH correction."
    ]
   },
@@ -3967,7 +3967,7 @@
    },
    "source": [
     "1. The **P-value histogram** shows the distribution of P-values before multiple testing correction. By definition, we expect the p-values under the null hypothesis (> significance level) to be uniformly distributed, while the peak of p-values close to zero represents the significant results. This gives you an idea of how conservative your test is, and it might help to spot early some pathological cases. For example, if the distribution of P-values looks bimodal, with a second peak close to 1, this might indicate that you have a large number of neighbourhoods with no variance between replicates of one condition (e.g. all replicates from one condition have 0 cells) which might indicate a residual batch effect or that you need to increase the size of neighbourhoods; if the p-value histogram is left-skewed this might indicate a [confounding covariate](https://github.com/MarioniLab/miloR/issues/220#issuecomment-1140812805) has not been accounted for in the model. For other pathological cases and possible interpretations see [this blogpost](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/).\n",
-    "2. For each neighbourhood we plot the uncorrected P-Value VS the p-value controlling for the Spatial FDR. Here we expect the adjusted p-values to be larger (so points above the diagonal). If the FDR correction is especially severe (i.e. many values close to 1) this might indicate a pathological case. You might be testing on too many neighbourhoods (you can reduce `prop` in `milo.make_nhoods`) or there might be too much overlap between neighbourhoods (you might need to decrease _k_ when constructing the KNN graph)\n",
+    "2. For each neighbourhood we plot the uncorrected P-Value VS the p-value controlling for the Spatial FDR. Here we expect the adjusted p-values to be larger (so points above the diagonal). If the FDR correction is especially severe (i.e. many values close to 1) this might indicate a pathological case. You might be testing on too many neighbourhoods (you can reduce `prop` in `milo.make_nhoods`) or there might be too much overlap between neighbourhoods (you might need to decrease _k_ when constructing the KNN graph).\n",
     "3. The **volcano plot** gives us an idea of how many neighbourhoods show significant DA after multiple testing correction ( - log(SpatialFDR) > 1) and shows how many neighbourhoods are enriched or depleted of cells from the condition of interest.  \n",
     "4. The **MA plot** shows the dependency between average number of cells per sample and the log-Fold Change of the test. In a balanced scenario, we expect points to be concentrated around logFC = 0, otherwise the shift might indicate a strong imbalance in average number of cells between samples from different conditions. For more tips on how to interpret the MA plot see https://github.com/MarioniLab/miloR/issues/208."
    ]