diff --git a/jupyter-book/cellular_structure/annotation.ipynb b/jupyter-book/cellular_structure/annotation.ipynb index 97d65f8a..c6dda883 100644 --- a/jupyter-book/cellular_structure/annotation.ipynb +++ b/jupyter-book/cellular_structure/annotation.ipynb @@ -406,7 +406,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, or meaning that this cell type is expected not to or to lowly express those markers." + "Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, meaning that this cell type is expected to show low or no expression of those markers." ] }, { @@ -1146,7 +1146,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and and easy way to annotate your data. The annotation does not require the downloading nor preprocessing of the training data and sometimes merely involves the upload of your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. " + "The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and easy way to annotate your data. The annotation does not require downloading or preprocessing the training data and sometimes merely involves uploading your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high-quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. 
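The next hunk replaces the older `sc.pp.normalize_per_cell` call with `sc.pp.normalize_total` in the CellTypist preprocessing. As a rough, hedged sketch of how such a pre-trained classifier might then be applied (the model name, the `adata_celltypist` object and the printed summary are illustrative assumptions, not the notebook's exact code):

```python
import celltypist
from celltypist import models

# Download and load a pre-trained immune reference model
# (the model name is an assumption for this sketch).
models.download_models(model="Immune_All_Low.pkl")
model = models.Model.load(model="Immune_All_Low.pkl")

# adata_celltypist is assumed to be normalized to 10,000 counts per cell
# and log1p-transformed, as prepared in the hunk below.
predictions = celltypist.annotate(
    adata_celltypist, model=model, majority_voting=True
)

# Transfer the predicted labels back into an AnnData object and inspect them.
adata_annotated = predictions.to_adata()
print(adata_annotated.obs["majority_voting"].value_counts())
```

Setting `majority_voting=True` refines the per-cell predictions over subclusters of the data, which tends to produce smoother labels than purely per-cell assignment.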
" ] }, { @@ -1200,8 +1200,8 @@ "source": [ "adata_celltypist = adata.copy() # make a copy of our adata\n", "adata_celltypist.X = adata.layers[\"counts\"] # set adata.X to raw counts\n", - "sc.pp.normalize_per_cell(\n", - " adata_celltypist, counts_per_cell_after=10**4\n", + "sc.pp.normalize_total(\n", + " adata_celltypist, target_sum=10**4\n", ") # normalize to 10,000 counts per cell\n", "sc.pp.log1p(adata_celltypist) # log-transform\n", "# make .X dense instead of sparse, for compatibility with celltypist:\n", @@ -1610,7 +1610,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotation are wrong." + "This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotations are wrong." ] }, { diff --git a/jupyter-book/cellular_structure/clustering.ipynb b/jupyter-book/cellular_structure/clustering.ipynb index cbdc3105..1de147ad 100644 --- a/jupyter-book/cellular_structure/clustering.ipynb +++ b/jupyter-book/cellular_structure/clustering.ipynb @@ -25,11 +25,11 @@ "Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n", "\n", "In scRNA-seq data analysis, we describe cellular structure in our dataset with finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. \n", - "We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principle-component analysis and the similarity scoring is then based on Euclidean distances. \n", + "We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. \n", "\n", - "In the KNN graph consists of nodes reflecting the cells in the dataset. We first calculate a Euclidean distance matrix on the PC-reduced expression space for all cells and then connect each cell to its K most similar cells. Usually, K is set to values between 5 and 100 depending on the size of the dataset. 
The KNN graph reflects the underlying topology of the expression data by representing dense regions with respect to expression space also as densely connected regions in the graph {cite}`wolf_paga_2019`. Dense regions in the KNN-graph are detected by community detection methods like Leiden and Louvain{cite}`blondel_fast_2008`. \n", + "The KNN graph consists of nodes reflecting the cells in the dataset. We first calculate a Euclidean distance matrix on the PC-reduced expression space for all cells and then connect each cell to its K most similar cells. Usually, K is set to values between 5 and 100 depending on the size of the dataset. The KNN graph reflects the underlying topology of the expression data by representing dense regions with respect to expression space also as densely connected regions in the graph {cite}`wolf_paga_2019`. Dense regions in the KNN-graph are detected by community detection methods like Leiden and Louvain{cite}`blondel_fast_2008`. \n", "\n", - "The Leiden algorithm is as an improved version of the Louvain algorithm which outperformed other clustering methods for single-cell RNA-seq data analysis ({cite}`du_systematic_2018, freytag_comparison_2018, weber_comparison_2016`). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred. \n", + "The Leiden algorithm is an improved version of the Louvain algorithm which outperformed other clustering methods for single-cell RNA-seq data analysis ({cite}`du_systematic_2018, freytag_comparison_2018, weber_comparison_2016`). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred. \n", "\n", "We, therefore, propose to use the Leiden algorithm{cite}`traag_louvain_2019` on single-cell k-nearest-neighbour (KNN) graphs to cluster single-cell datasets. \n", "\n", @@ -39,7 +39,7 @@ "\n", "\"Clustering\n", "\n", - "The Leiden algorithm computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node from its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. \n", + "The Leiden algorithm computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. \n", "\n", ":::\n", "\n", diff --git a/jupyter-book/cellular_structure/integration.ipynb b/jupyter-book/cellular_structure/integration.ipynb index 76053885..91983378 100644 --- a/jupyter-book/cellular_structure/integration.ipynb +++ b/jupyter-book/cellular_structure/integration.ipynb @@ -2175,7 +2175,7 @@ "id": "f7c4a891", "metadata": {}, "source": [ - "This integration is also improved compared to the unintegrated data with cell identities grouped together but we sill see some shifts between batches." + "This integration is also improved compared to the unintegrated data with cell identities grouped together but we still see some shifts between batches." 
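The clustering hunks above describe deriving clusters by building a KNN graph on the PC-reduced expression space and running the Leiden algorithm; a minimal scanpy sketch of that workflow follows. The `adata` object and the parameter values are assumptions for illustration, not code from the notebooks.

```python
import scanpy as sc

# adata is assumed to hold normalized, log-transformed expression values.
sc.pp.pca(adata, n_comps=50)            # PC-reduced expression space
sc.pp.neighbors(adata, n_neighbors=15)  # KNN graph; K is usually 5-100
sc.tl.leiden(adata, resolution=1.0, key_added="leiden_res1_0")

# Inspect how many cells fall into each cluster.
print(adata.obs["leiden_res1_0"].value_counts())
```

Higher `resolution` values yield more and smaller clusters, which is often useful when sub-clustering for annotation.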
] }, { diff --git a/jupyter-book/introduction/analysis_tools.ipynb b/jupyter-book/introduction/analysis_tools.ipynb index d23cc7a5..6648b4a8 100644 --- a/jupyter-book/introduction/analysis_tools.ipynb +++ b/jupyter-book/introduction/analysis_tools.ipynb @@ -2687,7 +2687,7 @@ "source": [ "We can write a simple function to run principal component analysis on such a concatenated matrix. MuData object provides a place to store multimodal embeddings: MuData`.obsm`. It is similar to how the embeddings generated on individual modalities are stored, only this time it is saved inside the MuData object rather than in AnnData`.obsm`.\n", "\n", - "To calculate for example a principal component analysis (PCA) for the joint values of the modalities, we horizontally stack the values stored in the individual modalities and then perform the PCA on the stacked matrix. This is possible because the number of observations matches across modalities (remember, the number of features does per modality does not have to match)." + "To calculate for example a principal component analysis (PCA) for the joint values of the modalities, we horizontally stack the values stored in the individual modalities and then perform the PCA on the stacked matrix. This is possible because the number of observations matches across modalities (remember, the number of features per modality does not have to match)." ] }, { @@ -2742,7 +2742,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable. This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently addressed as multi-omics integration methods. In the following section we will introduce muon which provides many tools to preprocess unimodal data besides beyond RNA-Seq and multi-omics integration methods." + "In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable. This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently addressed as multi-omics integration methods. In the following section we will introduce muon which provides many tools to preprocess unimodal data beyond RNA-Seq and multi-omics integration methods." ] }, { diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index e860a16a..a5dc6aa5 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -32,7 +32,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. 
This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and can use the platform the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", + "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and can use the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", "\n", "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, lots of work has been done in this area and it is now relatively simple in most cases using standard packages. 
In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages." ] @@ -85,7 +85,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [NextFlow](https://www.nextflow.io/index.html) or [snakemake](https://snakemake.readthedocs.io/en/stable/)). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages." + "The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [Nextflow](https://www.nextflow.io/index.html) or [snakemake](https://snakemake.readthedocs.io/en/stable/)). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages." ] }, { @@ -213,7 +213,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [Bioconductor **{zellkonverter}** package](https://bioconductor.org/packages/zellkonverter/) helps makes this easier by using the [**{basilisk}** package](https://bioconductor.org/packages/basilisk/) to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python." + "The [Bioconductor **{zellkonverter}** package](https://bioconductor.org/packages/zellkonverter/) helps make this easier by using the [**{basilisk}** package](https://bioconductor.org/packages/basilisk/) to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python." 
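The interoperability hunks above revolve around exchanging H5AD files on disk. As a hedged illustration of the Python side of such an exchange (the file name and toy data are made up), writing an AnnData object to H5AD produces a file that R users can then load with zellkonverter's `readH5AD()` as described above:

```python
import anndata as ad
import numpy as np

# A small toy AnnData object, purely for illustration.
adata = ad.AnnData(X=np.random.poisson(1.0, size=(100, 50)).astype(np.float32))

# Write the object to disk as H5AD ...
adata.write_h5ad("example.h5ad")

# ... and read it back; in R the same file could be loaded with
# zellkonverter::readH5AD("example.h5ad").
adata_from_disk = ad.read_h5ad("example.h5ad")
print(adata_from_disk)
```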
] }, { diff --git a/jupyter-book/preprocessing_visualization/dimensionality_reduction.ipynb b/jupyter-book/preprocessing_visualization/dimensionality_reduction.ipynb index 470f64e6..904d7d0a 100644 --- a/jupyter-book/preprocessing_visualization/dimensionality_reduction.ipynb +++ b/jupyter-book/preprocessing_visualization/dimensionality_reduction.ipynb @@ -11,7 +11,7 @@ "As previously mentioned, scRNA-seq is a high-throughput sequencing technology that produces datasets with high dimensions in the number of cells and genes. This immediately points to the fact that scRNA-seq data suffers from the 'curse of dimensionality'. \n", "\n", "```{admonition} Curse of dimensionality\n", - "The Curse of dimensionality was first brought up by R. Bellman {cite}`bellman1957dynamic` and descibes the problem that in theory high-dimensional data contains more information, but in practice this is not the case. Higher dimensional data often contains more noise and redundancy and therefore adding more information does not provide benefits for downstream analysis steps. \n", + "The Curse of dimensionality was first brought up by R. Bellman {cite}`bellman1957dynamic` and describes the problem that in theory high-dimensional data contains more information, but in practice this is not the case. Higher dimensional data often contains more noise and redundancy and therefore adding more information does not provide benefits for downstream analysis steps. \n", "```\n", "\n", "Not all genes are informative and are important for the task of cell type clustering based on their expression profiles. We already aimed to reduce the dimensionality of the data with feature selection, as a next step one can further reduce the dimensions of single-cell RNA-seq data with dimensionality reduction algorithms. These algorithms are an important step during preprocessing to reduce the data complexity and for visualization. Several dimensionality reduction techniques have been developed and used for single-cell data analysis.\n", @@ -88,7 +88,7 @@ "\n", "## PCA\n", "\n", - "In our dataset each cell is a vector of a `n_var`-dimensional vector space spanned by some orthonormal basis. As scRNA-seq suffers from the 'curse of dimensionality', we know that not all features are important to understand the underlying dynamics of the dataset and that there is an inherent redundancy{cite}`grun2014validation`. PCA creates a new set of uncorrelated variables, so called principle components (PCs), via an orthogonal transformation of the original dataset. The PCs are linear combinations of features in the original dataset and are ranked with decreasing order of variance to define the transformation. Through the ranking usually the first PC amounts to the largest possible variance. PCs with the lowest variance are discarded to effectively reduce the dimensionality of the data without losing information.\n", + "In our dataset each cell is a vector of a `n_var`-dimensional vector space spanned by some orthonormal basis. As scRNA-seq suffers from the 'curse of dimensionality', we know that not all features are important to understand the underlying dynamics of the dataset and that there is an inherent redundancy{cite}`grun2014validation`. PCA creates a new set of uncorrelated variables, so called principal components (PCs), via an orthogonal transformation of the original dataset. The PCs are linear combinations of features in the original dataset and are ranked with decreasing order of variance to define the transformation. 
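A short scanpy sketch of the PCA step described in this hunk, showing how the components and their explained variance might be computed and inspected; the `adata` object and the parameter choices are assumptions for illustration.

```python
import scanpy as sc

# adata is assumed to contain log-normalized expression values with
# highly variable genes already flagged.
sc.pp.pca(adata, n_comps=50, svd_solver="arpack")

# Variance explained per PC, in decreasing order.
print(adata.uns["pca"]["variance_ratio"][:10])

# Elbow plot to decide how many PCs to keep for downstream steps.
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
```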
Through this ranking, the first PC usually captures the largest possible variance. PCs with the lowest variance are discarded to effectively reduce the dimensionality of the data without losing information.\n", "\n", "PCA offers the advantage that it is highly interpretable and computationally efficient. However, as scRNA-seq datasets are rather sparse due to dropout events and therefore highly non-linear, visualization with the linear dimensionality reduction technique PCA is not very appropriate. PCA is typically used to select the top 10-50 PCs which are used for downstream analysis tasks.\n" ] }, { diff --git a/jupyter-book/preprocessing_visualization/feature_selection.ipynb b/jupyter-book/preprocessing_visualization/feature_selection.ipynb index e347a6af..064fc958 100644 --- a/jupyter-book/preprocessing_visualization/feature_selection.ipynb +++ b/jupyter-book/preprocessing_visualization/feature_selection.ipynb @@ -212,7 +212,7 @@ "id": "37330365-94b4-44b6-92a7-09839228fbc9", "metadata": {}, "source": [ - "As a next step, we now sort the vector an select the top 4,000 highly deviant genes and save them as an additional column in `.var` as 'highly_deviant'. We additionally save the computed binomial deviance in case we want to sub-select a different number of highly variable genes afterwards. " + "As a next step, we now sort the vector and select the top 4,000 highly deviant genes and save them as an additional column in `.var` as 'highly_deviant'. We additionally save the computed binomial deviance in case we want to sub-select a different number of highly variable genes afterwards. " ] }, { diff --git a/jupyter-book/preprocessing_visualization/normalization.ipynb b/jupyter-book/preprocessing_visualization/normalization.ipynb index cde82306..87e7191a 100644 --- a/jupyter-book/preprocessing_visualization/normalization.ipynb +++ b/jupyter-book/preprocessing_visualization/normalization.ipynb @@ -141,7 +141,7 @@ "Overdispersion describes the presence of a greater variability in the dataset than one would expect.\n", "```\n", "\n", - "The shifted logarithm is a fast normalization technique, outperforms other methods for uncovering the latent structure of the dataset (especially when followed by principal component analysis) and works beneficial for stabilizing variance for subsequent dimensionality reduction and identification of differentially expressed genes. We will now inspect how to apply this normalization method to our dataset. The shifted logarithm can be conveniently called with scanpy by running `pp.normalized_total` with `target_sum=None`. We are setting the `inplace` parameter to `False` as we want to explore three different normalization techniques in this tutorial. The second step now uses the scaled counts and we obtained the first normalized count matrix.\n" + "The shifted logarithm is a fast normalization technique that outperforms other methods for uncovering the latent structure of the dataset (especially when followed by principal component analysis) and is beneficial for stabilizing variance for subsequent dimensionality reduction and identification of differentially expressed genes. We will now inspect how to apply this normalization method to our dataset. The shifted logarithm can be conveniently called with scanpy by running `pp.normalize_total` with `target_sum=None`. We are setting the `inplace` parameter to `False` as we want to explore three different normalization techniques in this tutorial. 
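The corrected call described above (`pp.normalize_total` with `target_sum=None` and `inplace=False`, followed by a log transform) might look like the sketch below; the starting `adata` with raw counts and the layer name are assumptions for illustration rather than the notebook's exact code.

```python
import scanpy as sc

# adata.X is assumed to hold raw counts.
# Step 1: depth scaling; target_sum=None scales to the median count depth,
# and inplace=False returns the scaled matrix instead of overwriting adata.X.
scaled = sc.pp.normalize_total(adata, target_sum=None, inplace=False)

# Step 2: shifted logarithm of the scaled counts, kept in a separate layer
# so the raw counts stay untouched.
adata.layers["log1p_norm"] = sc.pp.log1p(scaled["X"], copy=True)
```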
The second step then applies the shifted logarithm to the scaled counts, giving us the first normalized count matrix.\n" ] }, { @@ -380,7 +380,7 @@ "\n", "The third normalization technique we are introducing in this chapter is the analytic approximation of Pearson residuals. This normalization technique was motivated by the observation that cell-to-cell variation in scRNA-seq data might be confounded by biological heterogeneity with technical effects. The method utilizes Pearson residuals from 'regularized negative binomial regression' to calculate a model of technical noise in the data. It explicitly adds the count depth as a covariate in a generalized linear model. {cite}`norm:germain_pipecomp_2020` showed in an independent comparison of different normalization techniques that this method removed the impact of sampling effects while preserving cell heterogeneity in the dataset. Notably, analytic Pearson residuals do not require downstream heuristic steps like pseudo count addition or log-transformation.\n", "​\n", - "The output of this method are normalized values that can be positive or negative. Negative residuals for a cell and gene indicate that less counts are observed than expected compared to the gene's average expression and cellular sequencing depth. Positive residuals indicate the more counts respectively. Analytic Pearon residuals are implemented in scanpy and can directly be calculated on the raw count matrix.\n" + "The output of this method consists of normalized values that can be positive or negative. Negative residuals for a cell and gene indicate that fewer counts are observed than expected given the gene's average expression and the cell's sequencing depth. Positive residuals indicate that more counts are observed than expected. Analytic Pearson residuals are implemented in scanpy and can directly be calculated on the raw count matrix.\n" ] }, { diff --git a/jupyter-book/preprocessing_visualization/quality_control.ipynb b/jupyter-book/preprocessing_visualization/quality_control.ipynb index ee9e1946..5caa5c4a 100644 --- a/jupyter-book/preprocessing_visualization/quality_control.ipynb +++ b/jupyter-book/preprocessing_visualization/quality_control.ipynb @@ -801,7 +801,7 @@ "id": "16ed4680-e019-400d-8c39-7e8fc7937f31", "metadata": {}, "source": [ - "We can now launch the doublet detection by using data_mat as input to scDblFinder within a SingleCellExperiment. scBblFinder adds several columns to the colData of sce. Three of them might be interesting for the analysis:\n", + "We can now launch the doublet detection by using data_mat as input to scDblFinder within a SingleCellExperiment. scDblFinder adds several columns to the colData of sce. Three of them might be interesting for the analysis:\n", "\n", "* `sce$scDblFinder.score`: the final doublet score (the higher the more likely that the cell is a doublet)\n", "\n",