Fixed some typos #303

Merged
14 commits merged on Nov 4, 2024
10 changes: 5 additions & 5 deletions jupyter-book/cellular_structure/annotation.ipynb
@@ -406,7 +406,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, or meaning that this cell type is expected not to or to lowly express those markers."
"Now show expression of the markers using the calculated UMAP. We'll limit ourselves to B/plasma cell subtypes for this example. Note from the marker dictionary above that there are three negative markers in our list: IGHD and IGHM for B1 B, and PAX5 for plasmablasts, meaning that this cell type is expected not to or to lowly express those markers."
]
},
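(For context, a minimal scanpy sketch of what plotting marker expression on the computed UMAP typically looks like; the marker dictionary name and the subtype keys below are assumptions, not taken from this diff.)

import scanpy as sc

# Assumed names: `adata` and a marker dictionary keyed by cell-type label
b_plasma_subtypes = ["Naive CD20+ B", "B1 B", "Plasmablasts"]
for ct in b_plasma_subtypes:
    print(f"{ct}:")
    sc.pl.umap(
        adata,
        color=marker_genes_in_data[ct],  # assumed dict of marker genes per subtype
        vmin=0,
        vmax="p99",  # clip the colour scale at the 99th percentile
        sort_order=False,
        frameon=False,
    )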
{
@@ -1146,7 +1146,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and and easy way to annotate your data. The annotation does not require the downloading nor preprocessing of the training data and sometimes merely involves the upload of your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. "
"The aforementioned points highlight possible disadvantages of using classifiers, depending on the training data and model type. Nonetheless, there are several important advantages of using pre-trained classifiers to annotate your data. First, it is a fast and easy way to annotate your data. The annotation does not require the downloading nor preprocessing of the training data and sometimes merely involves the upload of your data to an online webpage. Second, these methods don't rely on a partitioning of your data into clusters, as the manual annotation does. Third, pre-trained classifiers enable you to directly leverage the knowledge and information from previous studies, such as a high quality annotation. And finally, using such classifiers can help with harmonizing cell-type definitions across a field, thereby clearing the path towards a field-wide consensus on these definitions. "
]
},
{
@@ -1200,8 +1200,8 @@
"source": [
"adata_celltypist = adata.copy() # make a copy of our adata\n",
"adata_celltypist.X = adata.layers[\"counts\"] # set adata.X to raw counts\n",
"sc.pp.normalize_per_cell(\n",
" adata_celltypist, counts_per_cell_after=10**4\n",
"sc.pp.normalize_total(\n",
" adata_celltypist, target_sum=10**4\n",
") # normalize to 10,000 counts per cell\n",
"sc.pp.log1p(adata_celltypist) # log-transform\n",
"# make .X dense instead of sparse, for compatibility with celltypist:\n",
@@ -1610,7 +1610,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotation are wrong."
"This dendrogram partly reflects prior knowledge on cell type relations (e.g. B cells largely clustering together), but we also observe some unexpected patterns: Tcm/Naive helper T cells cluster with erythroid cells and macrophages rather than with the other T cells. This is a red flag! Possibly, the Tcm/Naive helper T cell annotations are wrong."
]
},
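(A hedged sketch of how such a dendrogram is typically computed and plotted in scanpy; the grouping column name is an assumption.)

import scanpy as sc

# Assumed column: predicted cell-type labels stored in adata.obs
sc.tl.dendrogram(adata, groupby="celltypist_cell_label", use_rep="X_pca")
sc.pl.dendrogram(adata, groupby="celltypist_cell_label")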
{
8 changes: 4 additions & 4 deletions jupyter-book/cellular_structure/clustering.ipynb
@@ -25,11 +25,11 @@
"Preprocessing and visualization enabled us to describe our scRNA-seq dataset and reduce its dimensionality. Up to this point, we embedded and visualized cells to understand the underlying properties of our dataset. However, they are still rather abstractly defined. The next natural step in single-cell analysis is the identification of cellular structure in the dataset. \n",
"\n",
"In scRNA-seq data analysis, we describe cellular structure in our dataset with finding cell identities that relate to known cell states or cell cycle stages. This process is usually called cell identity annotation. For this purpose, we structure cells into clusters to infer the identity of similar cells. Clustering itself is a common unsupervised machine learning problem. \n",
"We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principle-component analysis and the similarity scoring is then based on Euclidean distances. \n",
"We can derive clusters by minimizing the intra-cluster distance in the reduced expression space. In this case, the expression space determines the gene expression similarity of cells with respect to a dimensionality-reduced representation. This lower dimensional representation is, for example, determined with a principal-component analysis and the similarity scoring is then based on Euclidean distances. \n",
"\n",
"In the KNN graph consists of nodes reflecting the cells in the dataset. We first calculate a Euclidean distance matrix on the PC-reduced expression space for all cells and then connect each cell to its K most similar cells. Usually, K is set to values between 5 and 100 depending on the size of the dataset. The KNN graph reflects the underlying topology of the expression data by representing dense regions with respect to expression space also as densely connected regions in the graph {cite}`wolf_paga_2019`. Dense regions in the KNN-graph are detected by community detection methods like Leiden and Louvain{cite}`blondel_fast_2008`. \n",
"The KNN graph consists of nodes reflecting the cells in the dataset. We first calculate a Euclidean distance matrix on the PC-reduced expression space for all cells and then connect each cell to its K most similar cells. Usually, K is set to values between 5 and 100 depending on the size of the dataset. The KNN graph reflects the underlying topology of the expression data by representing dense regions with respect to expression space also as densely connected regions in the graph {cite}`wolf_paga_2019`. Dense regions in the KNN-graph are detected by community detection methods like Leiden and Louvain{cite}`blondel_fast_2008`. \n",
"\n",
"The Leiden algorithm is as an improved version of the Louvain algorithm which outperformed other clustering methods for single-cell RNA-seq data analysis ({cite}`du_systematic_2018, freytag_comparison_2018, weber_comparison_2016`). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred. \n",
"The Leiden algorithm is an improved version of the Louvain algorithm which outperformed other clustering methods for single-cell RNA-seq data analysis ({cite}`du_systematic_2018, freytag_comparison_2018, weber_comparison_2016`). Since the Louvain algorithm is no longer maintained, using Leiden instead is preferred. \n",
"\n",
"We, therefore, propose to use the Leiden algorithm{cite}`traag_louvain_2019` on single-cell k-nearest-neighbour (KNN) graphs to cluster single-cell datasets. \n",
"\n",
@@ -39,7 +39,7 @@
"\n",
"<img src=\"../_static/images/clustering/clustering.jpeg\" alt=\"Clustering Overview\" class=\"bg-primary mb-1\" width=\"800px\">\n",
"\n",
"The Leiden algorithm computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node from its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. \n",
"The Leiden algorithm computes a clustering on a KNN graph obtained from the PC reduced expression space. It starts with an initial partition where each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on a refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained, and the final partition is reached. \n",
"\n",
":::\n",
"\n",
2 changes: 1 addition & 1 deletion jupyter-book/cellular_structure/integration.ipynb
@@ -2175,7 +2175,7 @@
"id": "f7c4a891",
"metadata": {},
"source": [
"This integration is also improved compared to the unintegrated data with cell identities grouped together but we sill see some shifts between batches."
"This integration is also improved compared to the unintegrated data with cell identities grouped together but we still see some shifts between batches."
]
},
{
4 changes: 2 additions & 2 deletions jupyter-book/introduction/analysis_tools.ipynb
@@ -2687,7 +2687,7 @@
"source": [
"We can write a simple function to run principal component analysis on such a concatenated matrix. MuData object provides a place to store multimodal embeddings: MuData`.obsm`. It is similar to how the embeddings generated on individual modalities are stored, only this time it is saved inside the MuData object rather than in AnnData`.obsm`.\n",
"\n",
"To calculate for example a principal component analysis (PCA) for the joint values of the modalities, we horizontally stack the values stored in the individual modalities and then perform the PCA on the stacked matrix. This is possible because the number of observations matches across modalities (remember, the number of features does per modality does not have to match)."
"To calculate for example a principal component analysis (PCA) for the joint values of the modalities, we horizontally stack the values stored in the individual modalities and then perform the PCA on the stacked matrix. This is possible because the number of observations matches across modalities (remember, the number of features per modality does not have to match)."
]
},
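(A minimal sketch of such a joint PCA, assuming every modality contains the same cells in the same order; the function and key names are illustrative.)

import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import PCA

def pca_joint(mdata, n_comps=30):
    # Horizontally stack the modality matrices (observations match across modalities)
    blocks = [m.X.toarray() if sp.issparse(m.X) else m.X for m in mdata.mod.values()]
    x = np.hstack(blocks)
    # Store the joint embedding on the MuData object itself
    mdata.obsm["X_pca_joint"] = PCA(n_components=n_comps).fit_transform(x)

pca_joint(mdata)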
{
@@ -2742,7 +2742,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable. This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently addressed as multi-omics integration methods. In the following section we will introduce muon which provides many tools to preprocess unimodal data besides beyond RNA-Seq and multi-omics integration methods."
"In reality, however, having different modalities often means that the features between them come from different generative processes and are not comparable. This is where special multimodal integration methods come into play. For omics technologies, these methods are frequently addressed as multi-omics integration methods. In the following section we will introduce muon which provides many tools to preprocess unimodal data beyond RNA-Seq and multi-omics integration methods."
]
},
{
6 changes: 3 additions & 3 deletions jupyter-book/introduction/interoperability.ipynb
@@ -32,7 +32,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have discussed in the {ref}`analysis frameworks and tools chapter <introduction:analysis-frameworks>` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and can use the platform the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n",
"As we have discussed in the {ref}`analysis frameworks and tools chapter <introduction:analysis-frameworks>` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and can use the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n",
"\n",
"While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages."
]
@@ -85,7 +85,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [NextFlow](https://www.nextflow.io/index.html) or [snakemake](https://snakemake.readthedocs.io/en/stable/)). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages."
"The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [Nextflow](https://www.nextflow.io/index.html) or [snakemake](https://snakemake.readthedocs.io/en/stable/)). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages."
]
},
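(A minimal Python-side sketch of a disk-based handoff via an H5AD file; the file path is an assumption.)

import anndata as ad

# Write the AnnData object to disk so another language or pipeline step can load it
adata.write_h5ad("adata_processed.h5ad")
# ...and read it back (or read a file produced elsewhere)
adata = ad.read_h5ad("adata_processed.h5ad")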
{
@@ -213,7 +213,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The [Bioconductor **{zellkonverter}** package](https://bioconductor.org/packages/zellkonverter/) helps makes this easier by using the [**{basilisk}** package](https://bioconductor.org/packages/basilisk/) to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python."
"The [Bioconductor **{zellkonverter}** package](https://bioconductor.org/packages/zellkonverter/) helps make this easier by using the [**{basilisk}** package](https://bioconductor.org/packages/basilisk/) to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python."
]
},
{