Introduction 2#

In this notebook, we will cover the topics of:

  • Dimensionality reduction

  • Clustering

  • Annotation

Library imports#


Install all packages for the tutorial

!pip install scanpy scrublet

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import os

# single-cell library
import scanpy as sc

Set project file paths#

Let us set up the connection with Google Drive.

from google.colab import drive
drive.mount('/content/drive')

We set up the file paths to the respective directories.

file_path = '/content/drive/My Drive/'  # this is the file path to your Google Drive (main directory)

os.listdir(file_path)  # shows all files in file_path

The data directory contains all processed data and anndata files.

data_dir = file_path + 'GSCN_workshop_2024/Data/'

Load data#

We read in the data that has already been pre-processed, i.e., QC-filtered and normalized.

input_path = '/Users/sara.jimenez/Documents/scWorkshop/data/'

adata = sc.read_h5ad(input_path + 'processed/processed_data.h5ad')

adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
    uns: 'log1p'
    layers: 'raw_data'

Highly Variable Genes#


We extract highly variable genes (HVGs) to further reduce the dimensionality of the dataset and include only the most informative genes. Genes that vary substantially across the dataset are informative of the underlying biological variation in the data. As we only want to capture biological variation in these genes, we select highly variable genes after normalization and batch correction. HVGs are used for clustering, trajectory inference, and dimensionality reduction/visualization, while the full data set is used for computing marker genes, differential testing, cell cycle scoring, and visualizing expression values on the data.


Typically between 1000 and 5000 genes are selected. Here, we extract the top 4000 most variable genes for further processing. If particular genes of importance are known, one could assess how many highly variable genes are necessary to include all, or the majority, of these.

sc.pp.highly_variable_genes(adata, layer = 'raw_data', n_top_genes = 4000, flavor = 'seurat_v3')

adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg'
    layers: 'raw_data'

adata.var[['highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm']].head()
            highly_variable  highly_variable_rank     means  variances  variances_norm
AL627309.1            False                   NaN  0.009596   0.010172        0.830376
AL627309.5            False                   NaN  0.058828   0.064552        0.752083
LINC01409             False                   NaN  0.156208   0.253991        1.094085
LINC01128             False                   NaN  0.090704   0.115030        0.854242
LINC00115             False                   NaN  0.010848   0.011232        0.794631
sc.pl.highly_variable_genes(adata)

The plots show how the data was normalized to select highly variable genes irrespective of the mean expression of the genes. This is achieved by using the index of dispersion which divides by mean expression, and subsequently binning the data by mean expression and selecting the most variable genes within each bin.
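As mentioned above, if particular genes of importance are known, you can check whether they were captured in the HVG set. A minimal sketch (the gene list here is purely illustrative):

# Illustrative check: are these genes of interest flagged as highly variable?
genes_of_interest = ['CD3E', 'MS4A1', 'LYZ']  # replace with your own genes
adata.var.loc[adata.var_names.isin(genes_of_interest), 'highly_variable']
# If important genes are missing, consider increasing n_top_genes and re-running the selection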


Dimensionality reduction: PCA#


scRNA-seq attempts to measure the mRNA transcripts of all genes (~20,000) at single-cell resolution, resulting in high-dimensional data. The cells represent the samples, while the genes represent the features or dimensions of the data. These features/dimensions define the phenotype of each single cell. A cellular phenotype vector points to the location of a cell in a high-dimensional space in which each axis corresponds to one of the measured features (genes).

Meanwhile, we know that the genes and their products inside cells form modules, a result of genes being co-regulated. The modular and co-regulatory structure of gene interaction means that a typical scRNA-seq dataset displays correlation structure among groups of genes. Thus, we are capturing redundant information by measuring all genes from a single cell. Put differently, there are only a few key variables (possibly latent) that drive the underlying biological process. One consequence of this realization is that the cellular phenotype space is really a low-dimensional object embedded in a high-dimensional space, i.e. the biological system really has few free variables but is projected onto a space of thousands of features. We call this low-dimensional object the data manifold. It refers to the true structure of the data irrespective of how it is embedded, and understanding the manifold holds the key to uncovering the underlying biology. The notion of dimensionality reduction is thus to extract those few meaningful dimensions that are reflective of the true underlying biological process.

Furthermore, performing dimensionality reduction reduces the number of dimensions to work with, so it has the added advantage of speeding up computation (especially when dealing with millions of cells).

PCA (Principal Component Analysis) is one of the most popular dimensionality reduction algorithms. In short, PCA takes high-dimensional data as input and outputs transformed data with the same number of samples (cells) but a reduced number of dimensions, such that the recovered dimensions represent the directions of highest variance in the original data.

sc.tl.pca(adata, n_comps=100, use_highly_variable=True)

A brief explanation of the code above:

First, we provide adata as an input to the method. As with any other Scanpy method, since we do not specify anything else, the PCA is run on adata.X. Now, we recall that adata.X holds the log-normalized data (or the z-scored data, if you successfully executed the sc.pp.scale function earlier and did not set adata.X back to the log-normalized data). Remember: it is crucial that you run PCA on your normalized data (which, for the sake of this example notebook, is median library size normalization + log transformation).

-

Second, we specify n_comps = 100, indicating that we are interested only in the top 100 components.

-

Third, we specify use_highly_variable = True which ensures that PCA is computed only on the highly variable genes.
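Regarding the first point, a quick optional sanity check (a minimal sketch, assuming raw counts are kept in the 'raw_data' layer as in this notebook) is to compare the value ranges of adata.X and the raw layer; the log-normalized matrix should have a much smaller maximum:

# Sanity check: adata.X should hold log-normalized values, raw counts live in the 'raw_data' layer
print('max of adata.X:', adata.X.max())
print('max of raw counts:', adata.layers['raw_data'].max())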

adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'raw_data'

To choose the optimal number of principal components, we can visualize the cumulative variance explained:

plt.plot(range(len(adata.uns['pca']['variance_ratio'])), np.cumsum(adata.uns['pca']['variance_ratio']) * 100, '.-')
plt.axvline(30, color = 'r')
plt.xlabel('Principal Component', fontsize = 14)
plt.ylabel('% Variance Explained', fontsize = 14)

Text(0, 0.5, '% Variance Explained')

As we can see, the increase in variance explained is minimal after about 30 components (as indicated by the red line). So, it is reasonable to perform downstream analysis using 30 principal components.

One can argue that it might be sufficient to choose 20 or 25 principal components based on our argument above, but we recommend being slightly conservative and erring on the side of choosing a few more principal components. However, since there is no absolute way to establish the right number of components, we strongly recommend evaluating the consistency of downstream results (clustering or trajectories) while altering the number of principal components slightly around the chosen value. In the above example, you could see how downstream results change if you alter the number of principal components between 20 and 50.
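A minimal sketch for comparing a few candidate cut-offs (the values are illustrative):

# Cumulative variance explained at a few candidate numbers of PCs
cumvar = np.cumsum(adata.uns['pca']['variance_ratio']) * 100
for n in [20, 25, 30, 40, 50]:
    print(n, 'PCs:', round(cumvar[n - 1], 1), '% variance explained')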


For now, we will proceed with 30 components.

adata.obsm['X_pca'] = adata.obsm['X_pca'][:, 0:30]

adata.obsm['X_pca'].shape

(11984, 30)

Similarly, we want to update the loadings stored in adata.varm['PCs'] to reflect only the top 30 principal components.

adata.varm['PCs'] = adata.varm['PCs'][:, 0:30]

Note: Genes-Loadings connection#

As we discussed above, we can use the loadings to identify the most influential genes for each principal component.

We will sort the gene names by their loadings for the top 30 PCs. First, let’s create a pandas dataframe of the PCA loadings indexed by the gene names.

df_loadings = pd.DataFrame(adata.varm['PCs'], 
                           index = adata.var.index, 
                           columns = ['PC-' + str(j) for j in range(30)])

df_loadings.head()
            PC-0  PC-1  PC-2  PC-3  PC-4  PC-5  PC-6  PC-7  PC-8  PC-9  ...  PC-20  PC-21  PC-22  PC-23  PC-24  PC-25  PC-26  PC-27  PC-28  PC-29
AL627309.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
AL627309.5   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
LINC01409    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
LINC01128    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
LINC00115    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

5 rows × 30 columns


Since we ran PCA on HVG, let’s make sure that we look at the loadings specific to HVG only. For this we will only consider genes in the HVG.

df_loadings_hvg = df_loadings.loc[adata.var_names[adata.var['highly_variable']]]

df_loadings_hvg
               PC-0       PC-1       PC-2       PC-3       PC-4       PC-5       PC-6       PC-7       PC-8       PC-9  ...      PC-20      PC-21      PC-22      PC-23      PC-24      PC-25      PC-26      PC-27      PC-28      PC-29
HES4       0.003460   0.001495   0.004655   0.024923   0.027061  -0.005100  -0.010910   0.031612  -0.002016  -0.029858  ...  -0.005577  -0.001592   0.001108  -0.007328  -0.001087   0.002984   0.005398  -0.015227  -0.003382  -0.006898
ISG15      0.025548  -0.010428   0.018005   0.043884   0.034518   0.016557  -0.046805  -0.131785   0.082209  -0.043594  ...  -0.012146   0.001861  -0.000153  -0.076601   0.041966  -0.040002  -0.015070   0.051024  -0.027368  -0.034242
TNFRSF18  -0.001926  -0.002227   0.010640  -0.000202  -0.001739   0.009756  -0.000958   0.005888   0.001802  -0.005096  ...   0.004989   0.007858  -0.004543  -0.000308   0.007913   0.000881  -0.000160  -0.003423  -0.001428  -0.003782
TNFRSF4   -0.002326  -0.003271   0.003786   0.001185  -0.000252   0.025617  -0.003469   0.006719   0.003083  -0.001872  ...   0.004027   0.014150  -0.008061   0.000240   0.007270  -0.004165  -0.004229  -0.004143  -0.001100  -0.005622
AJAP1     -0.000105   0.000798   0.000185  -0.000454   0.000649   0.000810   0.009929  -0.001191   0.005828  -0.004675  ...   0.000171  -0.000859  -0.002204  -0.001190   0.000733   0.001044   0.000836   0.000032  -0.001007  -0.000782
...             ...        ...        ...        ...        ...        ...        ...        ...        ...        ...  ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
F8         0.000030  -0.000340   0.001149  -0.000932  -0.000125   0.001024  -0.000448  -0.000670  -0.001311   0.000317  ...  -0.000260  -0.000183   0.003370   0.000279   0.002084   0.000202  -0.000125  -0.002253  -0.001827   0.000195
CLIC2      0.001227   0.001176   0.000546   0.003140   0.005460   0.001243   0.009093  -0.000866  -0.010418   0.008147  ...  -0.004586  -0.001762   0.004911  -0.001560  -0.002087  -0.000505  -0.003225  -0.000520  -0.003054   0.001727
TMLHE-AS1 -0.001851   0.004959   0.001356  -0.005623   0.005996  -0.003384  -0.004824   0.003624  -0.004341  -0.001396  ...  -0.006034   0.002125  -0.000821  -0.007054  -0.006851  -0.009666  -0.002975   0.000777   0.000721  -0.001407
TMLHE      0.002869   0.007232   0.000418  -0.006841   0.017279  -0.002727  -0.005216   0.004149  -0.010962  -0.000693  ...  -0.001637  -0.008905  -0.002481  -0.010080  -0.007547  -0.013333  -0.007706   0.013950   0.002581  -0.007681
MT-CO1     0.013434   0.005554   0.021370   0.025839  -0.029545  -0.017136  -0.004244   0.030658   0.036907   0.045738  ...  -0.030893  -0.021095  -0.032621  -0.007703   0.029223   0.044550   0.025316  -0.023050  -0.000511   0.020105

4000 rows × 30 columns


We can sort the dataframe constructed above for any particular principal component.

A note on the sign of the loadings: since both the positive and negative extremes of each principal component are treated equally in data interpretation, it stands to reason that we want to look at the absolute value of the loadings.

np.abs(df_loadings_hvg).sort_values(by = 'PC-0', ascending = False)['PC-0']

LYZ         1.400950e-01
S100A9      1.391812e-01
S100A8      1.175034e-01
CTSS        1.035663e-01
VCAN        9.564024e-02
                ...     
DCLK2       1.556435e-06
SCG5        5.045683e-07
NRG3        3.220663e-07
GPR173      2.408021e-07
TMEM132B    1.863688e-07
Name: PC-0, Length: 4000, dtype: float64

As we can see, the first principal component is dominated by LYZ, S100A9, S100A8, CTSS, etc., which are myeloid markers. This indicates that most of the variance in the data is driven by myeloid-specific genes.

Whether these genes are relevant/important for your study is something you will have to decide. Why might we want to study the loadings? Sometimes we notice that some of the principal components are driven by the expression of mitochondrial or ribosomal genes, and depending on the question you are trying to answer it might be of interest to remove the effect of such genes. This way you can identify whether the principal components are being influenced by these nuisance genes, and whether you need to adjust your set of HVGs (for example, remove mitochondrial or ribosomal genes from the list of HVGs, or add a new set of context-specific genes) for downstream computation.
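As a minimal sketch (assuming the standard human gene-name prefixes for mitochondrial and ribosomal genes), you could count how many such nuisance genes ended up among the HVGs before deciding whether to adjust the list:

# Count mitochondrial / ribosomal genes among the selected HVGs (the name prefixes are an assumption)
nuisance = (adata.var_names.str.startswith('MT-')
            | adata.var_names.str.startswith('RPS')
            | adata.var_names.str.startswith('RPL'))
print('nuisance genes among HVGs:', (adata.var['highly_variable'] & nuisance).sum())
# If needed, one could set adata.var.loc[nuisance, 'highly_variable'] = False and re-run sc.tl.pca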

sc.pl.pca(adata, color = ['LYZ', 'MS4A1', 'CD3E', 'NKG7', 'mt_pct_content', 'log1p_total_counts'], 
          ncols = 3, vmin = 'p1', vmax = 'p99')

Nearest neighbor graph#

As we have already mentioned, the cellular phenotype space is modeled as a low-dimensional object (called the data manifold) embedded in a high-dimensional space. The expression patterns of genes define the manifold structure, so cells with similar expression profiles are close to each other on the manifold. In other words, the local neighborhood of each cell on the manifold consists of similar cells. This is an important property that must be satisfied by any method that aims to accurately capture the structure of the manifold (such as for visualization). In addition, the manifold only depends on the connectivity of the data points - defined via pairwise distances - and not on how the points are arranged in space. Therefore, quantifying cell-cell distance is important to study the manifold structure of the data.

In the context of single-cell biology, the data manifold is a characterization of the available space of phenotypes for any cell in that context. Exploring the manifold is therefore equivalent to exploring the possible cellular states, and any defect/perturbation in the molecular processes should be reflected as a change in the phenotype landscape. This deviation from normal expression would show up as a change in characteristics of the data manifold, such as its volume, variance or regions of occupancy; analyzing these can help us understand the cause and consequences of the changes.


Understanding the manifold of a high-dimensional data is a problem that pervades several disciplines. As a result, many methods have been proposed to tackle this problem. It is easier to obtain the structure if the underlying manifold is linear - that is, the observed features can be written as a linear combination of the latent variables. Linear methods such as Principal Component Analysis (PCA), discussed above, have been successfully applied in such cases. However, in scRNA-seq data, genes often display non-linear association. The non-linearity is a result of the measured genes themselves being a non-linear combination of the latent variables driving the process, which is in turn a result of the cascades of non-linear interactions, checkpoints and feedback mechanisms a cell has in place. Thus, characterization of the manifold requires us to involve methods that take into account the non-linearity present in single-cell data.

To tackle the issue of non-linearity, we aim to characterize the data manifold by first analyzing the data locally around every cell and then combining these local views. We assume that around each cell the local neighborhood is roughly linear (or Euclidean), so we apply linear methods locally and aggregate to get the total picture of the manifold. This notion of analyzing data locally to characterize the underlying manifold has proven beneficial in single-cell data analysis. From a mathematical point of view, graphs are the perfect tools for this.

A graph is defined as a set G = {V,E,W} consisting of a set of nodes or vertices V, a set of edges E connecting all or some of the vertices, and weights W assigned to each of the edges. In the graph representation of single-cell data, each cell is a vertex or node in the graph, two cells are connected if they are similar to each other, and the weight between them is the quantification of their similarity. The similarity between two cells can be obtained by measuring the distance between them. The choice of distance metric is context dependent, but due to its ease and useful properties, Euclidean distance seems to be the preferred choice in the single-cell field. The manifold is congruent to a Euclidean space locally around any given point, so it is sensible to use Euclidean distances to compute similarity in a small neighborhood. In particular, for every cell we first find the k nearest neighbors (typically k is set to < 1% of the total number of cells profiled; a typical choice is k = 30) based on Euclidean distance, which will comprise the local neighborhood. Each cell is thus connected to its k nearest neighbors by an edge, and the constructed graph (called the k-nearest neighbor graph) approximates the underlying manifold. The phenotypic distance between any two cells on such a graph is then defined by geodesics along the graph, which more accurately captures biological differences between cells.

The majority of visualization, clustering and trajectory inference tools build on top of the graph constructed above. While this graph may need to be further adjusted to account for the inherent noise in the data, it forms the backbone of most of the tools we will discuss henceforth. We use Scanpy’s inbuilt method to construct the graph:

sc.pp.neighbors(adata, n_neighbors=30, use_rep='X_pca', metric='euclidean', key_added='neighbors_30')

An explanation of the code above:

  1. We ask for k = 30 nearest neighbors. This is a typical choice in data analysis, but you can compute graphs for multiple values of k and study how the downstream results change (in fact we encourage you to do this to see how your downstream biological interpretations, such as visuals and clustering, alter with slight changes in k; for this dataset, you may want to study values of k between 20 and 40, as in the sketch after this list).

  2. We set use_rep = 'X_pca' because we want to compute the k-nearest neighbor graph on the PCA computed above. PCA has been observed to be good at capturing global aspects of the data, but to study finer structures of the biological manifold we recommend building the graph on top of the PCA. Furthermore, considering only the top principal components has the added advantage of removing the noise represented by the lower principal components.

  3. We set metric = 'euclidean', but you are encouraged to explore other metrics such as 'cosine' or 'correlation'. The single-cell field still has not agreed upon the best choice of metric or comprehensively identified situations where certain metrics might be better suited than others.

  4. We set key_added = 'neighbors_30' to store our results under the name 'neighbors_30'. This can be handy should you want to compute multiple graphs, as you can simply store multiple results under different names. For more details on this, please see the Scanpy API on nearest neighbors.
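A minimal sketch of such a sweep (the values of k are illustrative); each graph is stored under its own key, so they do not overwrite each other:

# Compute and store k-nearest neighbor graphs for a few values of k
for k in [20, 30, 40]:
    sc.pp.neighbors(adata, n_neighbors=k, use_rep='X_pca', metric='euclidean',
                    key_added='neighbors_' + str(k))
# Downstream tools can then be pointed at a specific graph, e.g. neighbors_key='neighbors_20'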
adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

adata.obsp['neighbors_30_distances']

<11984x11984 sparse matrix of type '<class 'numpy.float64'>'
	with 347536 stored elements in Compressed Sparse Row format>

The distances (which measure dissimilarity between cells) are represented as a sparse matrix for 11984 cells as shown above. Scanpy also converts the distances into similarities and outputs them as connectivities.

adata.obsp['neighbors_30_connectivities']

<11984x11984 sparse matrix of type '<class 'numpy.float32'>'
	with 500544 stored elements in Compressed Sparse Row format>

Note that the number of non-zero entries differs between the two matrices because the connectivities matrix is symmetric while the distances matrix is not.
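A small sketch to verify this (purely illustrative, not required for the workflow):

# Connectivities should be symmetric, the kNN distances generally are not
conn = adata.obsp['neighbors_30_connectivities']
dist = adata.obsp['neighbors_30_distances']
print('connectivities symmetric:', (conn != conn.T).nnz == 0)
print('distances symmetric:', (dist != dist.T).nnz == 0)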

All major downstream analysis methods utilize the graph constructed above to visualize, cluster or infer trajectories from the data. We begin by exploring the visualization methods and will run UMAP and a force-directed layout. As such, if you want to change the properties of the neighbor graph (e.g. for a specific visualization or clustering), you would need to re-run the neighbors command.


Data Visualization#


UMAP#

Uniform Manifold Approximation and Projection (UMAP) has been widely applied to visualize and analyze scRNA-seq data. In summary, UMAP uses the graph we built above and tries to arrange the data points in a low-dimensional space (typically 2D) such that the cross-entropy between the high-dimensional graph weights and the low-dimensional graph weights is minimized. The author of UMAP has written a beautiful exposition on how UMAP works, which can be found here: https://bit.ly/2qGhBkk.

Note: UMAP aims to preserve aspects of the high-dimensional structure in the projected space, which may suggest that one can do further analysis on the UMAP features. However, this should be done with care and we strongly recommend against it, specifically because we have the tools needed to perform efficient analytic computations (such as clustering, trajectory detection, etc.) in the high dimensions themselves, thereby precluding the need to do computations on the UMAP. For some FAQs on UMAP, please see this page: https://bit.ly/2NfgAxY.

sc.tl.umap(adata, neighbors_key = 'neighbors_30', min_dist=0.1)

Because we saved our k-nearest neighbor results as neighbors_30, we need to specify it as above. We recommend setting min_dist = 0.1 because that was the original default proposed in the UMAP algorithm and it seems to produce more visually appealing plots. We can do this because the sole purpose of UMAP is to aid in visualization; no quantitative information should be extracted from UMAP (e.g. definite distance/proximity between cells/clusters).
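If you want to see how min_dist affects the layout, a minimal sketch (the values are illustrative) is to recompute the UMAP a few times and keep a copy of each embedding; note that each call overwrites adata.obsm['X_umap']:

# Illustrative sweep over min_dist; copy each embedding out before the next call overwrites it
umap_layouts = {}
for md in [0.1, 0.3, 0.5]:
    sc.tl.umap(adata, neighbors_key='neighbors_30', min_dist=md)
    umap_layouts[md] = adata.obsm['X_umap'].copy()
adata.obsm['X_umap'] = umap_layouts[0.1]  # keep the min_dist = 0.1 layout for the rest of the notebook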

adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

adata.obsm['X_umap']

array([[ 1.6250963,  5.5733585],
       [-4.103601 ,  6.767667 ],
       [-1.0423499,  6.822509 ],
       ...,
       [10.327127 , 10.977615 ],
       [15.360161 ,  2.9707499],
       [ 9.139833 , -1.1203988]], dtype=float32)

adata.obsm['X_umap'].shape

(11984, 2)

Now we can visualize the results:

sc.pl.umap(adata)

/Users/sara.jimenez/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  cax = scatter(

Visualization examples#

adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

Color by metadata#

sc.pl.umap(adata, color = ['log1p_total_counts', 'log1p_n_genes_by_counts', 'mt_pct_content', 'ribo_pct_content'], 
           ncols = 2, cmap = 'plasma', use_raw=False)

Color by gene expression#


We can use the computed UMAP projections to visualize gene expression. This can help us perform some basic sanity checks (if known genes are expressed appropriately), develop hypotheses and aid in cell type annotation (which we will look at next in clustering segment).

For now, since we are looking at PBMCs, we will do some sanity checks using an established set of genes used to identify various PBMC cell types (list taken from: https://satijalab.org/seurat/archive/v3.0/pbmc3k_tutorial.html)


[‘PTPRC’, ‘CD3E’, ‘CD8A’, ‘CD4’, ‘IL7R’, ‘MS4A1’, ‘CD19’, ‘LYZ’, ‘FCGR3A’, ‘NKG7’, ‘CST3’, ‘PPBP’]

We can make these visuals using Scanpy’s inbuilt UMAP plot function. Important note: by default, Scanpy’s inbuilt function may rely on adata.raw to extract gene expression if you have adata.raw saved. So please be careful to set use_raw = False if adata.raw stores the original raw counts, as you most likely want to visualize the log-normalized counts stored in adata.X.

genes = ['CD3E', 'CD8A', 'CD4', 'IL7R', 'MS4A1', 'CD19', 'LYZ', 'FCGR3A', 'NKG7', 'CST3', 'PPBP', 'PTPRC']
sc.pl.umap(adata, color = genes, cmap = 'plasma', ncols = 3, use_raw=False)

A note: sometimes it helps to trim the expression levels of genes to certain percentiles to avoid the visualization being biased by a few super-high or super-low values. For this we can utilize the vmin and vmax parameters during plotting, as shown below. We will set vmin to the 1st percentile of gene expression and vmax to the 99th percentile, which means the minimum of the color scale corresponds to the 1st percentile of the original expression, and analogously for the maximum.

genes = ['CD3E', 'CD8A', 'CD4', 'IL7R', 'MS4A1', 'CD19', 'LYZ', 'FCGR3A', 'NKG7', 'CST3', 'PPBP', 'PTPRC']
sc.pl.umap(adata, color = genes, cmap = 'plasma', ncols = 3, use_raw=False, vmin = 'p1', vmax = 'p99')

Alternate code (more manual)#

genes = ['CD3E', 'CD8A', 'CD4', 'IL7R', 'MS4A1', 'CD19', 'LYZ', 'FCGR3A', 'NKG7', 'CST3', 'PPBP', 'PTPRC']

fig_nrow = 3
fig_ncol = 4

fig = plt.figure(figsize = (8*fig_ncol, 6*fig_nrow))
count = 0
for gene_name in genes:
    ax = fig.add_subplot(fig_nrow, fig_ncol, count + 1)
    col_gene = adata.X.toarray()[:, adata.var_names.get_loc(gene_name)]
    im1 = ax.scatter(adata.obsm['X_umap'][:, 0], adata.obsm['X_umap'][:, 1], s = 2, c = col_gene, 
                     cmap = 'plasma')
    ax.set_xlabel('UMAP-1', fontsize = 14)
    ax.set_ylabel('UMAP-2', fontsize = 14)
    ax.set_title('UMAP colored by ' + gene_name, fontsize = 14)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.colorbar(im1)
    count = count + 1

# To save, please uncomment the following line of code:
# fig.savefig(output_directory + 'UMAP_colored_by_genes.png', dpi = 150, bbox_inches = 'tight')

Such visualizations can help you perform data analysis better. You may already be able to draw many hypotheses and make assessments of the data using such visualizations.

Similarly, here is alternate code to trim the gene expression at the 1st and 99th percentiles.

genes = ['CD3E', 'CD8A', 'CD4', 'IL7R', 'MS4A1', 'CD19', 'LYZ', 'FCGR3A', 'NKG7', 'CST3', 'PPBP', 'PTPRC']

fig_nrow = 3
fig_ncol = 4

fig = plt.figure(figsize = (8*fig_ncol, 6*fig_nrow))
count = 0
for gene_name in genes:
    ax = fig.add_subplot(fig_nrow, fig_ncol, count + 1)
    col_gene = adata.X.toarray()[:, adata.var_names.get_loc(gene_name)]
    im1 = ax.scatter(adata.obsm['X_umap'][:, 0], adata.obsm['X_umap'][:, 1], s = 2, c = col_gene, 
                     cmap = 'plasma', vmin = np.percentile(col_gene, 1), vmax = np.percentile(col_gene, 99))
    ax.set_xlabel('UMAP-1', fontsize = 14)
    ax.set_ylabel('UMAP-2', fontsize = 14)
    ax.set_title('UMAP colored by ' + gene_name, fontsize = 14)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.colorbar(im1)
    count = count + 1

# To save, please uncomment the following line of code:
# fig.savefig(output_directory + 'UMAP_colored_by_genes.png', dpi = 150, bbox_inches = 'tight')

Visualize gene signature#

You can also visualize established gene signatures (from published papers or bulk signatures) on the computed UMAP. For example, you may be familiar with the Hallmark, Reactome or GO genesets. You can load any of your favorite pathways and visualize their expression using the following code as a guideline. For illustration, we have provided you with an example geneset (HALLMARK_G2M_CHECKPOINT) that we will use to visualize the expression (taken from: https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_G2M_CHECKPOINT.html).

# load the file
geneset = pd.read_csv(input_path + 'HALLMARK_G2M_CHECKPOINT.v2023.2.Hs.grp')

geneset
                               HALLMARK_G2M_CHECKPOINT
0    # https://www.gsea-msigdb.org/gsea/msigdb/huma...
1                                                 ABL1
2                                                 AMD1
3                                               ARID4A
4                                                 ATF5
..                                                 ...
196                                               UCK2
197                                               UPF1
198                                                WRN
199                                               XPO1
200                                             YTHDC1

201 rows × 1 columns

genes = geneset['HALLMARK_G2M_CHECKPOINT']
genes

0      # https://www.gsea-msigdb.org/gsea/msigdb/huma...
1                                                   ABL1
2                                                   AMD1
3                                                 ARID4A
4                                                   ATF5
                             ...                        
196                                                 UCK2
197                                                 UPF1
198                                                  WRN
199                                                 XPO1
200                                               YTHDC1
Name: HALLMARK_G2M_CHECKPOINT, Length: 201, dtype: object
type(genes)

pandas.core.series.Series
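The first entry of the .grp file is a comment line rather than a gene symbol; sc.tl.score_genes simply ignores it (see the warning below), but as a minimal sketch you could drop it explicitly:

# Optional: keep only real gene symbols by dropping the leading comment line
genes_clean = genes[~genes.str.startswith('#')]
genes_clean.head()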
# Compute a gene signature score using Scanpy's inbuilt function
sc.tl.score_genes(adata, genes, score_name='G2_M_checkpoint', random_state=0, use_raw = False)

WARNING: genes are not in var_names and ignored: ['# https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_G2M_CHECKPOINT', 'BIRC5', 'CDC20', 'CDC25A', 'CDC45', 'CDK1', 'EGF', 'ESPL1', 'EXO1', 'H2AX', 'H2AZ1', 'H2AZ2', 'H2BC12', 'HOXC10', 'MEIS2', 'NEK2', 'PBK', 'PTTG3P', 'TROAP', 'TTK', 'UBE2C']
adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet', 'G2_M_checkpoint'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'
sc.pl.umap(adata, color = 'G2_M_checkpoint', cmap = 'plasma')

sc.pl.umap(adata, color = 'G2_M_checkpoint', cmap = 'plasma', vmin = 'p1', vmax = 'p99')

Clustering#


One of the primary applications of scRNA-seq has been in the identification of novel and rare cell types. In contrast to bulk measurements, which obscure the cell-cell heterogeneity, single-cell level analysis offers an unprecedented precision to classify cells into different types.


To think about it computationally: The data presents us with a bunch of objects (individual cells) that have a certain set of features (genes), and we would like to classify the objects into distinct groups or categories each consisting of similar cells. In our case, these distinct groups or categories can be biologically interpreted as cell types. Therefore, in essence the problem of identifying cell types (novel, rare or known) boils down to a problem of clustering.

Performing modularity optimization via Leiden community detection on the k-nearest-neighbor graph of cells has become an established practice in scRNA-seq analysis, so this is the method of choice in this tutorial as well.


Here, we perform clustering at two resolutions. Investigating several resolutions allows us to select a clustering that appears to capture the main clusters in the visualization and can provide a good baseline for further subclustering of the data to identify more specific substructure.


Clustering is performed on the highly variable gene data, dimensionality reduced by PCA, and embedded into a KNN graph.

We compute a Leiden clustering at two different resolutions (0.5 and 1.5), compare the clusterings in a table, and visualize them in an embedding.


Leiden clustering#

sc.tl.leiden(adata, resolution = 1.5, random_state=0, neighbors_key='neighbors_30', key_added='leiden_r1.5')
sc.tl.leiden(adata, resolution = 0.5, random_state=0, neighbors_key='neighbors_30', key_added='leiden_r0.5')
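A quick way to see how granular each resolution is (a minimal sketch):

# Number of clusters found at each resolution
for key in ['leiden_r0.5', 'leiden_r1.5']:
    print(key, ':', adata.obs[key].nunique(), 'clusters')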
adata

AnnData object with n_obs × n_vars = 11984 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet', 'G2_M_checkpoint', 'leiden_r1.5', 'leiden_r0.5'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap', 'leiden'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

pd.crosstab(adata.obs['leiden_r0.5'], adata.obs['leiden_r1.5'])
[output: crosstab of leiden_r0.5 (rows 0-15) against leiden_r1.5 (columns 0-24); each entry is the number of cells shared by the two clusters]

16 rows × 25 columns

# Visualize the clustering
sc.pl.umap(adata, color=['leiden_r1.5', 'leiden_r0.5'], wspace=0.6)

/Users/sara.jimenez/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Investigate % mito content, library size, number of genes expressed in each cluster#

adata.obs.columns

Index(['n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts',
       'log1p_total_counts', 'pct_counts_in_top_50_genes',
       'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes',
       'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content',
       'doublet_score', 'doublet', 'G2_M_checkpoint', 'leiden_r1.5',
       'leiden_r0.5'],
      dtype='object')

df_temp = pd.DataFrame({'leiden': adata.obs['leiden_r1.5'], 
                        '%-Mito': adata.obs['mt_pct_content'], 
                        'library_size': adata.obs['log1p_total_counts'], 
                        'n_genes_per_cell': adata.obs['log1p_n_genes_by_counts']}, index = adata.obs.index)

df_temp
[output: df_temp, a per-cell dataframe indexed by cell barcode with columns 'leiden', '%-Mito', 'library_size' and 'n_genes_per_cell']

11984 rows × 4 columns

import seaborn as sns

fig = plt.figure(figsize = (8*2, 6*3))

ax = fig.add_subplot(3, 1, 1)
sns.boxplot(x = "leiden", y = "%-Mito", hue = 'leiden', data = df_temp, ax = ax, 
            palette = sc.pl.palettes.zeileis_28)
ax.grid('on')
ax.legend().set_visible(False)

ax = fig.add_subplot(3, 1, 2)
sns.boxplot(x = "leiden", y = "library_size", hue = 'leiden', data = df_temp, ax = ax, 
            palette = sc.pl.palettes.zeileis_28)
ax.grid('on')
ax.legend().set_visible(False)

ax = fig.add_subplot(3, 1, 3)
sns.boxplot(x = "leiden", y = "n_genes_per_cell", hue = 'leiden', data = df_temp, ax = ax, 
            palette = sc.pl.palettes.zeileis_28)
ax.grid('on')
ax.legend(loc="lower left", ncol = 5)

<matplotlib.legend.Legend at 0x28b16ef10>

Highlight clusters of interest#

clusters_interest = ['17']
colors = ['red']
plt.scatter(adata.obsm['X_umap'][:, 0], adata.obsm['X_umap'][:, 1], s = 1, color = 'gray')
for counter, item in enumerate(clusters_interest):
    cells_in_cluster = adata.obs['leiden_r1.5'] == item
    plt.scatter(adata.obsm['X_umap'][cells_in_cluster, 0], adata.obsm['X_umap'][cells_in_cluster, 1], s = 1, 
                color = colors[counter], label = 'Cluster-' + str(item))

plt.legend(markerscale = 5)
plt.axis('off')

(-7.401524209976197, 18.82111325263977, -6.494140005111694, 18.89473376274109)

Remove the clusters#

adata_clean = adata[~np.isin(adata.obs['leiden_r1.5'], clusters_interest), :]

adata_clean

View of AnnData object with n_obs × n_vars = 11828 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet', 'G2_M_checkpoint', 'leiden_r1.5', 'leiden_r0.5'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap', 'leiden', 'leiden_r1.5_colors', 'leiden_r0.5_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'
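A small aside (an alternative, not part of the original workflow): subsetting an AnnData object returns a view, which is why an ImplicitModificationWarning appears when the view is first modified in the next section; calling .copy() up front avoids this:

# Alternative: materialize the subset immediately instead of keeping a view
adata_clean = adata[~np.isin(adata.obs['leiden_r1.5'], clusters_interest), :].copy()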

Re-do pre-processing#

# HVG
sc.pp.highly_variable_genes(adata_clean, layer = 'raw_data', n_top_genes = 4000, flavor = 'seurat_v3')

/Users/sara.jimenez/miniconda3/envs/scvi-env/lib/python3.9/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
  self.data[key] = value
# PCA
sc.tl.pca(adata_clean, n_comps=30, use_highly_variable=True)

# Nearest Neighbors
sc.pp.neighbors(adata_clean, n_neighbors=30, use_rep = 'X_pca', key_added= 'neighbors_30')

# UMAP
sc.tl.umap(adata_clean, neighbors_key = 'neighbors_30', min_dist=0.1)

# Clustering
sc.tl.leiden(adata_clean, resolution = 1, random_state=0, neighbors_key='neighbors_30')
adata_clean

AnnData object with n_obs × n_vars = 11828 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet', 'G2_M_checkpoint', 'leiden_r1.5', 'leiden_r0.5', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap', 'leiden', 'leiden_r1.5_colors', 'leiden_r0.5_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

sc.pl.umap(adata_clean, color = 'leiden')

/Users/sara.jimenez/miniconda3/envs/scvi-env/lib/python3.9/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

Marker genes and cluster annotation#

To annotate the clusters we obtained, we find genes that are up-regulated in each cluster compared to all other clusters (marker genes). This differential expression test is performed with a Welch t-test (the scanpy default; note the warning below that the default changed from the more conservative 't-test_overestim_var' to 't-test'). The test is performed on the .raw data set if it is present, which is uncorrected and contains all genes. All genes are taken into account, as any gene may be an informative marker.

Annotation can be performed in a manual or an automated fashion.

  • Manual annotation: based on the expression of marker genes and consultation with experts.

  • Automated annotation: the development of large-scale atlases of different tissues allows datasets to be mapped to references. These methods are based on different principles, sometimes requiring pre-defined sets of markers, other times trained on pre-existing full scRNA-seq datasets. See also several reviews (Pasquini et al., 2021; Abdelaal et al., 2019) for a more elaborate discussion of automated annotation methods.

The resulting annotations can be of varying quality. It is therefore important to regard these methods as a starting point rather than an end-point of the annotation process.


Compute the differential expression profile for each cluster with rank_genes_groups and visualize the results.

adata_clean

AnnData object with n_obs × n_vars = 11828 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'doublet_score', 'doublet', 'G2_M_checkpoint', 'leiden_r1.5', 'leiden_r0.5', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'log1p', 'hvg', 'pca', 'neighbors_30', 'umap', 'leiden', 'leiden_r1.5_colors', 'leiden_r0.5_colors', 'leiden_colors', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_distances', 'neighbors_30_connectivities'

sc.pp.log1p(adata_clean)

WARNING: adata.X seems to be already log-transformed.
# Calculate marker genes
sc.tl.rank_genes_groups(adata_clean, groupby='leiden')

WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'

# Plot marker genes
sc.pl.rank_genes_groups(adata_clean, fontsize=12)

Here, we observe potentially characteristic gene expression patterns.

Furthermore, the score itself is not interpretable in terms of specificity and significance in the case of clustering, because the clusters were previously defined as groups of cells that differ from the rest. Therefore, we compare a group that is a priori different from the rest, and the resulting scores (or p-values) are inflated. Moreover, the smaller a cluster is, the smaller the observed score, unless a gene is very specific to the cluster. Typically, we may find marker genes in the gene lists of the rank_genes_groups test, but not all marker genes have a high expression level.
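With that caveat in mind, a minimal sketch (the thresholds are arbitrary) for narrowing the output down to more interpretable candidate markers is to filter by effect size and adjusted p-value:

# Illustrative filter on the default rank_genes_groups result for cluster '0'
df0 = sc.get.rank_genes_groups_df(adata_clean, group='0')
df0_filtered = df0[(df0['logfoldchanges'] > 1) & (df0['pvals_adj'] < 0.05)]
df0_filtered.head()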


Wilcoxon test#

Seurat has a FindMarkers function (https://satijalab.org/seurat/reference/findmarkers) to compute differentially expressed genes; by default it performs a Wilcoxon test for a gene in two different groups (on the log-normalized data). We will use the rank_genes_groups function in Scanpy to do the same.

On log-normalized data:

sc.tl.rank_genes_groups(adata_clean, groupby = 'leiden', groups='all', reference='rest', 
                        key_added='on_norm_log', method='wilcoxon')

Example output for Cluster 0:

df_wt = sc.get.rank_genes_groups_df(adata_clean, group = str(0), key = 'on_norm_log')

df_wt
         names     scores  logfoldchanges  pvals  pvals_adj
0         FGL2  58.552078        2.490741    0.0        0.0
1         CPVL  56.514812        2.608812    0.0        0.0
2         LAP3  56.215076        2.082670    0.0        0.0
3      TNFSF10  55.252361        1.922799    0.0        0.0
4         JAK2  55.144421        1.915897    0.0        0.0
...        ...        ...             ...    ...        ...
17693    RPS29 -45.930557       -0.408103    0.0        0.0
17694    RPS19 -46.330307       -0.402818    0.0        0.0
17695     RPSA -46.638294       -0.685427    0.0        0.0
17696   MALAT1 -47.290688       -0.230423    0.0        0.0
17697     RPS3 -47.496426       -0.462448    0.0        0.0

17698 rows × 5 columns


Get the results for each cluster and save:

for cluster in np.unique(adata_clean.obs['leiden']):
    df = sc.get.rank_genes_groups_df(adata_clean, group = str(cluster), key = 'on_norm_log')
    df.to_csv('path/to/save/wilcoxon_cluster_' + str(cluster) + '_vs_rest.csv')

Other approaches:#

If you wish to test other DEA approaches, please consider the following suggestions:

  1. Pseudobulk-based edgeR analysis (https://www.sc-best-practices.org/conditions/differential_gene_expression.html)

  2. Memento: https://www.biorxiv.org/content/10.1101/2022.11.09.515836v1, yelabucsf/scrna-parameter-estimation

Explore data using prior knowledge#


When it comes to cluster annotation, we usually have to tap into prior knowledge of the cell type. Depending on the data set, this may involve extensive literature search.


In the case of PBMCs, we may refer to several studies and single-cell RNA-sequencing data analysis tutorials to extract marker gene sets.


The following list is extracted from the Seurat tutorial on PBMCs.

Marker Gene       Cell Type
IL7R              CD4 T cells
CD14, LYZ         CD14+ Monocytes
MS4A1             B cells
CD8A              CD8 T cells
FCGR3A, MS4A7     FCGR3A+ Monocytes
GNLY, NKG7        NK cells
FCER1A, CST3      Dendritic Cells
PPBP              Megakaryocytes
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

Tasks: Annotate the clusters. Check briefly if all marker genes are present in the dataset and visualise the marker genes in a UMAP (or another visualisation of your choice). You can use auxiliary plots like matrixplot, dotplot, heatmap or violin plots, or color an embedding (e.g. UMAP) by the marker genes.


Let us check if the marker genes are expressed in our dataset.

np.in1d(marker_genes, adata_clean.var_names)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True])
# plots (here we fill the templates with the cleaned object, the marker gene list and the Leiden clusters computed above)

sc.pl.dotplot(adata_clean, 
              var_names = marker_genes,
              groupby = 'leiden', 
              use_raw=False)

sc.pl.heatmap(adata_clean, 
              var_names = marker_genes, 
              figsize=(5,10),
              groupby = 'leiden', 
              use_raw=False, vmin=0)

sc.pl.matrixplot(adata_clean, 
                 var_names = marker_genes,
                 groupby = 'leiden', 
                 use_raw=False, vmin=0)

sc.pl.stacked_violin(adata_clean,
                     var_names = marker_genes, 
                     groupby = 'leiden', 
                     use_raw=False)

Annotate clusters and create a new covariate.

Cluster ID    Marker Gene       Cell Type
              IL7R              CD4 T cells
              CD14, LYZ         CD14+ Monocytes
              MS4A1             B cells
              CD8A              CD8 T cells
              FCGR3A, MS4A7     FCGR3A+ Monocytes
              GNLY, NKG7        NK cells
              FCER1A, CST3      Dendritic Cells
              PPBP              Megakaryocytes

Use the pandas data frame functionality to rename your clusters and visualize your annotation.

adata_clean.obs['annotated'] = adata_clean.obs['leiden'].cat.add_categories(['CD4 T cells', 
                        'CD14+ Monocytes', 'B cells', 'CD8 T cells', 
                        'FCGR3A+ Monocytes', 'NK cells', 'Dendritic cells', 'Megakaryocytes'])

adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [ #add cluster name here (as string)    
                                                            ])] = 'CD4 T cells'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'CD14+ Monocytes'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'B cells'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'CD8 T cells'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'FCGR3A+ Monocytes'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'NK cells'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'Dendritic cells'
adata_clean.obs['annotated'][np.in1d(adata_clean.obs['annotated'], [])] = 'Megakaryocytes'

# remove unused categories from annotation
adata_clean.obs['annotated'] = adata_clean.obs['annotated'].cat.remove_unused_categories()
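An alternative sketch using a plain mapping dictionary (the cluster IDs below are placeholders; clusters left out of the dictionary end up as NaN):

# Hypothetical mapping from Leiden cluster ID to cell type; fill in based on the marker genes above
cluster_to_celltype = {
    '0': 'CD4 T cells',
    '1': 'CD14+ Monocytes',
    # ... remaining clusters
}
adata_clean.obs['annotated'] = (
    adata_clean.obs['leiden'].map(cluster_to_celltype).astype('category')
)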
adata_clean.obs['annotated'].value_counts()

Task: Visualise your annotation on a UMAP as well as in a matrixplot, dotplot, heatmap or violin plots.

sc.pl.umap(adata_clean, color='annotated', legend_loc='on data', title='', frameon=False)
sc.pl.umap(adata_clean, color='annotated', title='', frameon=True)

sc.pl.dotplot(adata_clean, 
              var_names = marker_genes,
              groupby = 'annotated', 
              use_raw=False)

sc.pl.heatmap(adata_clean, 
              var_names = marker_genes, 
              figsize=(5,10),
              groupby = 'annotated', 
              use_raw=False, vmin=0)

sc.pl.matrixplot(adata_clean, 
                 var_names = marker_genes, 
                 groupby = 'annotated', 
                 use_raw=False, vmin=0)

sc.pl.stacked_violin(adata_clean, 
                     var_names = marker_genes, 
                     groupby = 'annotated', 
                     use_raw=False)

Save results for future use#

adata_clean

AnnData object with n_obs × n_vars = 11827 × 17698
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'mt_pct_content', 'ribo_pct_content', 'G2_M_checkpoint', 'leiden', 'pheno_leiden'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'draw_graph', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors_30', 'pca', 'pheno_jaccard_q', 'pheno_leiden_colors', 'umap'
    obsm: 'X_draw_graph_fa', 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_data'
    obsp: 'neighbors_30_connectivities', 'neighbors_30_distances', 'pheno_jaccard_ig'
adata_clean.write_h5ad(input_path + 'processed/annotated_data.h5ad')