See https://github.com/GregorySchwartz/too-many-cells for latest version.
See the bioRxiv paper for more information about the algorithm.
too-many-cells is a suite of tools, algorithms, and visualizations focusing on
the relationships between cell clades. This includes new ways of clustering,
plotting, choosing differential expression comparisons, and more! While
too-many-cells was intended for single cell RNA-seq, any abundance data in any
domain can be used. Rather than opt for a unique positioning of each cell using
dimensionality reduction approaches like t-SNE and PCA, too-many-cells
recursively divides cells into clusters and relates clusters rather than
individual cells. In fact, by recursively dividing until further dividing would
be considered noise or random partitioning, we can eliminate noisy relationships
at the fine-grain level. The resulting binary tree serves as a basis for a
different perspective of single cells, using our birch-beer visualization
and tree measures to describe simultaneously large and small populations,
without additional parameters or runs. See below for a full list of features.
- A new R wrapper was written to quickly get data to and from too-many-cells from R. Check it out here!
- Now works with Cellranger 3.0 matrices in addition to Cellranger 2.0.
- Can prune (make into leaves) specified nodes with --custom-cut.
We provide multiple ways to install too-many-cells. We recommend installing
with stack. We also provide docker images and a Dockerfile to use on any system
in case you have a custom build (for instance, a non-standard R installation)
or difficulty installing. macOS and Windows users: too-many-cells was built and
tested on Linux, so we highly recommend using the docker image (which is a
completely isolated environment that requires no compiling or installation,
other than docker itself), as there may be difficulties in installing the
dependencies. There are, however, additional instructions for macOS below if
you really want to compile it.
You may require the following dependencies to build and run (from Ubuntu 14.04, use the appropriate packages from your distribution of choice):
- build-essential
- libgmp-dev
- libblas-dev
- liblapack-dev
- libgsl-dev
- libgtk2.0-dev
- libcairo2-dev
- libpango1.0-dev
- graphviz
- r-base
- r-base-dev
To install them, in Ubuntu:
sudo apt install build-essential libgmp-dev libblas-dev liblapack-dev libgsl-dev libgtk2.0-dev libcairo2-dev libpango1.0-dev graphviz r-base r-base-dev
too-many-cells also uses the following packages from R:
- cowplot
- ggplot2
- edgeR
- jsonlite
To install them in R,
install.packages(c("ggplot2", "cowplot", "jsonlite"))
install.packages("BiocManager")
BiocManager::install("edgeR")
See https://docs.haskellstack.org/en/stable/README/ for more details.
curl -sSL https://get.haskellstack.org/ | sh
stack setup
Probably the easiest method if you don’t want to mess with dependencies (outside of the ones above).
git clone https://github.com/GregorySchwartz/too-many-cells.git
cd too-many-cells
stack install
We only require stack (or cabal). You do not need to download any source
code (but you might need the stack.yaml dependency versions); just run the
following command to place too-many-cells in your ~/.local/bin/:
stack install too-many-cells
If you run into errors like "Error: While constructing the build plan, the
following exceptions were encountered:", then follow its advice. Usually you
just need to follow the suggestion and add the dependencies to the specified
file. For a quick yaml configuration, refer to
https://github.com/GregorySchwartz/too-many-cells/blob/master/stack.yaml. The
build relies on eigen-3.3.4.1 right now.
Different computers have different setups, operating systems, and repositories.
To put the entire program in a container and bypass difficulties (with the other
methods above), we use docker. So first, install docker.
To get too-many-cells (replace 0.1.5.0 with any version needed):
docker pull gregoryschwartz/too-many-cells:0.1.5.0
To run too-many-cells in a docker container:
sudo docker run gregoryschwartz/too-many-cells:0.1.5.0 -h
Docker won’t be able to find your files by default. You need to mount the
folders with -v in order to have docker read from and write to the host
filesystem. Read the documentation about volumes for more information.
Essentially, -v /path/to/matrix/on/host:/input_matrix with -m /input_matrix
is what you want, where the path before the : is on the host filesystem
while the path after the : is what the docker program sees. Then you can write
the output in the same way: -v /path/to/output/on/host:/output will write the
output to the host folder before the :.
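To make the pairing between mounts and arguments concrete, here is a small Python sketch that assembles such a docker command as an argument list. The helper name docker_make_tree_command is hypothetical and not part of too-many-cells; it only combines the -v, -m, and --output arguments and the make-tree entry point that appear in this document, and the host paths are placeholders.

```python
from typing import List


def docker_make_tree_command(
    host_matrix: str,
    host_output: str,
    image: str = "gregoryschwartz/too-many-cells:0.1.5.0",
) -> List[str]:
    """Build a `docker run` argument list for make-tree.

    Hypothetical helper for illustration only. Each -v pair maps a host
    path (before the colon) to the path seen inside the container
    (after the colon).
    """
    return [
        "docker", "run",
        "-v", f"{host_matrix}:/input_matrix",
        "-v", f"{host_output}:/output",
        image,
        "make-tree",
        "-m", "/input_matrix",   # container-side path of the matrix
        "--output", "/output",   # container-side path of the output folder
    ]


cmd = docker_make_tree_command(
    "/path/to/matrix/on/host",
    "/path/to/output/on/host",
)
print(" ".join(cmd))
```

Note that the program arguments refer to the container-side paths, never the host paths.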
To build the too-many-cells image yourself:
git clone https://github.com/GregorySchwartz/too-many-cells.git
cd too-many-cells
docker build -t too-many-cells -f ./Dockerfile .
We recommend using docker on macOS. If you need to build too-many-cells, you
should get the above dependencies. For some dependencies, you can use brew,
then install too-many-cells (in the cloned folder; don’t forget to install the
R dependencies above):
brew install --cask xquartz
brew install glib cairo gtk gettext fontconfig freetype
brew tap brewsci/bio
brew tap brewsci/science
brew install r zeromq graphviz pkg-config gsl libffi gobject-introspection gtk+ gtk+3
# Needed so pkg-config and libraries can be found.
# For the second path, use the output of "brew info libffi".
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
# Tell gtk that it's quartz
stack install --flag gtk:have-quartz-gtk
I am getting errors like AesonException "Error in $.packages.cassava.constraints.flags... when running stack commands
Try upgrading stack with stack upgrade. The new installation will be in
~/.local/bin, so use that binary.
I use conda or custom ld library locations and I cannot install too-many-cells or run into weird R errors
stack and too-many-cells assume system libraries and programs. To solve this
issue, first install the dependencies above at the system level, including
system R. Then prepend PATH="$HOME/.local/bin:/usr/bin:$PATH" to every stack
and too-many-cells command. For instance:
PATH="$HOME/.local/bin:/usr/bin:$PATH" stack install
PATH="$HOME/.local/bin:/usr/bin:$PATH" too-many-cells make-tree -h
Open an issue! While the issue is being worked on, try out the docker image for
too-many-cells; it requires no installation at all (other than docker).
This project is a collection of libraries and programs written specifically for
too-many-cells:
- birch-beer - Generate a tree for displaying a hierarchy of groups with colors, scaling, and more.
- modularity - Find the modularity of a network.
- spectral-clustering - Library for spectral clustering.
- hierarchical-spectral-clustering - Hierarchical spectral clustering of a graph.
- differential - Finds out whether an entity comes from different distributions (statuses).
too-many-cells has several entry points depending on the desired analysis.
Argument | Analysis |
---|---|
make-tree | Generate the tree from single cell data with various measurement outputs and visualize tree |
interactive | Interactive visualization of the tree, very slow |
differential | Find differentially expressed genes between two nodes |
diversity | Conduct diversity analyses of multiple cell populations |
paths | The binary tree equivalent of the so-called “pseudotime”, or 1D dimensionality reduction |
The main workflow is to first generate and plot the population tree using
too-many-cells make-tree, then use the rest of the entry points as needed.
At any point, use -h to see the help of each entry point.
Also, check out tooManyCellsR for an R wrapper!
too-many-cells make-tree generates a binary tree using hierarchical spectral
clustering. We start with all cells in a single node. Spectral clustering
partitions the cells into two groups. We assess the clustering using
Newman-Girvan modularity: if \(Q > 0\) then we recursively continue with
hierarchical spectral clustering. If not, then there is only a single community
and we do not partition; the resulting node is a leaf and is considered the
finest-grain cluster.
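The stopping rule can be illustrated with a small Python sketch of Newman-Girvan modularity for a given two-way partition. This is not the too-many-cells implementation: the function names are hypothetical, the graph is a plain symmetric adjacency matrix, and the partition is supplied by hand rather than produced by spectral clustering.

```python
from typing import List, Sequence, Tuple


def adjacency_from_edges(n: int, edges: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Build a symmetric, unweighted adjacency matrix from an edge list."""
    matrix = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        matrix[a][b] = 1.0
        matrix[b][a] = 1.0
    return matrix


def modularity(adjacency: Sequence[Sequence[float]], groups: Sequence[int]) -> float:
    """Q = 1/(2m) * sum_ij (A_ij - k_i * k_j / (2m)) over pairs in the same group."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i, row in enumerate(adjacency):
        for j, weight in enumerate(row):
            if groups[i] == groups[j]:
                q += weight - degrees[i] * degrees[j] / two_m
    return q / two_m


def should_split(adjacency: Sequence[Sequence[float]], groups: Sequence[int]) -> bool:
    """Keep dividing only while the proposed partition has Q > 0."""
    return modularity(adjacency, groups) > 0


# Two triangles joined by a single edge: a clear two-community graph.
two_triangles = adjacency_from_edges(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
)
# A complete graph on four nodes: no community structure to find.
clique = adjacency_from_edges(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])

print(should_split(two_triangles, [0, 0, 0, 1, 1, 1]))
print(should_split(clique, [0, 0, 1, 1]))
```

In this toy example the two triangles give a positive Q and would be split further, while cutting the clique in half gives a negative Q, so that node would stay a leaf.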
The most important argument is the --prior argument. Making the tree may
take some time, so if the tree was already generated and other analyses or
visualizations need to be run on the tree, point the --prior argument to the
output folder from a previous run of too-many-cells. If you do not use
--prior, the entire tree will be recalculated even if you just wanted to
change the visualization!
The main input is the --matrix-path argument. When a directory is supplied,
too-many-cells expects the folder to contain matrix.mtx, genes.tsv, and
barcodes.tsv files (cellranger outputs, see cellranger for specifics). If
a file is supplied instead of a directory, we assume a csv file containing
gene row names and cell column names. This argument can be called multiple times
to combine multiple single cell matrices: --matrix-path input1 --matrix-path
input2.
The second most important argument is --labels-file. Supply a csv with
a format and header of “item,label” to provide colorings and statistics of the
relationships between labels. Here the “item” column contains the name of each
cell (barcode) and the label is any property of the cell (the tissue of origin,
hour in a time course, celltype, etc.).
To see the full list of options, use too-many-cells -h and -h for each entry
point (e.g. too-many-cells make-tree -h).
too-many-cells make-tree generates several files in the output folder. Below
is a short description of each file.
File | Description |
---|---|
clumpiness.csv | When labels are provided, uses the clumpiness measure to determine the level of aggregation between each label within the tree. |
clumpiness.pdf | When labels are provided, a figure of the clumpiness between labels. |
cluster_diversity.csv | When labels are provided, the diversity, or “effective number of labels”, of each cluster. |
cluster_info.csv | Various bits of information for each cluster and the path leading up to each cluster, from that cluster to the root. For instance, the size column has cluster_size/parent_size/parent_parent_size/.../root_size |
cluster_list.json | The json file containing a list of clusterings. |
cluster_tree.json | The json file containing the output tree in a recursive format. |
dendrogram.svg | The visualization of the tree. There are many possible options for this visualization included. Can rename to choose between PNG, PS, PDF, and SVG using --dendrogram-output. |
graph.dot | A dot file of the tree, with less information than the tree in cluster_tree.json. |
node_info.csv | Various information of each node in the tree. |
projection.pdf | When --projection is supplied with a file of the format “barcode,x,y”, provides a plot of each cell at the specified x and y coordinates (for instance, when looking at t-SNE plots with the same labelings as the dendrogram here). |
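As a small illustration of the path-style columns in cluster_info.csv, the Python sketch below splits a size field of the form cluster_size/parent_size/.../root_size into numbers. The function names and the example value are made up for illustration, and we assume the sizes are written as integers.

```python
from typing import List


def parse_size_path(size_field: str) -> List[int]:
    """Split 'cluster_size/parent_size/.../root_size' into integers.

    The first element is the cluster itself, the last is the root.
    """
    return [int(part) for part in size_field.split("/")]


def fraction_of_root(size_field: str) -> float:
    """Share of all items (root size) that ended up in this cluster."""
    sizes = parse_size_path(size_field)
    return sizes[0] / sizes[-1]


example_size = "120/450/1981"  # made-up value, not from a real run
print(parse_size_path(example_size))
print(round(fraction_of_root(example_size), 4))
```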
The basic outline of the default pre-processing pipeline with some relevant
options is as follows (there are many additional options, including cell
whitelists and PCA, that can be seen using too-many-cells make-tree -h):
- Read matrix.
- Remove cells with fewer than 250 counts (--filter-thresholds, --no-filter).
- Remove genes with fewer than 1 count (--filter-thresholds, --no-filter).
- Term frequency-inverse document frequency normalization (--normalization).
- Finish.
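The steps above can be sketched in a few lines of Python on a tiny dense matrix. This is only an illustration under stated assumptions: the function names are hypothetical, the matrix is a list of gene rows, and the tf-idf shown is a common textbook form (count divided by the cell total, times the log of cells over cells expressing the gene), which may differ from the exact variant used by too-many-cells.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are genes (features), columns are cells


def filter_cells(matrix: Matrix, min_counts: float = 250) -> Matrix:
    """Drop cells (columns) whose total count is below min_counts."""
    n_cells = len(matrix[0]) if matrix else 0
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    keep = [j for j in range(n_cells) if totals[j] >= min_counts]
    return [[row[j] for j in keep] for row in matrix]


def filter_genes(matrix: Matrix, min_counts: float = 1) -> Matrix:
    """Drop genes (rows) whose total count is below min_counts."""
    return [row for row in matrix if sum(row) >= min_counts]


def tf_idf(matrix: Matrix) -> Matrix:
    """Textbook tf-idf; an assumed form, not necessarily the tool's."""
    n_cells = len(matrix[0]) if matrix else 0
    cell_totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    result = []
    for row in matrix:
        n_expressing = sum(1 for value in row if value > 0)
        idf = math.log(n_cells / n_expressing) if n_expressing else 0.0
        result.append([
            (value / cell_totals[j]) * idf if cell_totals[j] else 0.0
            for j, value in enumerate(row)
        ])
    return result


# Made-up counts: 3 genes by 3 cells. The third cell has only 15 counts
# and the second gene has none, so both are filtered out.
counts = [
    [300, 0, 10],
    [0, 0, 0],
    [100, 400, 5],
]
filtered = filter_genes(filter_cells(counts))
normalized = tf_idf(filtered)
print(filtered)
print(normalized)
```

A gene found in every remaining cell gets an inverse document frequency of zero in this form, which is why ubiquitous genes contribute little after normalization.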
We start with our input matrix. Here, the input is a folder in the cellranger matrix format:
ls ./input
barcodes.tsv genes.tsv matrix.mtx
Note that the input can be a directory (with the cellranger matrix format
above) or a file (a csv file). You can also point to a cellranger >= 3.0
folder which has matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz files
instead. You don’t need to use scRNA-seq data! You can use any data that has
observations (cells) and features (genes), as long as you agree that the
observations are related by their feature abundances. If you do upstream batch
effect correction, PCA, normalization, or anything else, be sure to use
--no-filter --normalization NoneNorm to avoid wrong filters and scalings! As
for formats, the directory input consists of three files like so:
The matrix.mtx file is in matrix market format.
%%MatrixMarket matrix coordinate integer general
%
23433 1981 4255069
4 1 1
5 1 1
11 1 2
23 1 2
25 1 2
40 1 2
48 1 1
...
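To show how the coordinate layout above is read, here is a minimal Python sketch that parses a matrix market coordinate file into a dictionary and totals the counts per column (cell). The function names are hypothetical and the tiny example matrix is made up; for real data a dedicated library reader is the better choice.

```python
import io
from typing import Dict, List, TextIO, Tuple

Entries = Dict[Tuple[int, int], float]


def read_matrix_market(handle: TextIO) -> Tuple[int, int, Entries]:
    """Parse a coordinate-format file into (n_rows, n_cols, entries).

    Lines starting with % are header or comment lines. The first other
    line holds 'rows columns nonzeros'; every later line is
    'row column value' with 1-based indices.
    """
    shape = None
    entries: Entries = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if shape is None:
            shape = (int(fields[0]), int(fields[1]))
            continue
        entries[(int(fields[0]), int(fields[1]))] = float(fields[2])
    if shape is None:
        raise ValueError("missing size line")
    return shape[0], shape[1], entries


def column_totals(n_cols: int, entries: Entries) -> List[float]:
    """Total count per column, that is, per cell."""
    totals = [0.0] * n_cols
    for (_, col), value in entries.items():
        totals[col - 1] += value
    return totals


# Made-up 3 gene by 2 cell matrix.
example_mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "%\n"
    "3 2 3\n"
    "1 1 4\n"
    "3 1 2\n"
    "2 2 7\n"
)
n_rows, n_cols, entries = read_matrix_market(example_mtx)
print(n_rows, n_cols, column_totals(n_cols, entries))
```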
The genes.tsv file (or features.tsv.gz) contains the features of each cell
and corresponds to the rows of matrix.mtx. Here, both columns were the same
gene symbols, but you can have Ensembl as the first column and gene symbol as
the second, etc. The columns and column orders don’t matter, but make sure all
matrices have the same format and specify the symbols you want to use (for
overlaying gene expression, differential expression, etc.) with
--feature-column COLUMN. So to use the second column for gene expression, you
would use --feature-column 2.
Xkr4	Xkr4
Rp1	Rp1
Sox17	Sox17
Mrpl15	Mrpl15
Lypla1	Lypla1
Tcea1	Tcea1
Rgs20	Rgs20
Atp6v1h	Atp6v1h
Oprk1	Oprk1
Npbwr1	Npbwr1
...
The barcodes.tsv file contains the ids of each cell or observation and
corresponds to the columns of matrix.mtx.
AAACCTGCAGTAACGG-1
AAACGGGAGAAGAAGC-1
AAACGGGAGACCGGAT-1
AAACGGGAGCGCTCCA-1
AAACGGGAGGACGAAA-1
AAACGGGAGGTACTCT-1
AAACGGGAGGTGCTTT-1
AAACGGGAGTCGAGTG-1
AAACGGGCATGGTCAT-1
AAAGATGAGCTTCGCG-1
...
For a csv file, the format is dense (observation columns (cells), feature rows
(genes)):
"","A22.D042044.3_9_M.1.1","C5.D042044.3_9_M.1.1","D10.D042044.3_9_M.1.1","E13.D042044.3_9_M.1.1","F19.D042044.3_9_M.1.1","H2.D042044.3_9_M.1.1","I9.D042044.3_9_M.1.1",...
"0610005C13Rik",0,0,0,0,0,0,0,...
"0610007C21Rik",0,112,185,54,0,96,42,...
"0610007L01Rik",0,0,0,0,0,153,170,...
"0610007N19Rik",0,0,0,0,0,0,0,...
"0610007P08Rik",0,0,0,0,0,19,0,...
"0610007P14Rik",0,58,0,0,255,60,0,...
"0610007P22Rik",0,0,0,0,0,65,0,...
"0610008F07Rik",0,0,0,0,0,0,0,...
"0610009B14Rik",0,0,0,0,0,0,0,...
...
We also know where each cell came from, so we mark that down as well in a
labels.csv file.
item,label
AAACCTGCAGTAACGG-1,Marrow
AAACGGGAGACCGGAT-1,Marrow
AAACGGGAGCGCTCCA-1,Marrow
AAACGGGAGGACGAAA-1,Marrow
AAACGGGAGGTACTCT-1,Marrow
...
This can be easily accomplished with sed (the item,label header still needs to
be added as the first line):
cat barcodes.tsv | sed "s/-1/-1,Marrow/" | sed "s/-2/etc.../" > labels.csv
For cellranger, note that the -1, -2, etc. suffixes denote the first,
second, etc. label in the aggregation csv file used as input for cellranger
aggr.
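As an alternative to sed, the following Python sketch builds the labels file from the barcode suffixes, header included. The function names are hypothetical, the second barcode is made up, and the suffix-to-label mapping is only an example: the real mapping depends on the order of samples in your cellranger aggr input.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple


def labels_from_barcodes(
    barcodes: Iterable[str], suffix_to_label: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair each barcode with the label assigned to its numeric suffix.

    Raises KeyError if a suffix has no entry in suffix_to_label.
    """
    rows = []
    for barcode in barcodes:
        barcode = barcode.strip()
        if not barcode:
            continue
        suffix = barcode.rsplit("-", 1)[-1]
        rows.append((barcode, suffix_to_label[suffix]))
    return rows


def write_labels(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write rows in the item,label format expected by --labels-file."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["item", "label"])
    writer.writerows(rows)


# Example mapping only; the second barcode is made up for illustration.
example_barcodes = ["AAACCTGCAGTAACGG-1", "AAACGGGAGACCGGAT-2"]
example_mapping = {"1": "Marrow", "2": "Thymus"}

buffer = io.StringIO()
write_labels(labels_from_barcodes(example_barcodes, example_mapping), buffer)
print(buffer.getvalue(), end="")
```

To write an actual file, pass an open file handle (for instance one opened on labels.csv with newline="") instead of the in-memory buffer.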
We can now run the too-many-cells algorithm on our data. The resulting cells
with assigned clusters will be printed to stdout (don’t forget to use
--no-filter and --normalization NoneNorm on preprocessed data, as stated
above).
too-many-cells make-tree \
--matrix-path input \
--labels-file labels.csv \
--draw-collection "PieRing" \
--output out \
> clusters.csv
Large cell populations can result in a very large tree. What if we only want to
see larger subpopulations rather than both the large (inner nodes) and small
(leaves)? We can use the --min-size 100 argument to set the minimum size of a
leaf to 100 in this case. Alternatively, we can specify --smart-cutoff 4 in
addition to --min-size 1 to set the minimum size of a node to
\(4 \times \text{MAD}\), where MAD is the median absolute deviation of the
nodes in the original tree. Varying the number of MADs varies the number of
leaves in the tree. --smart-cutoff should be used in addition to --min-size,
--max-proportion, or --min-distance to decide which cutoff variable to use. The
value supplied to the cutoff variable is ignored when --smart-cutoff is
specified. We’ll prune the tree for better visibility in this document.
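For intuition about the MAD-based cutoff, here is a short Python sketch that computes the median absolute deviation of a list of node sizes and multiplies it by the requested number of MADs. It takes the description above literally; the function names and node sizes are made up, and the tool itself may define the threshold differently (for example, relative to the median).

```python
import statistics
from typing import Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(value - center) for value in values])


def smart_cutoff(values: Sequence[float], n_mads: float) -> float:
    """Cutoff as n_mads times the MAD (literal reading of the text)."""
    return n_mads * median_absolute_deviation(values)


node_sizes = [1, 2, 4, 8, 16, 32, 64]  # made-up node sizes
print(median_absolute_deviation(node_sizes))
print(smart_cutoff(node_sizes, 4))
```

Because the MAD is based on medians, a few very large nodes near the root barely move the cutoff, which is the usual reason to prefer it over a standard deviation.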
Note: the pruning arguments change the tree file, not just the plot, so be sure to output into a different directory.
Also, we do not need to recalculate the entire tree! We can just supply the
previous results using --prior (we can also leave out --matrix-path when using
--prior to speed things up, at the cost of the features that need the matrix):
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieRing" \
--output out_pruned \
> clusters_pruned.csv
What if we want pie charts instead of showing each individual cell (the default)?
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieChart" \
--output out_pruned \
> clusters_pruned.csv
Now that we see the relationships between clusters and nodes in the dendrogram, how can we go back to the data – which nodes represent which node IDs in the data?
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieChart" \
--draw-node-number \
--output out_pruned \
> clusters_pruned.csv
We can also change the width of the nodes and branches, for instance if we want thinner branches:
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieChart" \
--draw-max-node-size 40 \
--output out_pruned \
> clusters_pruned.csv
We can remove all scaling for a normal tree and still control the branch widths:
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieChart" \
--draw-max-node-size 40 \
--draw-no-scale-nodes \
--output out_pruned \
> clusters_pruned.csv
How strong is each split? We can tell by drawing the modularity of the children on top of each node:
too-many-cells make-tree \
--prior out \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-collection "PieChart" \
--draw-mark "MarkModularity" \
--output out_pruned \
> clusters_pruned.csv
What if we want to draw the gene expression onto the tree in another folder?
This requires --matrix-path and may take some time depending on matrix size.
The colors default to all black if the feature name is not present in the
matrix, so check the first column of the feature file. Note: the feature names
are from the genes.tsv or features.tsv.gz file. Usually, cellranger has Ensembl
identifiers as the first column and gene symbol as the second column, so if you
want to specify gene symbol, use --feature-column 2 (1 is default).
too-many-cells make-tree \
--prior out \
--matrix-path input \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--feature-column 2 \
--draw-leaf "DrawItem (DrawContinuous \"Cd4\")" \
--output out_gene_expression \
> clusters_pruned.csv
While this representation shows the expression of Cd4 in each cell and blends
those levels together, due to the sparsity of single cell data these cells and
their respective subtrees may be hard to see without additional processing.
Let’s scale the saturation to more clearly see sections of the tree with our
desired expression (when choosing other high and low colors with
--draw-colors, scaling the saturation will only affect non-grayscale colors).
too-many-cells make-tree \
--prior out \
--matrix-path input \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--feature-column 2 \
--draw-leaf "DrawItem (DrawContinuous \"Cd4\")" \
--draw-scale-saturation 10 \
--output out_gene_expression \
> clusters_pruned.csv
There, much better! Now it’s clearly enriched in the subtree containing the
thymus, where we would expect many T cells to be. While this tree makes the
expression a bit more visible, there is another tactic we can use. Instead of
the continuous color spectrum of expression values, we can have a binary “high”
and “low” expression. Here, we’ll continue to have the red and gray colors
represent high and low expressions respectively using the --draw-colors
argument. Note that this binary expression technique can be used for multiple
features, hence it’s a list of features with cutoffs so you can be high in a
gene and low in another gene, etc. for all possible combinations.
too-many-cells make-tree \
--prior out \
--matrix-path input \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--feature-column 2 \
--draw-leaf "DrawItem (DrawThresholdContinuous [(\"Cd4\", 0), (\"Cd8a\", 0)])" \
--draw-colors "[\"#e41a1c\", \"#377eb8\", \"#4daf4a\", \"#eaeaea\"]" \
--draw-scale-saturation 10 \
--output out_gene_expression \
> clusters_pruned.csv
Now we can see the expression of both Cd4 and Cd8a at the same time!
We can also see an overview of the diversity of cell labels within each subtree and leaves.
too-many-cells make-tree \
--prior out \
--matrix-path input \
--labels-file labels.csv \
--smart-cutoff 4 \
--min-size 1 \
--draw-leaf "DrawItem DrawDiversity" \
--output out_diversity \
> clusters_pruned.csv
Here, the deeper the red, the more diverse (a larger “effective number of cell states”) the cell labels in that group are. Note that the inner nodes are colored relative to themselves, while the leaves are colored relative to all leaves, so there are two different scales.
The interactive entry point has a basic GUI for quick plotting with
a few features. We recommend limited use of this feature, however,
as it can be quite slow at this stage, has fewer customizations, and requires
specific dependencies.
too-many-cells interactive \
--prior out \
--labels-file labels.csv
A main use of single cell clustering is to find differential genes between
multiple groups of cells. The differential entry point aids in this endeavor by
allowing comparisons with edgeR. Let’s find the differential genes between the
liver group and all other cells. Consider our pruned tree from earlier:
We can see the id of each group with --draw-node-number.
We need to define two groups to compare. Well, it looks like node 98 defines the
liver cluster. Then, since we don’t want 98 to be in the other group, we say
that all other cells are within nodes 89 and 1. As a result, we end up with a
tuple containing two lists: ([89, 1], [98]). Then our differential genes for
(liver / others) can be found with differential (sent to stdout):
too-many-cells differential \
--matrix-path input \
-n "([89, 1], [98])" \
> differential.csv
If we wanted to make the same comparison, but compare the liver subtree with
liver cells from all other subtrees, we can use the --labels argument:
too-many-cells differential \
--matrix-path input \
--labels-file labels.csv \
-n "([89, 1], [98])" \
--labels "([\"Liver\"], [\"Liver\"])" \
> differential_liver.csv
We can also look at the distribution of abundance for individual genes using the
--genes and --plot-output arguments.
Furthermore, we can compare each node to all other cells by specifying no nodes
at all. The output file will contain the top --top-n genes for each node. We
recommend using multiple OS threads here to speed up the process using +RTS
-N${NUMOSTHREADS} (no number to use all cores). The following example will
compare all nodes to all other cells using 8 OS threads:
too-many-cells differential \
--matrix-path input \
-n "([], [])" \
--normalization "UQNorm" \
+RTS -N8
Diversity is the measure of the “effective number of entities within a system”,
originating from ecology (See Jost: Entropy and Diversity). Here, each cell is
an organism and each cell label or cluster is a species, depending on the
question. In ecology, the diversity index measures the effective number of
species within a population such that the minimum is a diversity of 1 for a
single dominant species up to a maximum of the total number of species (evenly
abundant). If our species is a cluster, then here the diversity is the effective
number of cell states within a population (for labels, make-tree generates
these results automatically in “diversity” columns). Say we have two populations
and we generated the trees using make-tree into two different output folders,
out1 and out2. We can find the diversity of each population using the
diversity entry point.
too-many-cells diversity \
--priors out1 \
--priors out2 \
-o out_diversity_stats
We can then find a simple plot of diversity in the output folder
(out_diversity_stats here). In addition, we also provide rarefaction curves,
which show the number of different cell states at each subsampling and are
useful for comparing the number of cell states where the population sizes
differ.
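To make the “effective number” idea concrete, here is a Python sketch of the diversity of a list of labels and of the expected number of distinct labels in a random subsample, which is the quantity a rarefaction curve plots. These are the standard ecological formulas rather than code from too-many-cells; the function names are hypothetical, and the default order of 1 is an assumption since the order used by the tool is not stated here.

```python
import math
from collections import Counter
from fractions import Fraction
from typing import Hashable, Sequence


def diversity(labels: Sequence[Hashable], order: float = 1.0) -> float:
    """Effective number of labels (diversity of the given order).

    Order 1 is the exponential of Shannon entropy; other orders use
    (sum of p_i ** order) ** (1 / (1 - order)).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    proportions = [count / total for count in counts.values()]
    if order == 1:
        return math.exp(-sum(p * math.log(p) for p in proportions))
    return sum(p ** order for p in proportions) ** (1 / (1 - order))


def rarefaction(labels: Sequence[Hashable], subsample_size: int) -> float:
    """Expected number of distinct labels when drawing subsample_size
    items without replacement."""
    counts = Counter(labels)
    total = sum(counts.values())
    if not 0 <= subsample_size <= total:
        raise ValueError("subsample_size must be between 0 and the population size")
    all_draws = math.comb(total, subsample_size)
    # A label is missed when every drawn item comes from the other labels.
    expected = sum(
        1 - Fraction(math.comb(total - count, subsample_size), all_draws)
        for count in counts.values()
    )
    return float(expected)


even = ["a"] * 5 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5   # four equal labels
skewed = ["a"] * 17 + ["b", "c", "d"]                  # one dominant label

print(round(diversity(even), 3), round(diversity(skewed), 3))
print([round(rarefaction(skewed, n), 3) for n in (1, 5, 10, 20)])
```

Both example populations contain four labels, but the skewed one has a much lower diversity; the rarefaction values show how many of its labels one expects to see at smaller sample sizes, which is what allows populations of different sizes to be compared.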
“Pseudotime” refers to the one dimensional relationship between cells, useful
for looking at the ordering of cell states or labels. From the too-many-cells
point of view, pseudotime is implemented by finding the distance
between all cells and the cells found in the longest path from the root in the
tree. Then each cell has a distance from the “start” and thus we plot those
distances.
too-many-cells paths \
--prior out \
--labels-file labels.csv \
--bandwidth 3 \
-o out_paths
Each entry point has its own documentation accessible with -h, such as
too-many-cells make-tree -h:
too-many-cells -h
too-many-cells, Gregory W. Schwartz. Clusters and analyzes single cell data.
Usage: too-many-cells (make-tree | interactive | differential | diversity | paths)
Available options:
  -h,--help                Show this help text
Available commands:
  make-tree
  interactive
  differential
  diversity
  paths
Check out an instructional example of using too-many-cells here when finished
looking at the brief feature overview.