diff --git a/docs/geniml/datasets.md b/docs/geniml/datasets.md
new file mode 100644
index 0000000..68fc02f
--- /dev/null
+++ b/docs/geniml/datasets.md
@@ -0,0 +1,61 @@
+# Genomic datasets with `geniml`
+
+## Overview
+Genomic datasets are notoriously large. Using them in machine learning pipelines therefore requires careful strategies for handling high volumes of data, which is particularly challenging when training new models. To that end, `geniml` has two separate workflows for handling genomic data: one for model training and a second for model inference (using a pre-trained model). These workflows differ substantially in *when* and *where* genomic datasets are tokenized and stored.
+
+## Model training
+Model training, especially pre-training, usually requires large datasets with billions of genomic tokens. These datasets are far too large to fit in memory and therefore must be streamed from disk during training. Naively, this means the data must be tokenized "on the fly" for each epoch, which is highly inefficient; for particularly large datasets, the majority of training time ends up being spent on tokenization alone. Therefore, the data should be *pre-tokenized* into an intermediate form that can be streamed in for each epoch. This removes tokenization entirely from the training procedure and increases efficiency (Fig. 1A).
+
+## Model inference
+Model inference is the process of using a pre-trained model to analyze new, unseen data. While the output of the model might vary (embeddings, labels, new data), the input is always the same: genomic tokens. Outside of rare circumstances, model inference does not involve large volumes of data, so pre-tokenization is *not necessary*. Instead, data is tokenized **in memory** and converted directly into the format required to pass through the model (Fig. 1B).
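+
+As a rough sketch of what this looks like in practice (the paths below are placeholders), a bed file can be tokenized with a universe-backed tokenizer and the resulting token ids converted to a tensor in memory:
+
+```python
+import torch
+
+from geniml.tokenization import ITTokenizer
+
+tokenizer = ITTokenizer("path/to/universe.bed")  # placeholder universe
+tokens = tokenizer.tokenize("path/to/query.bed")  # placeholder query file
+
+# in-memory token ids, ready to pass through a model
+token_ids = torch.tensor(tokens.ids)
+```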
+
+## Tokenization forms
+Given the above requirements, tokenizers need to be able to output tokenized genomic data in different forms and locations. For model training, tokenizers should take either bed files or `.h5ad` single-cell datasets and convert them into an intermediate `.gtok` file format; these `.gtok` files are consumed directly during model training. For model inference, tokenizers should take either bed files or `.h5ad` single-cell datasets and produce an in-memory representation of the tokens, typically a `torch.Tensor` or a Python list. The following table summarizes the format, location, and scenario in which data is tokenized:
+
+
+
+| | Where | What | When |
+| --------------- | --------- | ------------- | ----------------- |
+| Model training | On disk | `.gtok` files | Prior to training |
+| Model inference | In memory | `torch.Tensors` | On the fly |
+
+
+
+
+
+
+
+## Datasets in `geniml`
+`geniml` uses `pytorch` + `lightning` to train models. This ecosystem encourages the use of `torch`'s built-in `Dataset` classes to parallelize and batch the loading of data. Because training and fine-tuning models requires pre-tokenized data (`.gtok` files), `geniml` needs dataset classes that can consume these files. A first sketch will most likely look like:
+```python
+import os
+from typing import List, Union
+
+from torch.utils.data import IterableDataset
+
+
+class PretokenizedDataset(IterableDataset):
+    def __init__(self, data: Union[str, List[str]]):
+        self.data = data
+        if isinstance(data, str):
+            self._is_folder = True
+        elif isinstance(data, list) and isinstance(data[0], str):
+            self._is_folder = False
+        else:
+            raise ValueError("`data` must be a path to a folder or a list of `.gtok` files")
+
+    def __iter__(self):
+        if self._is_folder:
+            files = [os.path.join(self.data, f) for f in os.listdir(self.data)]
+        else:
+            files = self.data
+        # note: `.gtok` files are binary in practice; this just sketches the streaming pattern
+        for file in files:
+            with open(file, "r") as f:
+                for line in f.readlines():
+                    yield line.split(",")
+```
+Here, we are no longer tokenizing each epoch, but rather just streaming in data that has already been pre-tokenized. I still need to think about this in the context of fine-tuning and of datasets that require targets and labels.
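+
+In practice, a dataset like this would be wrapped in a `torch` `DataLoader` for batching. A minimal usage sketch, assuming the `PretokenizedDataset` class above and a placeholder folder of `.gtok` files:
+
+```python
+from torch.utils.data import DataLoader
+
+dataset = PretokenizedDataset("path/to/gtok_files")  # placeholder path
+
+# token lists vary in length, so collect each batch into a plain Python list
+loader = DataLoader(dataset, batch_size=32, collate_fn=list)
+
+for batch in loader:
+    ...  # forward pass, loss computation, optimizer step
+```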
+
+## `.gtok` file format
+The `.gtok` file format is a binary file where each token is stored as a 32-bit integer. This allows tokens to be stored in a very compact format with the ability to represent up to 4 billion unique tokens. Using our companion package `genimtools`, we can convert `.bed` files into `.gtok` files after tokenization.
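+
+For example, a minimal conversion sketch (placeholder paths; `ITTokenizer` and `write_tokens_to_gtok` are the same utilities used in the pre-tokenization tutorial):
+
+```python
+from genimtools.utils import write_tokens_to_gtok
+from geniml.tokenization import ITTokenizer
+
+tokenizer = ITTokenizer("path/to/universe.bed")  # placeholder universe
+tokens = tokenizer.tokenize("path/to/file.bed")
+
+# each token id is written to disk as a compact 32-bit integer
+write_tokens_to_gtok(tokens.ids, "path/to/file.gtok")
+```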
\ No newline at end of file
diff --git a/docs/geniml/img/geniml_tokenization_strategy.png b/docs/geniml/img/geniml_tokenization_strategy.png
new file mode 100644
index 0000000..bc9e69b
Binary files /dev/null and b/docs/geniml/img/geniml_tokenization_strategy.png differ
diff --git a/docs/geniml/img/geniml_tokenization_strategy.svg b/docs/geniml/img/geniml_tokenization_strategy.svg
new file mode 100644
index 0000000..ce991b2
--- /dev/null
+++ b/docs/geniml/img/geniml_tokenization_strategy.svg
@@ -0,0 +1,2187 @@
+
+
+
+
diff --git a/docs/geniml/manuscripts.md b/docs/geniml/manuscripts.md
new file mode 100644
index 0000000..794fc0f
--- /dev/null
+++ b/docs/geniml/manuscripts.md
@@ -0,0 +1,17 @@
+# Published manuscripts describing geniml components
+
+If you find `geniml` useful for your research, please cite us! It helps us convince others that our work is useful. Here is a list of published papers that describe different modules or components in the `geniml` package.
+
+---
+
+N. J. LeRoy et al., “Fast clustering and cell-type annotation of scATAC data with pre-trained embeddings,” bioRxiv, 2023, doi: 10.1101/2023.08.01.551452.
+
+J. Rymuza et al., “Methods for constructing and evaluating consensus genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.
+
+E. Gharavi, N. J. LeRoy, G. Zheng, A. Zhang, D. E. Brown, and N. C. Sheffield, “Joint representation learning for retrieval and annotation of genomic interval sets,” bioRxiv, 2023, doi: 10.1101/2023.08.21.554131.
+
+G. Zheng et al., “Methods for evaluating unsupervised vector representations of genomic regions,” bioRxiv, 2023, doi: 10.1101/2023.08.03.551899.
+
+B. Xue, O. Khoroshevskyi, R. A. Gomez, and N. C. Sheffield, “Opportunities and challenges in sharing and reusing genomic interval data,” Frontiers in Genetics, vol. 14, 2023-03, doi: 10.3389/fgene.2023.1155809.
+
+E. Gharavi et al., “Embeddings of genomic region sets capture rich biological associations in low dimensions,” Bioinformatics, 2021-03, doi: 10.1093/bioinformatics/btab439.
\ No newline at end of file
diff --git a/docs/geniml/tutorials/cell-type-annotation-with-knn.md b/docs/geniml/tutorials/cell-type-annotation-with-knn.md
new file mode 100644
index 0000000..3d887e8
--- /dev/null
+++ b/docs/geniml/tutorials/cell-type-annotation-with-knn.md
@@ -0,0 +1,58 @@
+# How to annotate cell-types with KNN
+In the [previous tutorial](./load-qdrant-with-cell-embeddings.md), we loaded a vector database with cell embeddings. In this tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
+
+If you have **not** completed the previous tutorial, make sure you have a vector database loaded with cell embeddings before proceeding.
+
+## What is K-nearest-neighbors (KNN) classification?
+According to [IBM](https://www.ibm.com/topics/knn), K-nearest-neighbors classification is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. Put more simply: KNN is a classification algorithm that uses the distance between an **unlabeled** data point and its **labeled** neighbors to classify the new data point.
+
+Assuming we have a vector-space of well-annotated cell embeddings, we can use KNN to classify new cell embeddings based on their proximity to the labeled cell embeddings.
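+
+To make this concrete, here is a tiny, self-contained sketch of KNN majority voting on toy data with plain `numpy` (the rest of this tutorial does the same thing, but against the vector database):
+
+```python
+from collections import Counter
+
+import numpy as np
+
+# toy labeled embeddings and their cell-type labels
+labeled = np.array([[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]])
+labels = ["B cell", "B cell", "T cell", "T cell"]
+
+query = np.array([0.05, 0.25])  # an unlabeled cell embedding
+k = 3
+
+# find the k closest labeled points and take a majority vote
+distances = np.linalg.norm(labeled - query, axis=1)
+nearest = np.argsort(distances)[:k]
+prediction = Counter(labels[i] for i in nearest).most_common(1)[0][0]
+print(prediction)  # "B cell"
+```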
+
+## Querying the vector database
+First, we need to generate new cell embeddings for the cells we want to annotate. **Note: it is imperative that the new cell embeddings are generated using the same model as the cell embeddings in the vector database.** The previous tutorial used `databio/r2v-luecken2021-hg38-v2` to generate cell embeddings. We will use the same model to generate new cell embeddings.
+
+```python
+import numpy as np
+import scanpy as sc
+
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("path/to/adata_unlabeled.h5ad")
+
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+```
+
+We can get embeddings of the dataset using the pre-trained model:
+
+```python
+embeddings = model.encode(adata)
+
+adata.obsm['scembed_X'] = np.array(embeddings)
+```
+
+Now that we have the new cell embeddings, we can query the vector database to find the K-nearest-neighbors of each cell embedding.
+
+```python
+from collections import Counter
+from qdrant_client import QdrantClient
+
+client = QdrantClient("localhost", port=6333)
+
+# Query the vector database
+k = 5 # set to whatever value you want, this is a hyperparameter
+
+predicted_cell_types = []
+for embedding in embeddings:
+    neighbors = client.search(
+        collection_name="luecken2021",
+        query_vector=embedding.tolist(),
+        limit=k,
+        with_payload=True,
+    )
+    cell_types = [neighbor.payload["cell_type"] for neighbor in neighbors]
+
+    # take the majority vote among the k nearest neighbors
+    predicted_cell_types.append(Counter(cell_types).most_common(1)[0][0])
+
+# attach the predicted labels to the AnnData object
+adata.obs["cell_type"] = predicted_cell_types
+```
+
+And just like that, we've annotated our cells with cell-type labels using KNN classification. We can improve this methodology by first clustering the unlabeled cells and then using the cluster centroids to query the vector database, which reduces the number of queries and speeds up annotation (a sketch of this idea follows below). Another approach would be to do a secondary consensus vote within each cluster and assign one label per cluster.
+
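+Below is a minimal sketch of the cluster-centroid variant. It assumes the unlabeled cells have already been clustered (for example with `sc.tl.leiden` into an `adata.obs["leiden"]` column) and that the `client`, collection name, and `k` from above are still in scope:
+
+```python
+from collections import Counter
+
+import numpy as np
+
+embeddings = np.asarray(adata.obsm["scembed_X"])
+
+cluster_labels = {}
+for cluster in adata.obs["leiden"].unique():
+    mask = (adata.obs["leiden"] == cluster).values
+    centroid = embeddings[mask].mean(axis=0)
+
+    neighbors = client.search(
+        collection_name="luecken2021",
+        query_vector=centroid.tolist(),
+        limit=k,
+        with_payload=True,
+    )
+    votes = [n.payload["cell_type"] for n in neighbors]
+    cluster_labels[cluster] = Counter(votes).most_common(1)[0][0]
+
+# one label per cluster, applied to every cell in that cluster
+adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_labels)
+```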
diff --git a/docs/geniml/tutorials/integrate-with-snapatac2.md b/docs/geniml/tutorials/integrate-with-snapatac2.md
new file mode 100644
index 0000000..0e225b7
--- /dev/null
+++ b/docs/geniml/tutorials/integrate-with-snapatac2.md
@@ -0,0 +1,93 @@
+# Using our models with SnapATAC2
+## Overview
+
+[SnapATAC2](https://github.com/kaizhang/SnapATAC2) is a flexible, versatile, and scalable single-cell omics analysis framework designed to process and analyze single-cell ATAC-seq data. SnapATAC2 is written in Rust with Python bindings and integrates seamlessly with `scanpy` and `anndata` objects, which makes it easy to use `geniml` models with SnapATAC2. Here's how you can do it:
+
+## Install tools
+
+Ensure that you have `geniml` and `SnapATAC2` installed. You can install both using `pip`:
+```bash
+pip install geniml snapatac2
+```
+
+## Download some data
+
+To get started, let's download some single-cell ATAC-seq data. We will use the [10x Genomics PBMC 10k dataset](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-atac-v2-chromium-controller-2-standard). The dataset contains 10,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor.
+
+You can easily grab the fragment files like so:
+```bash
+wget "https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz" -O pbmc_fragments.tsv.gz
+```
+
+## Pre-process with SnapATAC2
+
+Let's start by pre-processing the data with SnapATAC2. We will closely follow the [SnapATAC2 tutorial](https://kzhang.org/SnapATAC2/tutorials/pbmc.html) to get the data into an `anndata` object.
+
+### Import the data
+
+Let's import the data into `snapatac2`:
+```python
+from pathlib import Path
+import snapatac2 as snap
+
+fragment_file = Path("pbmc_fragments.tsv.gz")
+data = snap.pp.import_data(
+    fragment_file,
+    chrom_sizes=snap.genome.hg38,
+    file="pbmc.h5ad",  # Optional
+    sorted_by_barcode=False,
+)
+```
+### Run some basic quality control
+
+Using the `snapatac2` quality control functions, we can quickly assess the quality of the data:
+
+```python
+snap.pl.frag_size_distr(data, interactive=False)
+fig = snap.pl.frag_size_distr(data, show=False)
+fig.update_yaxes(type="log")
+fig.show()
+
+snap.metrics.tsse(data, snap.genome.hg38)
+snap.pl.tsse(data, interactive=False)
+
+snap.pp.filter_cells(data, min_counts=5000, min_tsse=10, max_counts=100000)
+```
+
+Next, we can add a tile matrix to the data, select features, and run `scrublet`, a doublet-detection algorithm:
+```python
+snap.pp.add_tile_matrix(data)
+snap.pp.select_features(data, n_features=250000)
+snap.pp.scrublet(data)
+
+# actually filter the cells
+snap.pp.filter_doublets(data)
+```
+
+With this, we have a clean `anndata` object that we can use with `geniml`.
+
+### Analyze with geniml
+
+We will use a Region2Vec model to generate embeddings and then cluster the cells. Since this dataset consists of PBMCs from a healthy donor, we will use the `databio/r2v-luecken2021-hg38-v2` model to generate embeddings: it was trained on the Luecken2021 dataset, a first-of-its-kind multimodal benchmark of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.
+
+```python
+import numpy as np
+import scanpy as sc
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("pbmc.h5ad")
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+
+adata.obsm['scembed_X'] = np.array(model.encode(adata))
+```
+
+With the embeddings, we can run a usual workflow like UMAP, clustering, and visualization:
+```python
+sc.pp.neighbors(adata, use_rep="scembed_X")
+sc.tl.umap(adata)
+
+sc.tl.leiden(adata)
+sc.pl.umap(adata, color="leiden")
+```
+
+And that's it! You've now used `geniml` with SnapATAC2. You can use the embeddings to annotate cell types, or perform other analyses. If you want to learn more about this, check out the [cell-type annotation](./cell-type-annotation-with-knn.md) tutorial.
diff --git a/docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md b/docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md
new file mode 100644
index 0000000..8188c9e
--- /dev/null
+++ b/docs/geniml/tutorials/load-qdrant-with-cell-embeddings.md
@@ -0,0 +1,82 @@
+# How to load a vector database with cell embeddings
+## Overview
+
+In this tutorial, we will show how to load a vector database with cell embeddings. There are many benefits to storing cell-embeddings in a vector database:
+1. **Speed**: Loading a vector database is much faster than re-encoding cells.
+2. **Reproducibility**: You can share your cell embeddings with others.
+3. **Flexibility**: You can use the same cell embeddings for many different analyses.
+4. **Interoperability**: You can use the same cell embeddings with many different tools.
+
+In a subsequent tutorial, we will show how to use a vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
+
+## Prerequisites
+There are two core components to this tutorial: 1) the pre-trained model, and 2) the vector database.
+
+**Pre-trained model:**
+We will use the `databio/r2v-luecken2021-hg38-v2` model. It was trained on the [Luecken2021](https://openreview.net/forum?id=gN35BGa1Rt) dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.
+
+**Vector database:**
+Vector databases are a new and exciting technology that allows you to store and query high-dimensional vectors very quickly. This tutorial will use the `qdrant` vector database. As a lab, we really like `qdrant` because it is fast, easy to use, and has a great API. You can learn more about `qdrant` [here](https://qdrant.com/). For `qdrant` setup, please refer to the [qdrant documentation](https://qdrant.com/docs/). In the end, you should have a running `qdrant` instance at `http://localhost:6333`.
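+
+Once the instance is running, a quick optional sanity check from Python is to connect and list its collections (this assumes you have the `qdrant-client` package installed):
+
+```python
+from qdrant_client import QdrantClient
+
+client = QdrantClient("localhost", port=6333)
+print(client.get_collections())  # an empty collection list on a fresh instance
+```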
+
+## Data preparation
+
+Grab a fresh copy of the Luecken2021 data from the [GEO accession](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122). We want the `multiome` data. This dataset contains the binary accessibility matrix, the peaks, and the barcodes. It also conveniently contains the cell-type labels. Pre-trained models require that the data be in `scanpy.AnnData` format and that the `.var` attribute contains `chr`, `start`, and `end` values.
+
+```python
+import scanpy as sc
+
+adata = sc.read_h5ad("path/to/adata.h5ad")
+adata = adata[:, adata.var['feature_types'] == 'ATAC']
+```
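+
+If your `.var` does not already have `chr`, `start`, and `end` columns, you can often derive them from the peak names. Here is a small sketch that assumes peak names of the form `chr1-9776-10668` in the `.var` index:
+
+```python
+# split peak names like "chr1-9776-10668" into chr/start/end columns
+coords = adata.var.index.to_series().str.split("-", expand=True)
+adata.var["chr"] = coords[0].values
+adata.var["start"] = coords[1].astype(int).values
+adata.var["end"] = coords[2].astype(int).values
+```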
+
+## Getting embeddings
+We can easily get embeddings of the dataset using the pre-trained model:
+
+```python
+import numpy as np
+import scanpy as sc
+
+from geniml.scembed import ScEmbed
+
+adata = sc.read_h5ad("path/to/adata.h5ad")
+
+model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
+embeddings = model.encode(adata)
+
+adata.obsm['scembed_X'] = np.array(embeddings)
+```
+
+## Loading the vector database
+With the embeddings, we can now upsert them to `qdrant`. Ensure you have `qdrant_client` installed:
+
+```bash
+pip install qdrant-client
+```
+
+```python
+from qdrant_client import QdrantClient
+from qdrant_client.http.models import Distance, VectorParams, PointStruct
+
+client = QdrantClient("localhost", port=6333)
+
+client.create_collection(
+    collection_name="luecken2021",
+    vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.DOT),
+)
+
+embeddings, cell_types = adata.obsm['scembed_X'], adata.obs['cell_type']
+
+points = []
+for i, (embedding, cell_type) in enumerate(zip(embeddings, cell_types)):
+    points.append(
+        PointStruct(
+            # qdrant point ids must be integers or UUIDs, so keep the barcode in the payload
+            id=i,
+            vector=embedding.tolist(),
+            payload={"cell_type": cell_type, "barcode": adata.obs.index[i]},
+        )
+    )
+
+client.upsert(collection_name="luecken2021", points=points, wait=True)
+```
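+
+As a quick sanity check, you can confirm that the number of points in the collection matches the number of cells you encoded:
+
+```python
+count = client.count(collection_name="luecken2021", exact=True)
+print(count.count, adata.n_obs)  # these two numbers should match
+```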
+
+You should now have a vector database with cell embeddings. In the next tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
diff --git a/docs/geniml/tutorials/pre-tokenization.md b/docs/geniml/tutorials/pre-tokenization.md
new file mode 100644
index 0000000..1998b9f
--- /dev/null
+++ b/docs/geniml/tutorials/pre-tokenization.md
@@ -0,0 +1,44 @@
+# Pre-tokenizing training data
+## Overview
+
+Before we train a model, we must do what is called *pre-tokenization*. Pre-tokenization is the process of converting raw genomic region data into lists of tokens and saving them in a special file format we call `.gtok` (genomic token) files. These are binary files that contain the tokenized data in the form of integers. The `.gtok` files are used directly to train the model. There are several benefits to this, including:
+1. **Speed**: The tokenization process can be slow, especially when you have a lot of data. By pre-tokenizing the data, you only have to do this once. Then, you can use the `.gtok` files to train the model as many times as you want.
+2. **Memory**: The `.gtok` files are much smaller than the original data. This means that you can store more data in memory and train larger models. Moreover, this enables streaming the data from disk, which is useful when you have a lot of data.
+3. **Reproducibility**: By saving the tokenized data, you can ensure that the same data is used to train the model every time. This is important for reproducibility.
+
+## How to pretokenize data
+Pretokenizing data is easy. You can use the built-in tokenizers and utilities in `geniml` to do this. Here is an example of how to pretokenize a bed file:
+
+```python
+from genimtools.utils import write_tokens_to_gtok
+from geniml.tokenization import ITTokenizer
+
+# instantiate a tokenizer
+tokenizer = ITTokenizer("path/to/universe.bed")
+
+# get tokens
+tokens = tokenizer.tokenize("path/to/bedfile.bed")
+write_tokens_to_gtok(tokens.ids, "path/to/bedfile.gtok")
+```
+
+That's it! Now you can use the `.gtok` file to train a model.
+
+## How to use the `.gtok` files
+
+To facilitate working with `.gtok` files, we have some helper classes that can be used to train a model directly from `.gtok` files. For example, you can use the `Region2VecDataset` class to load the `.gtok` files and train a model. See the [training documentation](./train-region2vec.md) for more information.
+
+```python
+from geniml.region2vec.utils import Region2VecDataset
+
+tokens_dir = "path/to/tokens"
+dataset = Region2VecDataset(tokens_dir)
+
+for tokens in dataset:
+    # train the model
+    print(tokens)  # [42, 101, 99, ...]
+```
+
+## Caveats and considerations
+When pretokenizing data, you should consider the following:
+1. Tokens are specific to the universe that the tokenizer was trained on. If you use a different universe, you will get different tokens. This means that you should use the same universe to pretokenize the data as you will use to train the model.
+2. The `.gtok` files are binary files. This means that they are not human-readable. You should keep the original bed files as well as the `.gtok` files. This is important for reproducibility and for debugging.
\ No newline at end of file
diff --git a/docs/geniml/tutorials/train-region2vec.md b/docs/geniml/tutorials/train-region2vec.md
index 2de4f56..5c7a47b 100644
--- a/docs/geniml/tutorials/train-region2vec.md
+++ b/docs/geniml/tutorials/train-region2vec.md
@@ -1,11 +1,13 @@
# How to train a new Region2Vec model
Region2Vec is an unsupervised method for creating embeddings of genomic regions and genomic region sets. This tutorial discusses how to train a new model. To learn how to use a pre-trained model see the [region2vec usage documentation](./use-pretrained-region2vec-model.md).
-## Training data and tokenization
-Training a model requires two things: 1) a set of genomic regions and 2) a tokenizer. The tokenizer is used to convert the genomic regions into tokens. The tokens are then used to train the model. A safe choice for the tokenizer is the tiled hg38 genome. However, you can define your own tokenizer if you want to use a different genome or if you want to use a different tokenization strategy.
+## Training data and universe
+Training a model requires two things: 1) a set of pre-tokenized data and 2) a universe. The universe is the set of regions that serves as the model's vocabulary. The universe is used to create the tokenizer, which converts the raw data into tokens. The universe should be representative of the data that you will be training the model on. For example, if you are training a model on human data, you should use a universe that contains human regions. If you don't have a universe, a safe bet is to use the 1000 tiles hg38 genome.
You can download the 1000 tiles hg38 genome [here](https://big.databio.org/geniml/universes/tiles1000.hg38.bed).
+The pre-tokenized data is a set of `.gtok` files. These are binary files that contain the tokenized data in the form of integers. The `.gtok` files are used directly to train the model. If you have not pre-tokenized your data, see the [pre-tokenization documentation](./pre-tokenization.md).
+
## Training a model
### Instantiate a new model
To begin, create a new model from `Region2VecExModel`.
@@ -27,7 +29,7 @@ logging.basicConfig(level=logging.INFO)
# get the paths to data
universe_path = os.path.expandvars("$RESOURCES/regions/genome_tiles/tiles1000.hg38.bed")
-data_path = os.path.expandvars("$DATA/ChIP-Atlas/hg38/ATAC_seq/")
+data_path = os.path.expandvars("$DATA/ChIP-Atlas/hg38/ATAC_seq/tokens")
model = Region2VecExModel(
tokenizer=ITTokenizer(universe_path),
@@ -35,24 +37,20 @@ model = Region2VecExModel(
```
### Training data
-Create a list of `RegionSet`s by supplying paths to bed files:
+
+The training data is a set of `.gtok` files. You can use the `Region2VecDataset` class to load the `.gtok` files and train the model.
```python
-# get list of all files in the data directory
-# convert to backed RegionSet objects
-files = os.listdir(data_path)
-data = [
- RegionSet(os.path.join(data_path, f), backed=True)
- for f in track(files, total=len(files))
-]
-```
+from geniml.region2vec.utils import Region2VecDataset
-> Note: Setting `backed=True` means that the data will be loaded into memory lazily. This is useful when you have a lot of data and you don't want to load it all into memory at once.
+dataset = Region2VecDataset(data_path)
+```
## Training
+
-Now, simply give the model the list of `RegionSet`s and run the training:
+Now, simply give the model the dataset and run the training:
```python
# train the model
-model.train(data, epochs=100)
+model.train(dataset, epochs=100)
```
You can export your model using the `export` function:
diff --git a/docs/geniml/tutorials/train-scembed-model.md b/docs/geniml/tutorials/train-scembed-model.md
index 60537e0..a18bb80 100644
--- a/docs/geniml/tutorials/train-scembed-model.md
+++ b/docs/geniml/tutorials/train-scembed-model.md
@@ -6,6 +6,7 @@ For this example we are using the [10x Genomics PBMC 10k dataset](https://www.10
## Installation
+
Simply install the parent package `geniml` from PyPi:
```bash
@@ -15,27 +16,9 @@ pip install geniml
Then import `scEmbed` from `geniml`:
```python
-from geniml.scembed import SCEmbed
-```
-
-## Usage
-`scEmbed` is simple to use. Import your `AnnData` object and pass it to `SCEmbed`. `scEmbed` will work regardless of any `var` or `obs` annotations, but it is recommended that you use `scEmbed` after you have already performed some basic filtering and quality control on your data. Further. If you'd like to maintain information about region embeddings, it is recommended that you attach `chr`, `start`, adn `end` annotations to your `AnnData` object before passing it to `scEmbed`.
-
-```python
-import scanpy as sc
-from geniml.scembed import SCEmbed
-
-adata = sc.read_h5ad("path/to/adata.h5ad")
-
-scEmbed = SCEmbed(adata)
-scEmbed.train(
- epochs=100,
-)
-
-cell_embeddings = scEmbed.get_cell_embeddings()
+from geniml.scembed import ScEmbed
```
-
## Data preparation
-`scembed` requires that the input data is in the [AnnData](https://anndata.readthedocs.io/en/latest/) format. Moreover, the `.var` attribute of this object must have `chr`, `start`, and `end` values. The reason is two fold: 1) we can track which vectors belong to which genmomic regions, and 2) region vectors are now reusable. We ned three files: 1) The `barcodes.txt` file, 2) the `peaks.bed` file, and 3) the `matrix.mtx` file. These will be used to create the `AnnData` object. To begin, download the data from the 10x Genomics website:
+`scembed` requires that the input data is in the [AnnData](https://anndata.readthedocs.io/en/latest/) format. Moreover, the `.var` attribute of this object must have `chr`, `start`, and `end` values. The reason is twofold: 1) we can track which vectors belong to which genomic regions, and 2) region vectors are now reusable. We need three files: 1) the `barcodes.txt` file, 2) the `peaks.bed` file, and 3) the `matrix.mtx` file. These will be used to create the `AnnData` object. To begin, download the data from the 10x Genomics website:
@@ -65,57 +48,71 @@ adata.write_h5ad("pbmc.h5ad")
We will use the `pbmc.h5ad` file for downstream work.
-## Training the model
-To train an `scembed` model, just create an instance of the `SCEmbed` model class, define your hyperparamters, and call the `train` method. For this example, we will use the default hyperparameters. The only thing we need to specify is the number of epochs to train for. We will train for 10 epochs:
+## Training
+
+Training an `scEmbed` model requires two key steps: 1) pre-tokenizing the data, and 2) training the model.
+
+### Pre-tokenizing the data
+To learn more about pre-tokenizing the data, see the [pre-tokenization tutorial](./pre-tokenization.md). Pre-tokenization offers many benefits, the two most important being 1) faster training and 2) lower resource requirements. The pre-tokenization process is simple and can be done with a combination of `geniml` and `genimtools` utilities. Here is an example of how to pre-tokenize the 10x Genomics PBMC 10k dataset:
```python
-import logging
+from genimtools.utils import write_tokens_to_gtok
+from geniml.tokenization import ITTokenizer
-import scanpy as sc
-from geniml.scembed import SCEmbed
+import scanpy as sc
+
+adata = sc.read_h5ad("path/to/adata.h5ad")
+tokenizer = ITTokenizer("peaks.bed")
-# if you want to see the training progress
-logging.basicConfig(level=logging.INFO)
+tokens = tokenizer(adata)
-model = SCEmbed(
- use_default_region_names=False # this is to specify that we want to use chr, start, end.
-)
-model.train(adata, epochs=3) # we recomend increasing this to 100
+for i, t in enumerate(tokens):
+    file = f"tokens{i}.gtok"
+    write_tokens_to_gtok(t, file)
```
-Thats it!
-
-## Clustering the cells
-With the model now trained, we can get embeddings of our cells. This occurs in two steps: 1) tokenize the data and 2) encode the cells.
+### Training the model
-**Tokenize:**
+Now that the data is pre-tokenized, we can train the model. The `scEmbed` model is designed to be used with `scanpy`. Here is an example of how to train the model:
```python
-from geniml.tokenization import HardTokenizer
+from geniml.region2vec.utils import Region2VecDataset
-tokenizer = HardTokenizer("peaks.bed") # consensus peak set
+dataset = Region2VecDataset("path/to/tokens")
-region_sets = tokenizer(adata)
+model = ScEmbed(
+    tokenizer=tokenizer,
+)
+model.train(
+    dataset,
+    epochs=100,
+)
```
-**Encode:**
+We can then export the model for upload to huggingface:
+
```python
-cell_embeddings = []
-for region_set in region_sets:
- embedding = np.mean([model(r) for r in region_set if r in model.region2vec, axis=0])
- cell_embeddings.append(embedding)
+model.export("path/to/model")
+```
+
+## Get embeddings of single-cells
+`scEmbed` is simple to use and designed to work with `scanpy`. Here is a simple example of how to load a trained model and get cell embeddings:
-cell_embeddings = np.array(cell_embeddings)
+```python
+# load the model you just exported...
+model = ScEmbed.from_pretrained("path/to/model")
+# ...or load a model from the Hugging Face hub
+model = ScEmbed("databio/scembed-pbmc10k")
-adata.obsm['scembed_X'] = cell_embeddings
+adata = sc.read_h5ad("path/to/adata.h5ad")
+embeddings = model.encode(adata)
+
+adata.obsm["scembed_X"] = embeddings
```
-Now you can use `scanpy` utilities to cluster the cells:
+## Clustering the cells
+With the model trained and cell embeddings obtained, we can cluster the cells using `scanpy` utilities:
```python
sc.pp.neighbors(adata, use_rep="scembed_X")
-sc.tl.leiden(adata)
+sc.tl.leiden(adata) # or louvain
```
And visualize with UMAP
@@ -123,4 +120,4 @@ And visualize with UMAP
```python
sc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")
-```
\ No newline at end of file
+```
diff --git a/docs/geniml/tutorials/use-pretrained-region2vec-model.md b/docs/geniml/tutorials/use-pretrained-region2vec-model.md
index 877e328..e82c55b 100644
--- a/docs/geniml/tutorials/use-pretrained-region2vec-model.md
+++ b/docs/geniml/tutorials/use-pretrained-region2vec-model.md
@@ -8,7 +8,7 @@ To use one of our pre-trained models, simply import the `Region2VecExModel` and
from geniml.io import Region
from geniml.region2vec import Region2VecExModel
-model = Region2VecExModel("databio/r2v-ChIP-atlas-hg38-v2")
+model = Region2VecExModel("databio/r2v-encode-hg38")
```
> Note: We use the `Region2VecExModel` class to load the model because it is an extension of the `Region2Vec` class that comes with its own tokenizer. This ensures that the model and tokenizer are compatible.
@@ -20,8 +20,8 @@ Now we can encode a genomic region or a list of regions:
region = Region("chr1", 160100, 160200)
regions = [region] * 10
-embedding = model.encode(region)
-embeddings = model.encode(regions)
+embedding = model.encode(region) # get one region embedding
+embeddings = model.encode(regions) # or, get many embeddings
```
We can also encode an entire bed file, which will return region embeddings for each region in the file:
@@ -32,24 +32,14 @@ bed = "/path/to/bed/file.bed"
embeddings = model.encode(bed)
```
-> Note: It is not uncommon for a region to not be tokenizable by the tokenizer. This is because the tokenizer was trained on a specific set of regions. If this is the case, the model simply returns `None` for the embedding of that region. If you want to override this behavior, you can set `return_none=False` in the `encode` function. This will skip that region in the return. However, we do not recommend this as it removes the ability to distinguish between regions that are tokenizable and regions that are not. The output shape will no longer match the input shape.
+> Note: It is possible that a region cannot be tokenized by the tokenizer. This is because the tokenizer was instantiated with a specific set of regions. If this is the case, the model simply returns the unknown token (`chrUNK-0:0`). If you find that this is happening often, you may want to ensure that your regions are a good fit for the universe of regions that the model was trained on. The unknown token will indeed have an embedding, but it will not be a meaningful representation of the region.
## Region pooling
-It is often the case that we want a single embedding that represents a set of regions. For example, we may want to encode a patient by taking the average embedding of all the SNPs in the patient's genome. We can do this by using the `pool` argument. Out of the box, we support `mean` and `max` pooling. For example:
+It is often the case that we want a single embedding that represents a set of regions. For example, we may want to encode a patient by taking the average embedding of all the SNPs in the patient's genome. We can do this by simply averaging across the embeddings of the regions:
```python
patient_snps = "/path/to/bed/file.bed"
-embedding = model.encode(patient_snps, pool="mean") # or pool="max"
-```
-
-You can also supply a custom function to do pooling. For example, if you wanted to take the sum of the embeddings, you could do:
-
-```python
-def sum_pooling(embeddings: np.ndarray):
- return np.sum(embeddings, axis=0)
-
-embedding = model.encode(patient_snps, pool=sum_pooling)
-```
-
-`pool` is `False` by default. Setting to `True` defaults to `mean` pooling.
+import numpy as np
+
+embeddings = model.encode(patient_snps)
+patient_embedding = np.mean(embeddings, axis=0)
+```
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 7b899f0..e635b92 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -76,6 +76,8 @@ nav:
- Evaluate embeddings: geniml/tutorials/evaluation.md
- Train region2vec embeddings: geniml/tutorials/region2vec.md
- Train single-cell embeddings: geniml/tutorials/train-scembed-model.md
+ - Load vector database with embeddings: geniml/tutorials/load-qdrant-with-cell-embeddings.md
+ - Cell-type prediction using KNN: geniml/tutorials/cell-type-annotation-with-knn.md
- Tokenization: geniml/tutorials/tokenization.md
- Tokenize a BED file on the command line: geniml/tutorials/cli-tokenization.md
- Create consensus peaks: geniml/tutorials/create-consensus-peaks.md