Merge pull request #85 from databio/region2vec_updates
Fix a few errors in Region2Vec documentation
nleroy917 authored Dec 11, 2024
2 parents 11c9679 + e864507 commit ba3a55d
Showing 4 changed files with 11 additions and 11 deletions.
4 changes: 2 additions & 2 deletions docs/geniml/tutorials/fine-tune-region2vec-model.md
````diff
@@ -34,11 +34,11 @@ class EnhancerClassifier(nn.Module):
 After instantiating the tokenizer, we can use the model like so:
 ```python
 from geniml.io import Region
-from geniml.tokenization import ITTokenizer
+from geniml.tokenization import TreeTokenizer
 
 r = Region("chr1", 1_000_000, 1_000_500)  # some enhancer region (maybe)
 
-tokenizer = ITTokenizer.from_pretrained("databio/r2v-ChIP-atlas-hg38-v2")
+tokenizer = TreeTokenizer.from_pretrained("databio/r2v-ChIP-atlas-hg38-v2")
 classifier = EnhancerClassifier(model.model)  # get the inner core of the model
 
 x = tokenizer.tokenize(r)
```
````
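The change in this hunk is a pure rename: the tokenizer class previously exported as `ITTokenizer` is now `TreeTokenizer`. For a script that must run against either version of geniml, a small compatibility shim (a hypothetical helper, assuming only that at most one of the two names is importable) can resolve whichever class the installed package exposes:

```python
def resolve_tokenizer_class():
    """Return the first available geniml tokenizer class, preferring the new name.

    Tries TreeTokenizer (current name) before ITTokenizer (old name).
    Returns None if geniml is not installed at all.
    """
    for name in ("TreeTokenizer", "ITTokenizer"):
        try:
            module = __import__("geniml.tokenization", fromlist=[name])
            return getattr(module, name)
        except (ImportError, AttributeError):
            continue
    return None

# Whichever class is available (or None when geniml is absent).
Tokenizer = resolve_tokenizer_class()
```

This is a sketch, not part of the geniml API; in a pinned environment you would simply import `TreeTokenizer` directly as the corrected docs do.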
6 changes: 3 additions & 3 deletions docs/geniml/tutorials/pre-tokenization.md
````diff
@@ -11,14 +11,14 @@ Pretokenizing data is easy. You can use the built-in tokenizers and utilities in
 
 ```python
 from genimtools.utils import write_tokens_to_gtok
-from geniml.tokenization import ITTokenizer
+from geniml.tokenization import TreeTokenizer
 
 # instantiate a tokenizer
-tokenizer = ITTokenizer("path/to/universe.bed")
+tokenizer = TreeTokenizer("path/to/universe.bed")
 
 # get tokens
 tokens = tokenizer.tokenize("path/to/bedfile.bed")
-write_tokens_to_gtok(tokens.ids, "path/to/bedfile.gtok")
+write_tokens_to_gtok("path/to/bedfile.gtok", tokens.to_ids())
 ```
 
 That's it! Now you can use the `.gtok` file to train a model.
````
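The corrected `write_tokens_to_gtok` call puts the output path first and the token ids second. A pure-Python stand-in makes that signature and its round trip concrete; note the encoding here is invented for illustration and is not the real binary `.gtok` format written by genimtools:

```python
import struct

def write_ids_standin(path, ids):
    # Stand-in with the corrected argument order: path first, then token ids.
    # Hypothetical encoding (one little-endian uint32 per id).
    with open(path, "wb") as f:
        for token_id in ids:
            f.write(struct.pack("<I", token_id))

def read_ids_standin(path):
    # Read back the ids written by write_ids_standin.
    with open(path, "rb") as f:
        data = f.read()
    return [struct.unpack_from("<I", data, off)[0] for off in range(0, len(data), 4)]
```

With the real library you would call `write_tokens_to_gtok("path/to/bedfile.gtok", tokens.to_ids())` exactly as the corrected docs show.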
4 changes: 2 additions & 2 deletions docs/geniml/tutorials/train-region2vec.md
````diff
@@ -20,7 +20,7 @@ import os
 from multiprocessing import cpu_count
 
 from geniml.io import RegionSet
-from geniml.tokenization import ITTokenizer
+from geniml.tokenization import TreeTokenizer
 from geniml.region2vec import Region2VecExModel
 from rich.progress import track
@@ -32,7 +32,7 @@ universe_path = os.path.expandvars("$RESOURCES/regions/genome_tiles/tiles1000.hg
 data_path = os.path.expandvars("$DATA/ChIP-Atlas/hg38/ATAC_seq/tokens")
 
 model = Region2VecExModel(
-    tokenizer=ITTokenizer(universe_path),
+    tokenizer=TreeTokenizer(universe_path),
 )
 ```
````
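The tutorial builds its input paths with `os.path.expandvars`, which substitutes `$VAR` references from the environment. A self-contained sketch, using a made-up `RESOURCES` value and an illustrative file name rather than the tutorial's real paths:

```python
import os

# Illustrative value only; in the tutorial $RESOURCES points at a real
# resources directory on the training machine.
os.environ["RESOURCES"] = "/tmp/resources"

# $RESOURCES is expanded from the environment; unknown variables are left as-is.
universe_path = os.path.expandvars("$RESOURCES/regions/universe.bed")
```

This is why the tutorial can ship one script across machines: only the environment variables change, not the code.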
8 changes: 4 additions & 4 deletions docs/geniml/tutorials/train-scembed-model.md
````diff
@@ -57,16 +57,16 @@ To learn more about pre-tokenizing the data, see the [pre-tokenization tutorial]
 
 ```python
 from genimtools.utils import write_tokens_to_gtok
-from geniml.tokenization import ITTokenizer
+from geniml.tokenization import TreeTokenizer
 
 adata = sc.read_h5ad("path/to/adata.h5ad")
-tokenizer = ITTokenizer("peaks.bed")
+tokenizer = TreeTokenizer("peaks.bed")
 
 tokens = tokenizer(adata)
 
 for i, t in enumerate(tokens):
-    file = f"tokens{i}.gtok"
-    write_tokens_to_gtok(t, file)
+    filename = f"tokens{i}.gtok"
+    write_tokens_to_gtok(filename, t.to_ids())
 ```
 
 ### Training the model
````
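The corrected loop writes one `.gtok` file per cell, passing the path first and the cell's token ids (via `to_ids()`) second. The pattern can be sketched with stand-ins; both `CellTokensStandin` and `write_gtok_standin` are hypothetical substitutes for geniml's tokens object and genimtools' writer, using a trivial text encoding instead of the real binary format:

```python
import os

class CellTokensStandin:
    """Hypothetical stand-in for one cell's tokenization result.

    to_ids mirrors the tokens.to_ids() call used in the corrected tutorial.
    """
    def __init__(self, ids):
        self._ids = ids

    def to_ids(self):
        return list(self._ids)

def write_gtok_standin(path, ids):
    # Stand-in for write_tokens_to_gtok (path first, ids second); writes a
    # comma-separated text encoding rather than the real binary .gtok format.
    with open(path, "w") as f:
        f.write(",".join(str(i) for i in ids))

def write_all_cells(cells, outdir):
    # One .gtok file per cell, named tokens0.gtok, tokens1.gtok, ...
    paths = []
    for i, t in enumerate(cells):
        filename = os.path.join(outdir, f"tokens{i}.gtok")
        write_gtok_standin(filename, t.to_ids())
        paths.append(filename)
    return paths
```

The per-index file naming matters downstream: the training step can then stream cells back one file at a time instead of holding the whole AnnData tokenization in memory.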
