
update README and workflow!
jxmorris12 committed Feb 22, 2024
1 parent 768a90b commit 0ccaca1
Showing 3 changed files with 52 additions and 2 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -5,9 +5,9 @@ name: Test with PyTest
 
 on:
   push:
-    branches: [ master ]
+    branches: [ main ]
   pull_request:
-    branches: [ master ]
+    branches: [ main ]
 
 jobs:
   build:
48 changes: 48 additions & 0 deletions README.md
@@ -0,0 +1,48 @@
# bm25-pt

A minimal BM25 implementation using PyTorch. (Also uses [HuggingFace tokenizers](https://huggingface.co/docs/tokenizers/en/index) behind the scenes to tokenize text.)

```bash
pip install bm25_pt
```

## Usage


```python
from bm25_pt import BM25

bm25 = BM25()
corpus = [
"A high weight in tf–idf is reached by a high term frequency",
"(in the given document) and a low document frequency of the term",
"in the whole collection of documents; the weights hence tend to filter",
"out common terms. Since the ratio inside the idf's log function is always",
"greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal",
"to 0. As a term appears in more documents, the ratio inside the logarithm approaches",
"1, bringing the idf and tf–idf closer to 0.",
]
bm25.index(corpus)

queries = ["weights", "ratio logarithm"]
doc_scores = bm25.score_batch(queries)
print(doc_scores)
>> tensor([[0.0000, 0.0000, 1.4238, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 1.5317, 0.0000, 2.0203, 0.0000]])
```

You can also call `score()` with a single query to score one query at a time.
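Under the hood, BM25 is the standard Okapi ranking function. For reference, here is a minimal plain-Python sketch of that formula (the parameter defaults `k1=1.5`, `b=0.75` and the function name are assumptions for illustration, not necessarily the library's implementation):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query.
    Minimal sketch, not the library's actual (PyTorch-based) implementation."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        n_t = sum(term in d for d in corpus)  # number of docs containing term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        tf = doc.count(term)                  # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["high", "weight", "in", "tf", "idf"],
          ["low", "document", "frequency"],
          ["common", "terms", "filter"]]
print(bm25_score(["weight"], corpus[0], corpus))
```

Documents that never contain a query term contribute nothing to its score, which is why the score matrix above is mostly zeros.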

### Use your own tokenizer

You can use your own tokenizer if you want. Simply provide your tokenizer to the `BM25` constructor:

```python
from bm25_pt import BM25
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("t5-base")
bm25 = BM25(tokenizer=tokenizer)
```

then proceed to use the library as normal.
2 changes: 2 additions & 0 deletions bm25_pt/bm25.py
@@ -5,6 +5,7 @@
 import torch
 import transformers
 
+
 def documents_to_bags(docs: torch.Tensor, vocab_size: int) -> torch.sparse.Tensor:
     num_docs, seq_length = docs.shape
     batch_idxs = torch.arange(num_docs)[:, None].expand(-1, seq_length)
@@ -15,6 +16,7 @@ def documents_to_bags(docs: torch.Tensor, vocab_size: int) -> torch.sparse.Tensor:
     vals = (docs > 0).int().flatten()
     return torch.sparse_coo_tensor(idxs, vals, size=(num_docs, vocab_size)).coalesce()
 
+
 class TokenizedBM25:
     k1: float
     b: float
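For intuition, `documents_to_bags` turns a batch of token-id sequences into per-document bags of token counts (id 0 is treated as padding; `coalesce()` sums duplicate entries). A dense plain-Python sketch of the same idea, not the library's sparse implementation:

```python
def documents_to_bags_dense(docs, vocab_size):
    """Dense sketch of documents_to_bags: count each token id per document,
    skipping the padding id 0 (the sparse version sums duplicates via coalesce)."""
    bags = []
    for doc in docs:
        counts = [0] * vocab_size
        for tok in doc:
            if tok > 0:  # skip padding
                counts[tok] += 1
        bags.append(counts)
    return bags

# Two documents of token ids over a vocabulary of size 5.
print(documents_to_bags_dense([[1, 2, 2, 0], [3, 3, 3, 4]], 5))
# → [[0, 1, 2, 0, 0], [0, 0, 0, 3, 1]]
```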
