EDIT: Until this gets resolved, I'm not sure whether I want this implemented at the moment. There seem to be some strange things going on with the evaluation and, although this implementation is not geared towards RAG, it doesn't feel right to continue working on this.
What does this PR do?
The hybrid nature of BERTopic (Bag-of-Words and semantic representations) can be extended to the topic representations it creates by using a modified version of BM42. It works as follows:
First, we extract the top n representative documents per topic. To do so, we randomly sample a number of candidate documents per cluster, which is controlled by the `nr_samples` parameter. Then, the top n representative documents are extracted by calculating the c-TF-IDF representation of the candidate documents.
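To make that step concrete, below is a rough sketch of what it boils down to; this is not the code in this PR, and the `vectorizer`, `ctfidf_model`, and topic-to-row indexing are placeholders for illustration:

```python
# Rough sketch of the representative-document step. `nr_samples` comes from
# the description above; everything else is an assumed placeholder.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def extract_representative_docs(documents, topics, c_tf_idf, vectorizer,
                                ctfidf_model, nr_samples=500, nr_repr_docs=5,
                                seed=42):
    """Per topic: sample candidate documents, compute their c-TF-IDF
    representation, and keep the documents closest to the topic's vector."""
    rng = np.random.default_rng(seed)
    representative_docs = {}
    for topic in sorted(set(topics)):
        candidate_idx = [i for i, t in enumerate(topics) if t == topic]
        if len(candidate_idx) > nr_samples:
            candidate_idx = rng.choice(candidate_idx, nr_samples, replace=False)
        candidates = [documents[i] for i in candidate_idx]

        # c-TF-IDF representation of the candidates, compared to the topic's row
        candidate_ctfidf = ctfidf_model.transform(vectorizer.transform(candidates))
        sims = cosine_similarity(candidate_ctfidf, c_tf_idf[[topic]]).ravel()
        top = np.argsort(sims)[::-1][:nr_repr_docs]
        representative_docs[topic] = [candidates[i] for i in top]
    return representative_docs
```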
For all representative documents per topic, their attention matrices are calculated and all weights are summed. The summed weights are then multiplied by the IDF values of BERTopic's c-TF-IDF algorithm to get the final BM42 representation. These IDF values are either extracted by fitting a new c-TF-IDF model on the representative documents (`recalculate_idf=True`) or taken from the c-TF-IDF model that was trained on the entire corpus (`recalculate_idf=False`).
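As a sketch of the weighting itself, assuming a WordPiece-based sentence-transformer and an `idf` lookup taken from the c-TF-IDF model (again, a simplification rather than the exact implementation):

```python
from collections import defaultdict

import torch
from transformers import AutoModel, AutoTokenizer

# Model choice and subword handling are simplifications for illustration;
# the PR wires this logic into BERTopic's representation models.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", output_attentions=True
)


def bm42_weights(representative_docs, idf):
    """Sum the attention each token receives from [CLS] in the last layer,
    then multiply the summed weights by the c-TF-IDF IDF values."""
    weights = defaultdict(float)
    for doc in representative_docs:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        with torch.no_grad():
            attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
        # Attention from [CLS] (position 0) to every token, averaged over heads
        cls_attention = attentions[-1][0].mean(dim=0)[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        for token, attention in zip(tokens, cls_attention.tolist()):
            # Crude subword handling: strip the WordPiece prefix rather than
            # merging pieces with the preceding word
            weights[token.lstrip("#")] += attention
    # Multiply summed attention by IDF; tokens outside the c-TF-IDF vocabulary
    # (e.g. [CLS], [SEP]) are dropped
    return {word: weight * idf[word] for word, weight in weights.items() if word in idf}
```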
Thus, the algorithm follows some of the principles of BM42 but applies a few optimizations to speed up inference and uses the IDF values of c-TF-IDF. Usage is straightforward:
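A minimal sketch (the `BM42` class name and its import path are placeholders based on this description; `nr_samples` and `recalculate_idf` are the parameters discussed above, shown here with illustrative values):

```python
from sklearn.datasets import fetch_20newsgroups

from bertopic import BERTopic
from bertopic.representation import BM42  # placeholder name for this PR's representation

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

representation_model = BM42(nr_samples=500, recalculate_idf=False)
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```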
Before submitting