
BM42 #2075

Closed
wants to merge 3 commits into from
Conversation

MaartenGr (Owner) commented Jul 5, 2024

EDIT: Until this gets resolved, I'm not sure whether I want this implemented at the moment. There seem to be some strange things going on with the evaluation, and although this implementation is not geared towards RAG, it doesn't feel right to continue working on it.

What does this PR do?

The hybrid nature of BERTopic (Bag-of-Words and semantic representations) can be generalized even to the topic representations it creates by using a modified version of BM42. It works as follows:

First, we extract the top n representative documents per topic. To do so, we randomly sample a number of candidate documents per cluster, controlled by the nr_samples parameter.
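A minimal sketch of this sampling step (the DataFrame layout and the helper below are illustrative assumptions, not the PR's actual code):

import pandas as pd

# Illustrative sketch: randomly sample up to `nr_samples` candidate documents
# per topic, assuming a DataFrame with "Document" and "Topic" columns.
def sample_candidates(documents: pd.DataFrame, nr_samples: int = 500, seed: int = 42) -> pd.DataFrame:
    return documents.groupby("Topic", group_keys=False).apply(
        lambda group: group.sample(n=min(nr_samples, len(group)), random_state=seed)
    )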

Then, the top n representative documents are extracted by calculating the c-TF-IDF representation for the candidate documents.
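A sketch of how this selection could look, assuming the candidates are ranked by the cosine similarity between their c-TF-IDF vectors and the topic's c-TF-IDF vector (the function and matrix names are assumptions):

from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sketch: keep the n candidate documents whose c-TF-IDF representation
# is most similar to the topic's c-TF-IDF vector.
def top_representative_docs(candidate_ctfidf, topic_ctfidf, candidate_docs, nr_repr_docs=5):
    sims = cosine_similarity(candidate_ctfidf, topic_ctfidf).flatten()
    top_idx = sims.argsort()[::-1][:nr_repr_docs]
    return [candidate_docs[i] for i in top_idx]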

For all representative documents per topic, their attention matrix is calculated and all weights are summed. These weights are then multiplied by the IDF values of BERTopic's c-TF-IDF algorithm to get the final BM42 representation. The IDF values are either computed from a new c-TF-IDF fitted on the representative documents (recalculate_idf=True) or taken from the c-TF-IDF model that was trained on the entire corpus (recalculate_idf=False).
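A rough sketch of this attention step, using the [CLS] attention of the final transformer layer (the model name, the head averaging, and the subword handling are assumptions rather than the PR's exact implementation):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def bm42_weights(doc: str, idf: dict) -> dict:
    """Sum the [CLS] attention weights per word and scale them by IDF."""
    inputs = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # Attention from the [CLS] token (position 0) in the last layer, averaged over heads
    attention = outputs.attentions[-1][0].mean(dim=0)[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    weights = {}
    for token, weight in zip(tokens, attention.tolist()):
        word = token.lstrip("#")  # crude merge of WordPiece subwords
        if word in idf:
            # Accumulating weight * idf per word is equivalent to summing the
            # attention first and multiplying by IDF afterwards.
            weights[word] = weights.get(word, 0.0) + weight * idf[word]
    return weights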

Thus, the algorithm follows some principles of BM42 but applies a few optimizations to speed up inference, and it uses the IDF values of c-TF-IDF. Usage is straightforward:

from bertopic.representation import BM42Inspired
from bertopic import BERTopic

# Create your representation model
representation_model = BM42Inspired(
    "sentence-transformers/all-MiniLM-L6-v2",
    recalculate_idf=True  # Re-compute the IDF values on the representative documents
)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
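
Fitting and inspecting topics then works the same as with any other representation model (a minimal sketch; docs is assumed to be a list of strings):

# Train the model; docs is assumed to be a list of input documents
topics, probs = topic_model.fit_transform(docs)

# Inspect the BM42-inspired keywords of a topic
topic_model.get_topic(0)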

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

@MaartenGr MaartenGr closed this Jul 6, 2024