EDIT: Until this gets resolved, I'm not sure whether I want this implemented at the moment. There seem to be some strange things going on with the evaluation and, although this implementation is not geared towards RAG, it doesn't feel right to continue working on this.
What does this PR do?
The hybrid nature of BERTopic (Bag-of-Words and semantic representations) can be extended to the topic representations it creates by using a modified version of BM42. It works as follows:
First, we extract the top n representative documents per topic. To do so, we randomly sample a number of candidate documents per cluster, which is controlled by the `nr_samples` parameter. Then, the top n representative documents are extracted by calculating the c-TF-IDF representation of the candidate documents.
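To make that step concrete, below is a rough sketch of what it boils down to; this is not the code in this PR, and the `vectorizer`, `ctfidf_model`, and topic-to-row indexing are placeholders for illustration:

```python
# Rough sketch of the representative-document step. `nr_samples` comes from
# the description above; everything else is an assumed placeholder.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def extract_representative_docs(documents, topics, c_tf_idf, vectorizer,
                                ctfidf_model, nr_samples=500, nr_repr_docs=5,
                                seed=42):
    """Per topic: sample candidate documents, compute their c-TF-IDF
    representation, and keep the documents closest to the topic's vector."""
    rng = np.random.default_rng(seed)
    representative_docs = {}
    for topic in sorted(set(topics)):
        candidate_idx = [i for i, t in enumerate(topics) if t == topic]
        if len(candidate_idx) > nr_samples:
            candidate_idx = rng.choice(candidate_idx, nr_samples, replace=False)
        candidates = [documents[i] for i in candidate_idx]

        # c-TF-IDF representation of the candidates, compared to the topic's row
        candidate_ctfidf = ctfidf_model.transform(vectorizer.transform(candidates))
        sims = cosine_similarity(candidate_ctfidf, c_tf_idf[[topic]]).ravel()
        top = np.argsort(sims)[::-1][:nr_repr_docs]
        representative_docs[topic] = [candidates[i] for i in top]
    return representative_docs
```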
For all representative documents per topic, their attention matrices are calculated and all weights are summed. The summed weights are then multiplied by the IDF values of BERTopic's c-TF-IDF algorithm to get the final BM42 representation. These IDF values are either extracted by fitting a new c-TF-IDF model on the representative documents (`recalculate_idf=True`) or taken from the c-TF-IDF model that was trained on the entire corpus (`recalculate_idf=False`).
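As a sketch of the weighting itself, assuming a WordPiece-based sentence-transformer and an `idf` lookup taken from the c-TF-IDF model (again, a simplification rather than the exact implementation):

```python
from collections import defaultdict

import torch
from transformers import AutoModel, AutoTokenizer

# Model choice and subword handling are simplifications for illustration;
# the PR wires this logic into BERTopic's representation models.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", output_attentions=True
)


def bm42_weights(representative_docs, idf):
    """Sum the attention each token receives from [CLS] in the last layer,
    then multiply the summed weights by the c-TF-IDF IDF values."""
    weights = defaultdict(float)
    for doc in representative_docs:
        inputs = tokenizer(doc, return_tensors="pt", truncation=True)
        with torch.no_grad():
            attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
        # Attention from [CLS] (position 0) to every token, averaged over heads
        cls_attention = attentions[-1][0].mean(dim=0)[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        for token, attention in zip(tokens, cls_attention.tolist()):
            # Crude subword handling: strip the WordPiece prefix rather than
            # merging pieces with the preceding word
            weights[token.lstrip("#")] += attention
    # Multiply summed attention by IDF; tokens outside the c-TF-IDF vocabulary
    # (e.g. [CLS], [SEP]) are dropped
    return {word: weight * idf[word] for word, weight in weights.items() if word in idf}
```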
Thus, the algorithm follows some of the principles of BM42 but applies a few optimizations to speed up inference and uses the IDF values of c-TF-IDF. Usage is straightforward:
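A minimal sketch (the `BM42` class name and its import path are placeholders based on this description; `nr_samples` and `recalculate_idf` are the parameters discussed above, shown here with illustrative values):

```python
from sklearn.datasets import fetch_20newsgroups

from bertopic import BERTopic
from bertopic.representation import BM42  # placeholder name for this PR's representation

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

representation_model = BM42(nr_samples=500, recalculate_idf=False)
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```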
Before submitting