Add Model2Vec as an embedding backend #2245
Open
+240
−15
What does this PR do?
Add Model2Vec as an incredibly fast but still quite accurate embedding backend.
Usage is straightforward. First install model2vec with `pip install model2vec`.
Then, you can load in any of their models and pass it to BERTopic like so:
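The code block that followed this sentence did not survive on this page. As a minimal sketch of what loading a model and passing it to BERTopic could look like: the model name `minishlab/potion-base-8M` is an assumption (any Model2Vec model could be substituted), and `build_topic_model` is a hypothetical helper used only for illustration.

```python
def build_topic_model(model_name: str = "minishlab/potion-base-8M"):
    """Load a Model2Vec static model and pass it to BERTopic.

    `model_name` is an assumed example; swap in whichever Model2Vec
    model you want to use.
    """
    # Imported inside the function so the sketch can be defined even
    # when bertopic / model2vec are not installed yet.
    from bertopic import BERTopic
    from model2vec import StaticModel

    # Load the pretrained static embedding model
    embedding_model = StaticModel.from_pretrained(model_name)

    # BERTopic uses the model as its embedding backend
    return BERTopic(embedding_model=embedding_model)
```

With both packages installed, `topic_model = build_topic_model()` followed by `topic_model.fit_transform(docs)` on your own list of documents would then run the usual BERTopic pipeline.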
Distillation
These models are extremely versatile and can be distilled from existing embedding models (like those compatible with `sentence-transformers`). The distillation process doesn't require a vocabulary (it falls back to the tokenizer's vocabulary) but can benefit from having one, so you can use the vocabulary from your input documents to distill a model yourself.
Doing so requires some additional dependencies of model2vec, which can be installed with `pip install model2vec[distill]`.
To then distill common embedding models, you need to import the `Model2VecBackend` from BERTopic. You can also choose a custom vectorizer for creating the vocabulary and define custom arguments for the distillation process:
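The original snippets are missing here, so the following is only a sketch of both variants. It assumes the backend accepts `distill`, `distill_kwargs`, and `distill_vectorizer` arguments and that `pca_dims` is forwarded to model2vec's distillation. The base model name and the two helper functions are hypothetical choices for illustration, so check them against the code in this PR.

```python
BASE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed example model


def build_distilled_backend(base_model: str = BASE_MODEL):
    """Backend configured to distill `base_model` with default settings."""
    # Lazy import so the sketch can be defined without bertopic installed
    from bertopic.backend import Model2VecBackend

    # distill=True asks the backend to distill rather than load a static model
    return Model2VecBackend(base_model, distill=True)


def build_custom_distilled_backend(base_model: str = BASE_MODEL):
    """Same as above, with a custom vectorizer and distillation arguments."""
    from bertopic.backend import Model2VecBackend
    from sklearn.feature_extraction.text import CountVectorizer

    return Model2VecBackend(
        base_model,
        distill=True,
        # Assumed to be passed through to model2vec's distillation step
        distill_kwargs={"pca_dims": 256},
        # Vectorizer used to build the vocabulary from the input documents
        distill_vectorizer=CountVectorizer(ngram_range=(1, 2)),
    )
```

Either backend would then be passed to BERTopic through `BERTopic(embedding_model=...)`, in the same way as a pretrained Model2Vec model.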
Before submitting