-
You get this error because you are merging multiple BERTopic models with different c-TF-IDF matrices, which currently cannot be combined (although it is theoretically possible). Imagine two models with different vocabularies: when merging them, you would have to find a way to combine those vocabularies as well. Instead, if you still have the data, you can refit on it so that a fitted, shared vocabulary is available.
-
Hi everyone,
I'm currently working with a large dataset (around 7 million documents) and trying to apply BERTopic in batches, with the goal of later merging the results. However, when I try to use the .approximate_distribution function, I run into the following error:
Traceback (most recent call last):
  File "/opt/python/envs/default/lib/python3.8/site-packages/bertopic/_bertopic.py", line 1346, in approximate_distribution
  File "/opt/python/envs/default/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1430, in transform
  File "/opt/python/envs/default/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 510, in _check_vocabulary
NotFittedError: Vocabulary not fitted or provided
It seems like the vocabulary isn't fitted, which is causing the error. Does anyone know why this is happening? Is it possible to merge the batches and still use .approximate_distribution effectively?
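For what it's worth, the error itself can be reproduced with sklearn alone. Per the traceback, approximate_distribution calls the vectorizer's transform, which raises this exact error when no vocabulary has been fitted. A minimal sketch (the example documents are made up):

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

docs = ["some documents to score"]

vectorizer = CountVectorizer()  # never fitted, so it holds no vocabulary
try:
    vectorizer.transform(docs)  # same check sklearn runs in _check_vocabulary
except NotFittedError as err:
    caught = str(err)
print(caught)  # "Vocabulary not fitted or provided"

# Fitting first (or constructing with vocabulary=...) makes transform work.
vectorizer.fit(docs)
matrix = vectorizer.transform(docs)
print(matrix.shape)  # (1, 4)
```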
Thanks in advance!