-
You get this error because you are merging multiple BERTopic models with different c-TF-IDF matrices, which currently cannot be combined (although it is theoretically possible). Imagine two models with different vocabularies: when merging them, you would have to find a way to combine those vocabularies as well. Instead, if you still have the data, you can refit on it so that a fitted, shared vocabulary is available.
-
Hi everyone,
I'm currently working with a large dataset (around 7 million documents) and trying to apply BERTopic in batches, with the goal of later merging the results. However, when I try to use the .approximate_distribution function, I run into the following error:
Traceback (most recent call last):
  File "/opt/python/envs/default/lib/python3.8/site-packages/bertopic/_bertopic.py", line 1346, in approximate_distribution
  File "/opt/python/envs/default/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1430, in transform
  File "/opt/python/envs/default/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 510, in _check_vocabulary
NotFittedError: Vocabulary not fitted or provided
It seems like the vocabulary isn't fitted, which is causing the error. Does anyone know why this is happening? Is it possible to merge the batches and still use .approximate_distribution effectively?
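For what it's worth, the error itself can be reproduced with sklearn alone. Per the traceback, approximate_distribution calls the vectorizer's transform, which raises this exact error when no vocabulary has been fitted. A minimal sketch (the example documents are made up):

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

docs = ["some documents to score"]

vectorizer = CountVectorizer()  # never fitted, so it holds no vocabulary
try:
    vectorizer.transform(docs)  # same check sklearn runs in _check_vocabulary
except NotFittedError as err:
    caught = str(err)
print(caught)  # "Vocabulary not fitted or provided"

# Fitting first (or constructing with vocabulary=...) makes transform work.
vectorizer.fit(docs)
matrix = vectorizer.transform(docs)
print(matrix.shape)  # (1, 4)
```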
Thanks in advance!