Handling Dynamic Data Sizes and Incremental Training in BERTopic for Production Use #2119
Replies: 1 comment 2 replies
-
This is indeed quite difficult if you are using the […]. Also, note that if you have trained UMAP on a large amount of data before, you can reuse that trained model for subsequent models, provided you do not expect the new data to differ much from what UMAP was originally trained on.
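As a rough sketch of that trick, assuming only that BERTopic calls `fit` and then `transform` on its dimensionality-reduction component (the wrapper class and the random stand-in embeddings below are illustrative, not part of the library):

```python
import numpy as np
from umap import UMAP
from bertopic import BERTopic

# Stand-in for embeddings of a large, representative corpus
large_embeddings = np.random.rand(10_000, 384)

# Fit UMAP once on the large batch
fitted_umap = UMAP(n_neighbors=15, n_components=5,
                   min_dist=0.0, metric="cosine").fit(large_embeddings)

class PretrainedUMAP:
    """Wrap an already-fitted UMAP so it is not refit on new data."""

    def __init__(self, fitted_model):
        self.fitted_model = fitted_model

    def fit(self, X, y=None):
        # No-op: keep the original projection frozen
        return self

    def transform(self, X):
        return self.fitted_model.transform(X)

# Subsequent models reuse the frozen projection instead of refitting UMAP
topic_model = BERTopic(umap_model=PretrainedUMAP(fitted_umap))
```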
Merging models is what I would typically advise at the moment, as it seems to me the most stable approach. Although you could use the […]. Having said that, the […]. Also note that BERTopic is quite flexible and modular, which means there are a lot of "tricks" (some mentioned above) you can use to improve the output or adjust it to your specific use case. Likewise, this means there is also plenty of room for improvement, so any suggestions are appreciated.
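For reference, a minimal sketch of the merging approach via `BERTopic.merge_models`; the 20 Newsgroups split below is just a stand-in for two real batches of documents, and `min_similarity` is the threshold below which a topic from the second model is treated as new:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpora: two batches of documents
docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data
docs_a, docs_b = docs[:5000], docs[5000:10000]

# Two models trained independently on the two batches
model_a = BERTopic(min_topic_size=15).fit(docs_a)
model_b = BERTopic(min_topic_size=15).fit(docs_b)

# Topics from model_b that fall below the similarity threshold with
# every topic in model_a are appended as new topics; model_a's own
# topics are kept unchanged
merged_model = BERTopic.merge_models([model_a, model_b], min_similarity=0.7)
```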
-
I'm currently implementing BERTopic in a production environment and am facing two significant challenges on which I'd appreciate some guidance:
How can I dynamically adjust BERTopic's parameters to handle varying data sizes efficiently in a production setting? Are there best practices or recommended approaches for making the parameters adapt to different data scales?
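For concreteness, this is the kind of adaptive configuration I have in mind; the scaling rule and thresholds below are placeholders I made up, not values I am confident in:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

def build_model(n_docs: int) -> BERTopic:
    # Placeholder heuristic: let the minimum cluster size grow with
    # the corpus, so small batches still form topics and large batches
    # do not explode into hundreds of tiny ones
    min_cluster_size = max(10, n_docs // 500)

    umap_model = UMAP(n_neighbors=15, n_components=5,
                      min_dist=0.0, metric="cosine")
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                            metric="euclidean",
                            cluster_selection_method="eom",
                            prediction_data=True)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```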
I need to train the model daily with new data, which essentially requires an incremental approach to topic modeling. My current plan is to save the trained BERTopic model after each training session and then, when new data arrives, merge a model trained on the new data with the previously saved model. Is merging models the best approach for incremental training in BERTopic? If not, what alternative strategies would you recommend for maintaining and updating a topic model with new data arriving daily?
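A minimal sketch of that daily plan; the model path is an illustrative placeholder, and I am assuming `merge_models` leaves the base model's existing topics untouched:

```python
from bertopic import BERTopic

MODEL_PATH = "models/bertopic_latest"  # illustrative location

def daily_update(new_docs: list[str]) -> BERTopic:
    # Model saved after the previous training session
    base_model = BERTopic.load(MODEL_PATH)

    # Fresh model trained on today's documents only
    new_model = BERTopic().fit(new_docs)

    # Topics sufficiently dissimilar from the base model's topics are
    # appended; everything already in the base model is kept as-is
    updated_model = BERTopic.merge_models([base_model, new_model])

    updated_model.save(MODEL_PATH)
    return updated_model
```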