Handling Dynamic Data Sizes and Incremental Training in BERTopic for Production Use #2119
Replies: 1 comment 2 replies
-
This is indeed quite difficult if you are using the […]. Also, note that if you have trained UMAP on a large amount of data before, you can reuse that trained model for subsequent models, provided you do not expect the new data to differ much from what UMAP was originally trained on.
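As a rough sketch of that trick, assuming only that BERTopic calls `fit` and then `transform` on its dimensionality-reduction component (the wrapper class and the random stand-in embeddings below are illustrative, not part of the library):

```python
import numpy as np
from umap import UMAP
from bertopic import BERTopic

# Stand-in for embeddings of a large, representative corpus
large_embeddings = np.random.rand(10_000, 384)

# Fit UMAP once on the large batch
fitted_umap = UMAP(n_neighbors=15, n_components=5,
                   min_dist=0.0, metric="cosine").fit(large_embeddings)

class PretrainedUMAP:
    """Wrap an already-fitted UMAP so it is not refit on new data."""

    def __init__(self, fitted_model):
        self.fitted_model = fitted_model

    def fit(self, X, y=None):
        # No-op: keep the original projection frozen
        return self

    def transform(self, X):
        return self.fitted_model.transform(X)

# Subsequent models reuse the frozen projection instead of refitting UMAP
topic_model = BERTopic(umap_model=PretrainedUMAP(fitted_umap))
```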
Merging models is what I would typically advise at the moment, as it seems to me the most stable approach. Although you could use the […]. Having said that, the […]. Also note that BERTopic is quite flexible and modular, which means there are a lot of "tricks" (some mentioned above) you can use to improve the output or adjust it to your specific use case. Likewise, this means there is also plenty of room for improvement, so any suggestions are appreciated.
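For reference, a minimal sketch of the merging approach via `BERTopic.merge_models`; the 20 Newsgroups split below is just a stand-in for two real batches of documents, and `min_similarity` is the threshold below which a topic from the second model is treated as new:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpora: two batches of documents
docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data
docs_a, docs_b = docs[:5000], docs[5000:10000]

# Two models trained independently on the two batches
model_a = BERTopic(min_topic_size=15).fit(docs_a)
model_b = BERTopic(min_topic_size=15).fit(docs_b)

# Topics from model_b that fall below the similarity threshold with
# every topic in model_a are appended as new topics; model_a's own
# topics are kept unchanged
merged_model = BERTopic.merge_models([model_a, model_b], min_similarity=0.7)
```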
-
I'm currently implementing BERTopic in a production environment and am facing two significant challenges on which I'd appreciate some guidance:
How can I dynamically adjust BERTopic's parameters to handle varying data sizes efficiently in a production setting? Are there best practices or recommended approaches for making the parameters adapt to different data scales?
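For concreteness, this is the kind of adaptive configuration I have in mind; the scaling rule and thresholds below are placeholders I made up, not values I am confident in:

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

def build_model(n_docs: int) -> BERTopic:
    # Placeholder heuristic: let the minimum cluster size grow with
    # the corpus, so small batches still form topics and large batches
    # do not explode into hundreds of tiny ones
    min_cluster_size = max(10, n_docs // 500)

    umap_model = UMAP(n_neighbors=15, n_components=5,
                      min_dist=0.0, metric="cosine")
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                            metric="euclidean",
                            cluster_selection_method="eom",
                            prediction_data=True)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```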
I need to train the model daily with new data, which essentially requires an incremental approach to topic modeling. My current plan is to save the trained BERTopic model after each training session and then, when new data arrives, merge a model trained on the new data with the previously saved model. Is merging models the best approach for incremental training in BERTopic? If not, what alternative strategies would you recommend for maintaining and updating a topic model with new data arriving daily?
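A minimal sketch of that daily plan; the model path is an illustrative placeholder, and I am assuming `merge_models` leaves the base model's existing topics untouched:

```python
from bertopic import BERTopic

MODEL_PATH = "models/bertopic_latest"  # illustrative location

def daily_update(new_docs: list[str]) -> BERTopic:
    # Model saved after the previous training session
    base_model = BERTopic.load(MODEL_PATH)

    # Fresh model trained on today's documents only
    new_model = BERTopic().fit(new_docs)

    # Topics sufficiently dissimilar from the base model's topics are
    # appended; everything already in the base model is kept as-is
    updated_model = BERTopic.merge_models([base_model, new_model])

    updated_model.save(MODEL_PATH)
    return updated_model
```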