Best practice for BERTopic.merge_models #1705
Replies: 2 comments 1 reply
-
Even though it does not save cluster and dimensionality reduction parameters, it can still be used for incremental learning if the data chunks are large enough. The only thing to note is that the CountVectorizer is not merged (due to potentially different tokenization schemes).
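For reference, a minimal sketch of that chunked workflow, assuming the documents have already been split into large batches; the chunk variables (docs_chunk_1, docs_chunk_2) and the exact "all-MiniLM-L12-v2" model name are illustrative placeholders, not taken from this thread:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Shared embedding model so the topic embeddings of both models are comparable.
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")

# docs_chunk_1 and docs_chunk_2 are placeholders for two large document batches.
topic_model_1 = BERTopic(embedding_model=embedding_model).fit(docs_chunk_1)
topic_model_2 = BERTopic(embedding_model=embedding_model).fit(docs_chunk_2)

# Topics from later models that are new with respect to the first are added.
# Note: cluster/dimensionality reduction parameters and the CountVectorizer
# are not merged, as discussed above.
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])
```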
-
Thanks, I noticed today while looking into the code that when I save with safetensors those parameters are not saved anyway, so I might have worried too much. I pass an embedding model that is simply a convenience wrapper around SentenceTransformer MiniLM-L12-v2, to push it to the GPU and increase the embedding size a little bit. With the previous version that worked well, but now I always get the warning when calling BERTopic.load(path_to_file, embedding_model=Embedder())
Not critical, I am just wondering what changed.
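A minimal sketch of the kind of wrapper described here, assuming it is implemented as a BERTopic backend; the class name Embedder comes from the comment above, while the exact model name, the max_seq_length tweak, and the device handling are assumptions for illustration:

```python
from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class Embedder(BaseEmbedder):
    """Convenience wrapper around a SentenceTransformer model."""

    def __init__(self):
        super().__init__()
        # Assumption: push the model to the GPU and raise the maximum
        # sequence length slightly (one reading of "increase the embedding size").
        self.model = SentenceTransformer("all-MiniLM-L12-v2", device="cuda")
        self.model.max_seq_length = 384

    def embed(self, documents, verbose=False):
        # BERTopic calls embed() to obtain document embeddings.
        return self.model.encode(documents, show_progress_bar=verbose)

# Re-attach the embedding model when loading a safetensors-serialized model;
# path_to_file is a placeholder from the comment above.
topic_model = BERTopic.load(path_to_file, embedding_model=Embedder())
```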
-
As I understand it, this function merges the topics of several trained models, but not their cluster (HDBSCAN) or PCA model parameters.
As such, am I right to assume it is not recommended to split a large dataset into several batches (models) and combine all of them later (kind of like a makeshift online mode) if I want to predict topics based on all data points?