Best practice for BERTopic.merge_models #1705
Replies: 2 comments 1 reply
-
Even though it does not save cluster and dimensionality reduction parameters, it can still be used for incremental learning if the data chunks are large enough. The only thing to note is that the CountVectorizer is not merged (due to potentially different tokenization schemes).
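For reference, a minimal sketch of that chunked workflow, assuming the documents have already been split into large batches; the chunk variables (docs_chunk_1, docs_chunk_2) and the exact "all-MiniLM-L12-v2" model name are illustrative placeholders, not taken from this thread:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Shared embedding model so the topic embeddings of both models are comparable.
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")

# docs_chunk_1 and docs_chunk_2 are placeholders for two large document batches.
topic_model_1 = BERTopic(embedding_model=embedding_model).fit(docs_chunk_1)
topic_model_2 = BERTopic(embedding_model=embedding_model).fit(docs_chunk_2)

# Topics from later models that are new with respect to the first are added.
# Note: cluster/dimensionality reduction parameters and the CountVectorizer
# are not merged, as discussed above.
merged_model = BERTopic.merge_models([topic_model_1, topic_model_2])
```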
-
Thanks, I noticed today while looking into the code that when I save with safetensors those parameters are not saved anyway, so I might have worried too much. I pass an embedding model that is simply a convenience wrapper around SentenceTransformer MiniLM-L12-v2, to push it to the GPU and increase the embedding size a little bit. With the previous version that worked well, but now I always get the warning when calling BERTopic.load(path_to_file, embedding_model=Embedder())
Not critical, I am just wondering what changed.
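A minimal sketch of the kind of wrapper described here, assuming it is implemented as a BERTopic backend; the class name Embedder comes from the comment above, while the exact model name, the max_seq_length tweak, and the device handling are assumptions for illustration:

```python
from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class Embedder(BaseEmbedder):
    """Convenience wrapper around a SentenceTransformer model."""

    def __init__(self):
        super().__init__()
        # Assumption: push the model to the GPU and raise the maximum
        # sequence length slightly (one reading of "increase the embedding size").
        self.model = SentenceTransformer("all-MiniLM-L12-v2", device="cuda")
        self.model.max_seq_length = 384

    def embed(self, documents, verbose=False):
        # BERTopic calls embed() to obtain document embeddings.
        return self.model.encode(documents, show_progress_bar=verbose)

# Re-attach the embedding model when loading a safetensors-serialized model;
# path_to_file is a placeholder from the comment above.
topic_model = BERTopic.load(path_to_file, embedding_model=Embedder())
```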
-
As I understand it, this function merges the topics of several trained models, but not their cluster (HDBSCAN) or PCA model parameters.
As such, am I right to assume it is not recommended to split a large dataset into several batches (models) and combine all of them later (kind of like a makeshift online mode) if I want to predict topics based on all data points?