-
I'm using online topic modeling with River and update the topic model with 1k documents per batch. I've noticed instances where BERTopic assigns a document to a different topic even though an extremely similar document and topic already exist. Below are some examples where different news headlines about Presley's death were put into different topics. I'm not sure how to approach this and would greatly appreciate any guidance on how to steer the model better. Huge fan of BERTopic!

| Topic Name |
| --- |
| 73_daughter elvis_daughter elvis presley_abuse_lord |
| 87_arrest lisa marie_arrest lisa_cardiac arrest lisa_marie presley died |
| 61_relationships_passion_introverts_raider |
| 45_crying_slut_tears_im crying |
| 51_sanha_dogs_rampal ji maharaj_sant |
| 70_parker_wes christian_singer daughter elvis_baby shark |
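For reference, here is a minimal sketch of the kind of online setup described above, following the River-wrapper pattern from BERTopic's online topic modeling documentation. The wrapper name, the DBSTREAM choice, the parameter values, and the sample batches are all illustrative assumptions, not the actual code from the question.

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from river import cluster, stream
from sklearn.decomposition import IncrementalPCA


class RiverWrapper:
    """Wrap a River clustering model so BERTopic can call partial_fit on it."""

    def __init__(self, model):
        self.model = model

    def partial_fit(self, embeddings):
        # Learn each reduced embedding one by one, then predict its cluster.
        labels = []
        for embedding, _ in stream.iter_array(embeddings):
            self.model.learn_one(embedding)
            labels.append(self.model.predict_one(embedding))
        self.labels_ = labels
        return self


# Illustrative stand-in for the real 1k-document batches.
doc_batches = [
    ["Lisa Marie Presley died after a cardiac arrest.",
     "Elvis Presley's daughter has died at 54.",
     "Singer Lisa Marie Presley, Elvis's only child, has died."],
    ["Stock markets closed higher on Friday.",
     "The central bank left interest rates unchanged.",
     "Tech shares led the market rally."],
]

cluster_model = RiverWrapper(cluster.DBSTREAM())
topic_model = BERTopic(
    # n_components must not exceed the batch size; real 1k-document
    # batches allow larger values.
    umap_model=IncrementalPCA(n_components=2),
    hdbscan_model=cluster_model,
    vectorizer_model=OnlineCountVectorizer(),
)

for batch in doc_batches:
    topic_model.partial_fit(batch)

# The underlying River model can be inspected after each batch,
# e.g. DBSTREAM's cluster centers.
print(cluster_model.model.centers)
```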
-
We are facing similar problems... Remark: our experiment was done with 9K documents, each being about one page of text.

First, we observed that BERTopic is not reproducible: sending a document several times brings back different results. Searching on this, we found documentation/discussions explaining that UMAP includes random behaviour in its functioning. We fixed this by enforcing all vectors to be processed one by one (especially in the training process) with an identical random sequence init (seed). We got a system where, when sending a document already used in the training process, we properly get it back with a cosine similarity of 1 and a Euclidean distance of 0.

We then observed that BERTopic is hyper-sensitive to the input content: when changing a single word in a document already used for the training, the nearest document (by both cosine and Euclidean distance) becomes a document with a lot of differences (perhaps 1/3 of the document is different). Reading somewhere that HDBSCAN does some approximations in its calculations, we then tried to replace UMAP+HDBSCAN with PCA+KMeans... same hyper-sensitivity.

This is really not the behaviour we expected from such a tool: how can BERTopic group similar documents together if its vectorisation is hyper-sensitive to very small changes? Bug?
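For what it's worth, a minimal sketch of the seeding fix described above, assuming the standard approach from the BERTopic FAQ: pass a fixed random_state to UMAP before handing it to BERTopic. The parameter values are illustrative; note that fixing the seed disables UMAP's parallelism, so training becomes slower.

```python
from bertopic import BERTopic
from umap import UMAP

# Fixing random_state makes the dimensionality reduction deterministic,
# so repeated runs on the same data give the same topics.
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
topic_model = BERTopic(umap_model=umap_model)
```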
-
@vantubbe In part, it depends on the sub-models you used to perform the online topic modeling, such as the embedding model, dimensionality reduction, clustering, etc. All of those can greatly influence how the topics are clustered. Perhaps the embedding model is trained to focus on a specific part of the text and less on the context, or perhaps the dimensionality reduction algorithm needs more or less …

It is difficult to say without actually seeing the code and knowing which sub-models are being used. Having said that, it might be worthwhile to check out some of the parameter tunings here. Also, you could use the .clusters attribute in the River algorithm to check out some of the clusters …

@EtienneAb3d I might be mistaken here, but based on the sub-models that you mention, we are not talking about online topic modeling, right?
You can find a bit more about UMAP and this process in the FAQ here. You can also find a link there to the UMAP documentation, where this is discussed in a bit more detail.
It is difficult to say without seeing the actual code, since the hyperparameters of the sub-models can influence this greatly, but it may also depend on the chosen embedding model and the number of topics that are being generated. For example, a word embedding model might place more emphasis on single words than a sentence-transformer model would. Moreover, if you generate many topics, which you can control with k-Means or …
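As a rough illustration of that last point, here is a minimal sketch of controlling the number of topics directly by swapping k-Means in for HDBSCAN; the n_clusters value is an arbitrary assumption.

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# Any model exposing fit/predict can replace HDBSCAN; k-Means pins the
# number of topics to n_clusters (and produces no outlier topic).
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```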
It may not necessarily be that the vectorization (at least if we are talking about the BoW step) is hyper-sensitive; the process before that may be the culprit here. The topic representation step is mostly influenced by how the documents are brought together, and to a lesser extent by changing a single word. My guess would be that there is much to gain in the steps before the BoW step. Having said that, could you share some of your code illustrating this issue?
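One quick way to check where the sensitivity comes from, as a hedged sketch (the model name and texts are illustrative): compare the raw embeddings of two documents that differ by a single word. If their similarity is close to 1, the embedding step is fine and the instability lies in the reduction and clustering steps that follow.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

doc_a = "Lisa Marie Presley died after a cardiac arrest at her home."
doc_b = "Lisa Marie Presley passed after a cardiac arrest at her home."

# Embed both documents and compare them before any reduction/clustering.
emb = model.encode([doc_a, doc_b])
print(cosine_similarity([emb[0]], [emb[1]]))  # expected: close to 1.0
```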