Stop words come up after outlier reduction #2067

rja122277 · 2024-06-24T19:48:40Z

I believe I successfully removed the stop words in the original model, as I don't see them appearing in the topics. However, after reducing the outliers, I notice that all those stop words reappear in the newly set topics. It seems that the stop word removal does not extend to the task of outlier reduction. I need some help with this!

Below is my code:

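from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# mylist: my custom stop-word list; data1sent: my list of sentences (both defined earlier)
embedding_model = SentenceTransformer('all-mpnet-base-v2')
umap_model16 = UMAP(n_neighbors=16)
hdbscan_model151 = HDBSCAN(min_cluster_size=15, min_samples=1,
                        gen_min_span_tree=True,
                        prediction_data=True)
# Stop words are removed here, at the topic-representation step, not before embedding
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=mylist)
ctfidf_model = ClassTfidfTransformer()

model1 = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model16,
    hdbscan_model=hdbscan_model151,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    language='english',
    calculate_probabilities=True,
    verbose=True
)
topics1, probs1 = model1.fit_transform(data1sent)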

Then I used four different strategies to reduce the outliers.

topic_model = model1

new_topics1 = topic_model.reduce_outliers(data1sent, topics1, probabilities=probs1, strategy="probabilities")
new_topics2 = topic_model.reduce_outliers(data1sent, topics1, strategy="distributions")
new_topics3 = topic_model.reduce_outliers(data1sent, topics1, strategy="c-tf-idf")
new_topics4 = topic_model.reduce_outliers(data1sent, topics1, strategy="embeddings")

In all cases, when I represent topics using get_topics(), I can see all the stopwords that I removed through vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=mylist) in the original model reappearing in the topics.

Is there any way I can remove these stopwords from the topics generated through outlier reduction? I don't want to remove them before generating embeddings because I'm concerned it might ruin the original meanings of each sentence.

Thanks for your time in advance.

MaartenGr (Maintainer) · 2024-06-25T12:27:55Z

I believe there might be some code missing, like .update_topics from your example. If so, make sure that .update_topics uses the same parameters as when you initialize BERTopic.

5 replies

rja122277 Jun 25, 2024
Author

Oh, I skipped describing this part. I did use .update_topics with the following code:

topic_model1 = topic_model
topic_model1.update_topics(data1sent, topics=new_topics1)
topic_model1.get_topic_info()

The topics resulting from the last line include the stopwords, which is the problem I'm facing.

I'm not sure if I understand what you mean by "when you initialize BERTopic." Perhaps this is something I missed. Could you tell me more about what it means to initialize BERTopic and how I can do that using .update_topics? Thank you again!

rja122277 Jun 25, 2024
Author

One more related question - Is there any way I can save the probabilities for each topic for each document (data1sent) after updating the topics? Thank you!

MaartenGr Jun 26, 2024
Maintainer

> I'm not sure if I understand what you mean by "when you initialize BERTopic." Perhaps this is something I missed. Could you tell me more about what it means to initialize BERTopic and how I can do that using .update_topics? Thank you again!

When you run .update_topics you should do it like this:

topic_model.update_topics(
    docs=docs,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)

You should include the vectorizer and ctfidf models as noted in the documentation here.
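For the outlier-reduction case in this thread, the two calls can be chained. The following is a minimal sketch, not the library's prescribed workflow: the helper name reduce_and_refresh is hypothetical, and it assumes that .update_topics builds a fresh default vectorizer (without the custom stop words) when vectorizer_model is not passed. It only relies on the reduce_outliers and update_topics methods already used above.

```python
def reduce_and_refresh(topic_model, docs, topics, vectorizer_model, ctfidf_model,
                       strategy="c-tf-idf"):
    """Reassign outlier documents, then rebuild topic words with the original models.

    `reduce_and_refresh` is a hypothetical helper name, not part of BERTopic.
    """
    # Step 1: reassign documents labeled -1 using the chosen strategy.
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)

    # Step 2: recompute the topic representations for the new assignments.
    # Passing the same vectorizer and c-TF-IDF models that were given to
    # BERTopic(...) keeps the stop-word list in effect (assumption: without
    # them, update_topics uses default models that know nothing about it).
    topic_model.update_topics(
        docs,
        topics=new_topics,
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model,
    )
    return new_topics
```

With the names from the original post, this would be called as new_topics = reduce_and_refresh(model1, data1sent, topics1, vectorizer_model, ctfidf_model).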

> One more related question - Is there any way I can save the probabilities for each topic for each document (data1sent) after updating the topics? Thank you!

You can use topic_model.probabilities_ to access the updated probabilities.
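To save those values to disk, something like the sketch below could work. It assumes topic_model.probabilities_ is a documents-by-topics array, as it is when calculate_probabilities=True is set; the helper name save_probabilities and the column names doc_index and topic_N are illustrative choices, not BERTopic conventions.

```python
import csv


def save_probabilities(probabilities, path):
    """Write one CSV row per document: its index, then one probability per topic.

    `probabilities` can be any 2D sequence (list of lists or a NumPy array).
    """
    rows = [list(row) for row in probabilities]
    n_topics = len(rows[0]) if rows else 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header: document index followed by one column per topic.
        writer.writerow(["doc_index"] + [f"topic_{i}" for i in range(n_topics)])
        for doc_index, row in enumerate(rows):
            # float() converts NumPy scalars to plain Python floats.
            writer.writerow([doc_index] + [float(p) for p in row])
```

For example, save_probabilities(topic_model.probabilities_, "probabilities.csv") writes one row per document in data1sent, in the original document order.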

rja122277 Jun 29, 2024
Author

It works very well. I very much appreciate your help!

After reducing outliers and updating the topics accordingly, I found that get_representative_docs() still returns the same results as the old model. For example, it still includes representative documents for the -1 cluster, although the updated model no longer has a -1 cluster. Is there any way to access the new representative documents for the new topic clusters?

MaartenGr Jun 29, 2024
Maintainer

Hmmm, that might be a bug, unfortunately. There is a private function called ._extract_representative_docs that you can use to regenerate the representative documents, although a fix might already be available in the main branch (but I'm not sure!).
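Because that function is private and its signature may differ between versions, a dependency-free stand-in is sketched here instead. To be clear, this is not BERTopic's own selection method: it simply ranks each topic's documents by their probability for that topic. It assumes column i of the probability matrix corresponds to topic i, and the helper name top_docs_per_topic is hypothetical.

```python
from collections import defaultdict


def top_docs_per_topic(docs, topics, probabilities, n_docs=3):
    """Return {topic: [documents]} with each topic's highest-probability documents.

    A stand-in for representative documents, not BERTopic's own algorithm.
    Assumes probabilities[d][t] is the probability of document d for topic t.
    """
    scored = defaultdict(list)
    for doc, topic, probs in zip(docs, topics, probabilities):
        if topic == -1:
            # Skip any documents that are still marked as outliers.
            continue
        scored[topic].append((float(probs[topic]), doc))

    return {
        topic: [doc for _, doc in
                sorted(pairs, key=lambda pair: pair[0], reverse=True)[:n_docs]]
        for topic, pairs in scored.items()
    }
```

With the names from this thread, top_docs_per_topic(data1sent, new_topics1, probs1) would return a dictionary keyed by the updated topic IDs, so no -1 entry appears once every outlier has been reassigned.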
