Topic embedding dimensions do not match with topic representation dimensions after outlier reduction #1552

VickyAnP · 2023-09-29T09:45:44Z

Hi @MaartenGr! Thank you for the package. Really useful for an array of applications!

It would be great if you could suggest some directions for a few issues I encountered:

When I apply outlier reduction, the topic embedding array no longer matches the dimensions of the new topic data frame from `topic_model.get_topic_info()`. Of course, outlier reduction removes the noise topic, but I am now unsure whether the topic embeddings still line up with the topic representations. Is it as simple as ignoring the first row of the array?

I use the model to extract topics from around 300,000 documents. In principle I should have around 5%-10% outliers, but without any outlier reduction almost half of my documents end up in the noise cluster with the model below. Do you have any suggestions for this case?

Model:
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

sentence_model = SentenceTransformer(MODEL)

umap_model_chosen = UMAP(n_neighbors=10, min_dist=0.05, metric='cosine', random_state=42)

hdbscan_model_chosen = HDBSCAN(min_cluster_size=10, metric='euclidean', gen_min_span_tree=True,
                               cluster_selection_method='eom', prediction_data=True, min_samples=5)

vectorizer_model = CountVectorizer(stop_words="english", lowercase=True, ngram_range=(1, 2), min_df=2)

ctfidf_model_chosen = ClassTfidfTransformer(reduce_frequent_words=True)

representation_model_chosen = MaximalMarginalRelevance(diversity=0.9)

topic_model = BERTopic(umap_model=umap_model_chosen,
                       hdbscan_model=hdbscan_model_chosen,
                       vectorizer_model=vectorizer_model,
                       calculate_probabilities=False,  # set it to False to speed up computation time
                       nr_topics=654,
                       language="english",
                       # embedding_model=sentence_model,
                       ctfidf_model=ctfidf_model_chosen,
                       representation_model=representation_model_chosen,
                       top_n_words=50,
                       min_topic_size=100)  # not used when a custom hdbscan_model is passed
# Fit on the raw texts together with the precomputed document embeddings
topics_calc, probs = topic_model.fit_transform(abstracts_text, abstract_embeddings)

# Reduce outliers using the `c-tf-idf` strategy

new_topics = topic_model.reduce_outliers(abstracts_text, topics_calc)
new_topics = topic_model.reduce_outliers(abstracts_text, topics_calc, strategy="c-tf-idf")
embedd = topic_model.topic_embeddings_

MaartenGr · 2023-10-03T11:53:20Z

Maintainer

> When I apply outlier reduction, the topic embedding array does not match the dimensions of the new topic data frame from `topic_model.get_topic_info()`.

That might be a result of using `abstract_embeddings`. What dimensions do they have, and which is smaller or bigger? Also, `.reduce_outliers` does not change the fitted BERTopic model at all; it only returns new topic assignments, so it should not change the embedding sizes.
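
To compare the two sizes, a quick check along these lines may help. It is a minimal sketch, not part of BERTopic: `align_topic_embeddings` is a hypothetical helper, and it assumes that the rows of `topic_model.topic_embeddings_` follow the sorted topic IDs (so the `-1` outlier topic, if present, would be row 0). That ordering is an assumption worth verifying for your installed version.

```python
from typing import Any, Dict, Iterable, Sequence


def align_topic_embeddings(
    topic_ids: Iterable[int], embeddings: Sequence[Any]
) -> Dict[int, Any]:
    """Pair each topic ID with one embedding row.

    Assumption (not confirmed in this thread): rows are ordered by sorted
    topic ID, so an outlier topic -1 would occupy row 0.
    """
    ordered_ids = sorted(topic_ids)
    if len(ordered_ids) != len(embeddings):
        # A mismatch means the topic table and the embeddings are out of sync.
        raise ValueError(
            f"{len(ordered_ids)} topics but {len(embeddings)} embedding rows"
        )
    return dict(zip(ordered_ids, embeddings))


# Hypothetical usage with a fitted model:
#   info = topic_model.get_topic_info()
#   mapping = align_topic_embeddings(
#       info["Topic"].tolist(), topic_model.topic_embeddings_
#   )
```

If the check raises, the topic table and the embeddings were probably built from different topic assignments. The list returned by `.reduce_outliers` only reaches the model once it is passed to `topic_model.update_topics(docs, topics=new_topics)`; whether that call also refreshes `topic_embeddings_` may depend on the version.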

> I use the model to extract topics from around 300,000 documents. In principle I should have around 5%-10% outliers, but without any outlier reduction almost half of my documents end up in the noise cluster with the model below. Do you have any suggestions for this case?

You could try adjusting `min_cluster_size` and `min_samples` in HDBSCAN, since they have the biggest influence on how many outliers are created. You can find a few more tips in the FAQ.
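
One way to approach this is to sweep both parameters and compare the share of documents labelled `-1`. The sketch below is illustrative only: `outlier_fraction` and `sweep_outlier_fraction` are hypothetical helpers, and `fit_labels` stands for any callable you supply that clusters your reduced embeddings and returns one label per document.

```python
from itertools import product
from typing import Callable, Iterable, List, Sequence, Tuple


def outlier_fraction(labels: Sequence[int]) -> float:
    """Share of documents assigned to the noise label -1."""
    if len(labels) == 0:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def sweep_outlier_fraction(
    fit_labels: Callable[[int, int], Sequence[int]],
    min_cluster_sizes: Iterable[int],
    min_samples_values: Iterable[int],
) -> List[Tuple[int, int, float]]:
    """Try every parameter pair and sort results by outlier share, lowest first."""
    results = [
        (mcs, ms, outlier_fraction(fit_labels(mcs, ms)))
        for mcs, ms in product(min_cluster_sizes, min_samples_values)
    ]
    return sorted(results, key=lambda row: row[2])


# Hypothetical usage, assuming `reduced` holds the UMAP-reduced embeddings:
#   from hdbscan import HDBSCAN
#   def fit_labels(mcs, ms):
#       return HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(reduced)
#   print(sweep_outlier_fraction(fit_labels, [10, 50, 100], [1, 5, 10]))
```

With around 300,000 documents each fit is expensive, so running the sweep on a random sample first is probably sensible.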
