Replies: 1 comment
-
That might be as a result of using
You could try playing around with the |
-
Hi @MaartenGr! Thank you for the package. Really useful for an array of applications!
It would be great if you could suggest some directions for a few issues I encountered:
When I apply outlier reduction, the topic embedding array no longer matches the dimensions of the new topic data frame from topic_model.get_topic_info(), since the reduction removes the noise topic. I am now unsure whether the topic embeddings still line up with the topic representations. Is it as simple as ignoring the first value in the array?
I use the model to extract topics from around 300,000 documents. In principle I should have around 5%-10% outliers, but if I don't apply any outlier reduction, almost half of my documents end up in the noise cluster with the model below. Do you have any suggestions for what to do in this case?
Model:
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# MODEL, abstracts_text and abstract_embeddings are defined elsewhere
sentence_model = SentenceTransformer(MODEL)
umap_model_chosen = UMAP(n_neighbors=10, min_dist=0.05, metric='cosine', random_state=42)
hdbscan_model_chosen = HDBSCAN(min_cluster_size=10, metric='euclidean', gen_min_span_tree=True,
                               cluster_selection_method='eom', prediction_data=True, min_samples=5)
vectorizer_model = CountVectorizer(stop_words="english", lowercase=True, ngram_range=(1, 2), min_df=2)
ctfidf_model_chosen = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model_chosen = MaximalMarginalRelevance(diversity=0.9)
topic_model = BERTopic(umap_model=umap_model_chosen,
                       hdbscan_model=hdbscan_model_chosen,
                       vectorizer_model=vectorizer_model,
                       calculate_probabilities=False,  # set it to False to speed up computation time
                       nr_topics=654,
                       language="english",
                       # embedding_model=sentence_model,
                       ctfidf_model=ctfidf_model_chosen,
                       representation_model=representation_model_chosen,
                       top_n_words=50,
                       min_topic_size=100)
topics_calc, probs = topic_model.fit_transform(abstracts_text, abstract_embeddings)
# Reduce outliers using the c-tf-idf strategy
new_topics = topic_model.reduce_outliers(abstracts_text, topics_calc)
new_topics = topic_model.reduce_outliers(abstracts_text, topics_calc, strategy="c-tf-idf")
embedd = topic_model.topic_embeddings_
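To make the two questions concrete, here is a minimal sketch on toy data, not on the real model. It measures the share of documents in the noise topic and pairs embedding rows with topic ids. The helper names `outlier_share` and `align_embeddings` are hypothetical, and the row ordering (sorted by topic id, with a leading row for topic -1 that may linger after outlier reduction) is an assumption that is exactly what the question asks about, so it should be checked against the library before relying on it.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


def align_embeddings(topic_ids, embeddings):
    """Map each topic id to its embedding row.

    Assumption (not confirmed in this thread): rows are ordered by sorted
    topic id, and a leftover row for topic -1 sits at index 0 when the
    array has exactly one row more than there are topic ids.
    """
    ids = sorted(set(topic_ids))
    if len(embeddings) == len(ids):
        rows = embeddings
    elif -1 not in ids and len(embeddings) == len(ids) + 1:
        rows = embeddings[1:]  # skip the leftover outlier row
    else:
        raise ValueError("embedding rows do not line up with topic ids")
    return dict(zip(ids, rows))


# Toy stand-ins for topics_calc, new_topics and topic_model.topic_embeddings_.
before = [-1, -1, 0, 1, -1, 0]
after = [0, 1, 0, 1, 1, 0]
toy_embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # rows for -1, 0, 1

print(outlier_share(before))  # 0.5
print(outlier_share(after))   # 0.0
print(align_embeddings(after, toy_embeddings))
```

Raising an error on any other size mismatch is deliberate: silently dropping the first row is only safe if the assumption about the ordering actually holds.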