Replies: 1 comment
-
It's not clear from your code, but are you running this after saving and loading the model? Also, are the
-
When I use topics_over_time(), I get the following error:
topics_over_time = ab_topic_model.topics_over_time(ab_list, time_set)
ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 topics_over_time = ab_topic_model.topics_over_time(ab_list, time_set)
File d:\anaconda3\envs\bertopic\Lib\site-packages\bertopic\_bertopic.py:799, in BERTopic.topics_over_time(self, docs, timestamps, topics, nr_bins, datetime_format, evolution_tuning, global_tuning)
796 selection = documents.loc[documents.Timestamps == timestamp, :]
797 documents_per_topic = selection.groupby(['Topic'], as_index=False).agg({'Document': ' '.join,
798 "Timestamps": "count"})
--> 799 c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)
801 if global_tuning or evolution_tuning:
802 c_tf_idf = normalize(c_tf_idf, axis=1, norm='l1', copy=False)
File d:\anaconda3\envs\bertopic\Lib\site-packages\bertopic\_bertopic.py:3861, in BERTopic._c_tf_idf(self, documents_per_topic, fit, partial_fit)
3858 if fit:
3859 self.ctfidf_model = self.ctfidf_model.fit(X, multiplier=multiplier)
-> 3861 c_tf_idf = self.ctfidf_model.transform(X)
3863 return c_tf_idf, words
File d:\anaconda3\envs\bertopic\Lib\site-packages\sklearn\utils\_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
293 @wraps(f)
294 def wrapped(self, X, *args, **kwargs):
--> 295 data_to_wrap = f(self, X, *args, **kwargs)
296 if isinstance(data_to_wrap, tuple):
297 # only wrap the first output for cross decomposition
...
1076 )
1078 if ensure_min_features > 0 and array.ndim == 2:
1079 n_features = array.shape[1]
ValueError: Found array with 0 sample(s) (shape=(0, 31883)) while a minimum of 1 is required by the normalize function.
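The traceback points at the failing step: line 796 selects documents with `documents.loc[documents.Timestamps == timestamp, :]`, and that selection came back empty for at least one timestamp, producing the 0-row matrix that `normalize` rejects. One way this can happen is a type mismatch between the stored timestamps and the values being compared against them. A toy pandas sketch of that failure mode (the data is made up; the column names follow the traceback):

```python
import pandas as pd

# Mimic BERTopic's internal per-document frame (toy data)
documents = pd.DataFrame({
    "Document": ["a", "b", "c"],
    "Topic": [0, 0, 1],
    "Timestamps": ["2020", "2020", "2021"],  # stored as strings
})

# Comparing against an int never matches the string column -> empty selection,
# which is exactly the 0-sample matrix the ValueError complains about
empty = documents.loc[documents.Timestamps == 2020, :]
print(len(empty))   # 0

# A type-consistent comparison selects the expected rows
ok = documents.loc[documents.Timestamps == "2020", :]
print(len(ok))      # 2
```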
Below is my training code:
embedding_model = SentenceTransformer("\sentence_transformer\all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstract, show_progress_bar=True)
# Dimensionality reduction
umap_model = UMAP(n_neighbors=12, n_components=5, min_dist=0.0, metric='cosine', random_state=52)
# Clustering
hdbscan_model = HDBSCAN(min_cluster_size=60, min_samples=10, cluster_selection_epsilon=0, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
# Improve the topic representations
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
ctfidf_model = ClassTfidfTransformer()
keybert_model = KeyBERTInspired()
representation_model = {"KeyBERT": keybert_model}
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)
# Train the model
topics, probs = topic_model.fit_transform(abstract, embeddings)
topic_model.save("power_battery/model_saved/abstract_topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)
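Since the error shows up when calling `topics_over_time()`, it is also worth sanity-checking the inputs first: every document needs exactly one timestamp, and each timestamp value should actually contain documents. A minimal check, reusing the `ab_list`/`time_set` names from the question (the data here is only a placeholder):

```python
from collections import Counter

# Placeholder stand-ins for the real abstracts and timestamps
ab_list = ["doc one", "doc two", "doc three", "doc four"]
time_set = [2020, 2020, 2021, 2021]

# topics_over_time() expects one timestamp per document
assert len(ab_list) == len(time_set), "docs and timestamps must have the same length"

# Inspect how many documents fall into each timestamp value;
# a bin with no (or almost no) documents is a likely culprit
counts = Counter(time_set)
print(sorted(counts.items()))   # [(2020, 2), (2021, 2)]
```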