-
Notifications
You must be signed in to change notification settings - Fork 771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #2102 #2105
Conversation
# No need to sort if it's the first pass of zero-shot topic modeling | ||
nr_zeroshot = len(self._topic_id_to_zeroshot_topic_idx) | ||
if self._is_zeroshot and not self.nr_topics and nr_zeroshot > 0: | ||
return documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the plan was to update self._topic_id_to_zeroshot_topic_idx
in ._sort_mappings_by_frequency
if zeroshot, and then not use self.topic_mapper_.get_mappings()
in .topic_labels_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, I wasn't familiar enough with self._topic_id_to_zeroshot_topic_idx
so wasn't sure the best way to change it. This seemed faster and is working now but if you have suggestions on how to change it, feel free to share.
@@ -467,7 +469,6 @@ def fit_transform( | |||
# All documents matches zero-shot topics | |||
documents = assigned_documents | |||
embeddings = assigned_embeddings | |||
topics_before_reduction = self.topics_ | |||
|
|||
# Sort and Map Topic IDs by their frequency | |||
if not self.nr_topics: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does sorting mappings not occur if nr_topics
is passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The topics_before_reduction
were used for calculating the probabilities and after I made a couple of changes, it started giving index errors when attempting to access the similarity matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, if I add nr_topics="auto"
when training BERTopic using the latest commit in this PR, I get the following error:
from bertopic.representation import KeyBERTInspired
# KeyBERT
keybert_model = KeyBERTInspired()
# Pass the above models to be used in BERTopic
topic_model = UpdatedBERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
verbose=True,
zeroshot_topic_list=zeroshot_topic_list,
zeroshot_min_similarity=.5,
nr_topics="auto",
calculate_probabilities=True,
representation_model=keybert_model
)
topic_model = topic_model.fit(abstracts, embeddings)
Then that gives me the following error:
IndexError Traceback (most recent call last)
Cell In[13], line 18
6 # Pass the above models to be used in BERTopic
7 topic_model = UpdatedBERTopic(
8 embedding_model=embedding_model,
9 umap_model=umap_model,
(...)
16 representation_model=keybert_model
17 )
---> 18 topic_model = topic_model.fit(abstracts, embeddings)
File [~\Documents\Projects\BERTopic\bertopic\_bertopic.py:364](http://localhost:8888/lab/tree/Documents/Projects/~/Documents/Projects/BERTopic/bertopic/_bertopic.py#line=363), in BERTopic.fit(self, documents, embeddings, images, y)
322 def fit(
323 self,
324 documents: List[str],
(...)
327 y: Union[List[int], np.ndarray] = None,
328 ):
329 """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
330
331 Arguments:
(...)
362 ```
363 """
--> 364 self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
365 return self
Cell In[12], line 300, in UpdatedBERTopic.fit_transform(self, documents, embeddings, images, y)
292 else:
293 # Use `topics_before_reduction` because `self.topics_` may have already been updated from
294 # reducing topics, and the original probabilities are needed for `self._map_probabilities()`
295 probabilities = sim_matrix[
296 np.arange(len(documents)),
297 np.array(topics_before_reduction) + self._outliers,
298 ]
--> 300 self.probabilities_ = self._map_probabilities(probabilities, original_topics=True)
301 predictions = documents.Topic.to_list()
303 return predictions, self.probabilities_
File [~\Documents\Projects\BERTopic\bertopic\_bertopic.py:4581](http://localhost:8888/lab/tree/Documents/Projects/~/Documents/Projects/BERTopic/bertopic/_bertopic.py#line=4580), in _map_probabilities(self, probabilities, original_topics)
4578 if to_topic != -1 and from_topic != -1:
4579 mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
-> 4581 return mapped_probabilities
4583 return probabilities
IndexError: index 71 is out of bounds for axis 1 with size 71
@ianrandman If you have the time, could you take a look at the open comments? If we can resolve them, I can start putting out a new minor version with the fix. |
Sorry about the delay. I can get back to you within a day with more detailed comments. Some quick comments I can give now are that it looks like your solution uses my temp fix for the topic labels property rather than changing the zeroshot idx dict as topic mapping changes. Using the temp fix results in incorrectness (I think) with topic reduction because the implemtation there is to change the zeroshot idx dict as the topic mapping changed. I think that anytime the zeroshot idx dict is looked at, it should be correct without needing to apply the mapping from the topic mapper. As far as I know, all that is needed to achieve that, given the current implementation, is mentioned in #2105 (comment). Let me know your thoughts on this and whether I (hopefully don't) need to provide more detail. |
@ianrandman Thank you for the quick reply!
Hmm, I tested the topic reduction a couple of times and although it works it indeed does merge the zero-shot topics.
I've been looking through the zeroshot idx dict but I'm just not sure where and how to update it for your proposed fix. It seems that I do not have a sufficient enough grasp on the logic here to make the changes. |
What does this PR do?
Fixes #2102
This fixes a number of things:
ValueError
whennr_topics="auto"
is used combined with zero-shot topic modelingTopicMapper
at the right moment so that all topics are added to the mapper and not only a subsetI updated the way probabilities were returned as I faced some issues with selecting the correct topics. I might change it back after more testing.
Before submitting