
Fixed issue #2144 #2180

Closed
wants to merge 1 commit into from

Conversation

@PipaFlores (Contributor) commented Oct 14, 2024

What does this PR do?

Fixes issue #2144 by optimizing the topic extraction process when using fit_transform() with nr_topics="auto" or an integer value to reduce topics. The main improvements are:

  1. Avoiding double calculation of topic representations, especially when using LLM-based representations.
  2. Introducing a calculate_representation parameter in the _extract_topics method to control when representations are calculated.
  3. Modifying the fit_transform method to calculate representations only after topic reduction (if needed).

These changes significantly improve performance, particularly when using computationally expensive representation models like LLMs.
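As a rough illustration of the third point, the reduction path could defer the expensive work along these lines (a simplified, hypothetical sketch; only the calculate_representation flag comes from this PR, the surrounding argument names are illustrative):

    # Simplified, hypothetical sketch of the fit_transform change (not the
    # literal BERTopic code).
    if self.nr_topics:
        # First pass: only the cheap c-TF-IDF base representation, so that
        # expensive (e.g. LLM-based) representation models are not run twice.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=False)
        documents = self._reduce_topics(documents)
        # Full representations are calculated once, on the reduced topics.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=True)
    else:
        # No reduction requested: a single pass with full representations.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=True)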

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.

This is easily solved by adding self._sort_mappings_by_frequency(documents) in _reduce_topics() at line 4369. However, that change runs into problems when pytest tests\test_representation\test_representations.py tests the edge case self.nr_topics >= initial_nr_topics, so the way this edge case is tested also needs to be fixed.

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
    • Extra log messages were added to improve debugging
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?
    • The tests pass as they are now; however, they do not account for the issue described above.

@PipaFlores (Contributor, Author) commented

Ok, so I solved the issue with test_representations.py::test_topic_reduction_edge_cases.

It just needed one line of code: old_documents = topic_model._sort_mappings_by_frequency(old_documents).
Therefore, the test now actually checks for ordered, and unchanged, old/new topics.
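For context, a paraphrased sketch of where that line sits in the adjusted test (only the sorting line is verbatim; setup and assertions are simplified):

    # Paraphrased sketch of test_topic_reduction_edge_cases, not the literal test body.
    # nr_topics is set to at least the number of clustered topics (the edge case).
    old_documents = topic_model._sort_mappings_by_frequency(old_documents)  # the added line
    new_documents = topic_model._reduce_topics(old_documents)
    # With the mappings sorted first, the old and new topic assignments can be
    # compared directly: they should be identical (no reduction happens) and
    # ordered by frequency.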

I hope this is enough and self-explanatory. I force-pushed into the same commit to keep the git history clean.

@MaartenGr (Owner) left a comment

Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.

One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here:

            else:
                logger.info(
                    f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
                )
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents

Comment on lines +3989 to +3993
    action = "Representation" if calculate_representation else "Topics"
    logger.info(
        f"{action} - Extracting topics from clusters{' using representation models' if calculate_representation else ''}."
    )

@MaartenGr (Owner) commented

Just to be sure I understood this correctly, are the two options as follows:

"Representation - Extracting topics from clusters using representation models"

Does this refer to the case where all representations are calculated (and not just c-TF-IDF)?

"Topics - Extracting topics from clusters"

Does this then refer to the case where only c-TF-IDF is calculated?

Also, perhaps it would be nice to create a variable for the second if/else statement the same way you did for the first.

    method = ' using representation models' if calculate_representation else ''
    logger.info(f"{action} - Extracting topics from clusters{method}.")

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Indeed, the "Topics - Extracting topics from clusters" refers to the case where only the c-TF-IDF is calculated.
And I would suggest changing it to "Extracting base representation from bag of words"

I can make the suggested change for the if/else statement, but please check my comment below.

@@ -3980,18 +3980,29 @@ def _extract_topics(
         embeddings: The document embeddings
         mappings: The mappings from topic to word
         verbose: Whether to log the process of extracting topics
+        calculate_representation: Whether to extract the topic representations
@MaartenGr (Owner) commented

Shouldn't this be the "additional" topic representations instead since c-TF-IDF is already calculated? I'm struggling a bit with having calculated c-TF-IDF before or not. Now it seems like c-TF-IDF is not calculated at all.

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Yes, it should be the "additional" representation.

The c-TF-IDF is always calculated, since it is not conditional on that boolean argument.
It is calculated on line 3995, self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic), and then used to calculate the default representation in _extract_words_per_topic:

    if not self.representation_model or not calculate_representation:
        # Default representation: c_tf_idf + top_n_words
        topics = {label: values[: self.top_n_words] for label, values in topics.items()}

    words,
    documents,
    calculate_representation=calculate_representation,
    calculate_aspects=calculate_representation,
@MaartenGr (Owner) commented

Just to be sure I understand this correctly: you added this here because, when you only calculate the base representation (calculate_representation=False), you do not want the additional representations (calculate_aspects=False), right?

If so, we might need to think about the term calculate_representation because it should refer to fine-tuning representations instead, right? Or did you name it that way because, technically, it might also be the default representation when you set it as "Main"?

@PipaFlores (Contributor, Author) commented

That's right.

Indeed, it is about fine-tuning. But yeah, the nomenclature here is confusing for me as well.

It could be changed to "fine_tune_representations".

I will gather my thoughts in one single comment below, since this touches on the other comments.

@@ -4468,7 +4488,7 @@ def _auto_reduce_topics(self, documents: pd.DataFrame, use_ctfidf: bool = False)
     # Update documents and topics
     self.topic_mapper_.add_mappings(mapped_topics, topic_model=self)
     documents = self._sort_mappings_by_frequency(documents)
-    self._extract_topics(documents, mappings=mappings)
+    self._extract_topics(documents, mappings=mappings, verbose=self.verbose)
@MaartenGr (Owner) commented

I see you added verbosity here. I purposefully left this to False to prevent too much logging but was wondering why you think it was necessary to add it.

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.

One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here:

            else:
                logger.info(
                    f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
                )
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents

Yes, that solves the edge case. The issue I had before was that test_representation.py would raise a false negative, but I got that fixed by editing the test.

Regarding the other comments:

I think there is a higher-level confusion between the documented modularity and the current logging verbosity.

In the code, and logger calls, the pipeline is defined as:
Embeddings -> Dim. Red. -> Clustering -> Representation

However, in the documentation and paper, the pipeline is more detailed for the later steps, as:
Embeddings -> Dim. Red. -> Clustering -> Bag of Words (vectorizer + c-TF-IDF) -> Fine Tuning Representations

I could suggest a modification to the logging, if you would allow it, and put it here or in another PR. However, this requires some conceptual clarifications:

  1. When are topics "defined"? In my head, topics are defined during clustering, as documents are assigned a topic ID. However, it is also possible to argue that topics are defined at the end of the c-TF-IDF procedure, when we have a definition (based on the top-n keywords) for each topic. Then again, I can also imagine that c-TF-IDF is not defining the topic itself, but just its "representation", in which case the current logging of "Representation - Extracting topics from clusters" might make sense.

Also, the word "extract" is a bit ambiguous for me.

  2. What counts as a representation?
    Do you consider c-TF-IDF a representation model (in this case, the base representation, as it is always calculated)?
    Since it is a core part of the pipeline and is always calculated, I would assign this procedure to its own category and distinguish it from "representations" (as is also described in the docs). Currently, there is no logging or mention of c-TF-IDF or bag-of-words in the code when verbose=True; the log just skips from:
    BERTopic - Cluster - Completed ✓
    BERTopic - Representation - Extracting topics from clusters using representation models.
    BERTopic - Representation - Completed ✓

I would propose aligning the logging to the documented pipeline. As a rough example of a fit_transform run with verbosity (ignoring the "Completed ✓" lines for brevity):
    Embeddings - Transforming documents to embeddings
    Dim. Red. - Fitting the dimensionality reduction algorithm
    Clustering - Clustering the reduced embeddings
    Bag of Words - Calculating c-TF-IDF base representation of topics
    Fine Tuning Representation - Fine-tuning using the representation model
    Fine Tuning Representation - Calculating different aspects
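As a sketch only, the later stages could be logged from within _extract_topics along these lines (hypothetical messages; the surrounding code is simplified):

    # Hypothetical logging aligned with the documented pipeline, not existing
    # BERTopic output. The c-TF-IDF calculation and the representation check
    # follow the code discussed above.
    logger.info("Bag of Words - Calculating c-TF-IDF base representation of topics")
    self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
    if self.representation_model and calculate_representation:
        logger.info("Fine Tuning Representation - Fine-tuning using the representation model")
        # ... run the (possibly LLM-based) representation models here ...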

As a non-CS researcher, I find it very helpful to understand and adapt the different steps of the implementation through the logging; that is also why I added the extra verbosity in self._extract_topics(documents, mappings=mappings, verbose=self.verbose).

Long story short, if what I reflected above makes sense, I can present a proposal on how to align the logger to the documented pipeline. This will probably help resolve the other comments you made.

Otherwise, I can make small modifications for the specific comments you made and come back with my logging suggestions in a new issue or PR.

@PipaFlores closed this Oct 21, 2024
@PipaFlores (Contributor, Author) commented

Ok, I closed this PR because I found a much cleaner solution. I will make a new one; you will notice it has a minimal change that might not even require major review. I will open the logging discussion as a new issue, to distinguish the two issues and make my point better.

Development

Successfully merging this pull request may close these issues.

Representations from representation_models are generated twice when nr_topics="auto".