
Fixed issue #2144 #2180

Closed
wants to merge 1 commit into from

Conversation

@PipaFlores (Contributor) commented Oct 14, 2024

What does this PR do?

Fixes issue #2144 by optimizing the topic extraction process when using fit_transform() with nr_topics="auto" or an integer value to reduce topics. The main improvements are:

  1. Avoiding double calculation of topic representations, especially when using LLM-based representations.
  2. Introducing a calculate_representation parameter in the _extract_topics method to control when representations are calculated.
  3. Modifying the fit_transform method to calculate representations only after topic reduction (if needed).

These changes significantly improve performance, particularly when using computationally expensive representation models like LLMs.
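As a rough illustration of the third point, the reduction path could defer the expensive work along these lines (a simplified, hypothetical sketch; only the calculate_representation flag comes from this PR, the surrounding argument names are illustrative):

    # Simplified, hypothetical sketch of the fit_transform change (not the
    # literal BERTopic code).
    if self.nr_topics:
        # First pass: only the cheap c-TF-IDF base representation, so that
        # expensive (e.g. LLM-based) representation models are not run twice.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=False)
        documents = self._reduce_topics(documents)
        # Full representations are calculated once, on the reduced topics.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=True)
    else:
        # No reduction requested: a single pass with full representations.
        self._extract_topics(documents, verbose=self.verbose, calculate_representation=True)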

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.

This is easily solved by adding self._sort_mappings_by_frequency(documents) in _reduce_topics() at line 4369. However, that change runs into problems when pytest tests\test_representation\test_representations.py tests the edge case self.nr_topics >= initial_nr_topics, so the way this edge case is tested also needs to be fixed.

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
    • Extra log messages were added to improve debugging
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?
    • The tests pass as they are now; however, they do not account for the issue described above.

@PipaFlores (Contributor, Author) commented

Ok, so I solved the issue with test_representations.py::test_topic_reduction_edge_cases.

It just needed one line of code: old_documents = topic_model._sort_mappings_by_frequency(old_documents).
Therefore, the test now actually checks for ordered, and unchanged, old/new topics.
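For context, a paraphrased sketch of where that line sits in the adjusted test (only the sorting line is verbatim; setup and assertions are simplified):

    # Paraphrased sketch of test_topic_reduction_edge_cases, not the literal test body.
    # nr_topics is set to at least the number of clustered topics (the edge case).
    old_documents = topic_model._sort_mappings_by_frequency(old_documents)  # the added line
    new_documents = topic_model._reduce_topics(old_documents)
    # With the mappings sorted first, the old and new topic assignments can be
    # compared directly: they should be identical (no reduction happens) and
    # ordered by frequency.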

I hope this is enough and self-explanatory. I force-pushed into the same commit to keep the git history clean.

@MaartenGr (Owner) left a comment

Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.

One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here:

            else:
                logger.info(
                    f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
                )
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents

Comment on lines +3989 to +3993
    action = "Representation" if calculate_representation else "Topics"
    logger.info(
        f"{action} - Extracting topics from clusters{' using representation models' if calculate_representation else ''}."
    )

@MaartenGr (Owner) commented

Just to be sure I understood this correctly, are the two options as follows:

"Representation - Extracting topics from clusters using representation models"

Does this refer to the case where all representations are calculated (and not just c-TF-IDF)?

"Topics - Extracting topics from clusters"

Does this then refer to the case where only c-TF-IDF is calculated?

Also, perhaps it would be nice to create a variable for the second if/else statement the same way you did for the first.

    method = ' using representation models' if calculate_representation else ''
    logger.info(f"{action} - Extracting topics from clusters{method}.")

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Indeed, the "Topics - Extracting topics from clusters" refers to the case where only the c-TF-IDF is calculated.
And I would suggest changing it to "Extracting base representation from bag of words"

I can make the suggested change for the if/else statement, but please check my comment below.

@@ -3980,18 +3980,29 @@ def _extract_topics(
         embeddings: The document embeddings
         mappings: The mappings from topic to word
         verbose: Whether to log the process of extracting topics
+        calculate_representation: Whether to extract the topic representations
@MaartenGr (Owner) commented

Shouldn't this be the "additional" topic representations instead since c-TF-IDF is already calculated? I'm struggling a bit with having calculated c-TF-IDF before or not. Now it seems like c-TF-IDF is not calculated at all.

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Yes, it should be the "additional" representation.

The c-TF-IDF is always calculated, since it is not conditional on that boolean argument.
It is calculated on line 3995, self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic), and then used to calculate the default representation in _extract_words_per_topic:

    if not self.representation_model or not calculate_representation:
        # Default representation: c_tf_idf + top_n_words
        topics = {label: values[: self.top_n_words] for label, values in topics.items()}

    words,
    documents,
    calculate_representation=calculate_representation,
    calculate_aspects=calculate_representation,
@MaartenGr (Owner) commented

Just to be sure I understand this correctly: you added this here because, when you only calculate the base representation (calculate_representation=False), you do not want the additional representations (calculate_aspects=False), right?

If so, we might need to think about the term calculate_representation because it should refer to fine-tuning representations instead, right? Or did you name it that way because, technically, it might also be the default representation when you set it as "Main"?

@PipaFlores (Contributor, Author) commented

That's right.

Indeed, it is about fine-tuning. But yeah, the nomenclature here is confusing for me as well.

It could be changed to "fine_tune_representations".

I will gather my thoughts in one single comment below, since this touches on the other comments.

@@ -4468,7 +4488,7 @@ def _auto_reduce_topics(self, documents: pd.DataFrame, use_ctfidf: bool = False)
     # Update documents and topics
     self.topic_mapper_.add_mappings(mapped_topics, topic_model=self)
     documents = self._sort_mappings_by_frequency(documents)
-    self._extract_topics(documents, mappings=mappings)
+    self._extract_topics(documents, mappings=mappings, verbose=self.verbose)
@MaartenGr (Owner) commented

I see you added verbosity here. I purposefully left this to False to prevent too much logging but was wondering why you think it was necessary to add it.

@PipaFlores (Contributor, Author) commented Oct 21, 2024

Awesome PR! Very clean and focused on the most important components here. I left a couple of minor questions that have mostly to do with variable naming.

One last question is whether that edge case still appears that we talked about previously. Was it fully prevented by adding what you showed here:

            else:
                logger.info(
                    f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())})."
                )
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents

Yes, that solves the edge case. The issue I had before was that test_representation.py would raise a false negative, but I got that fixed by editing the test.

Regarding the other comments:

I think there is a higher-level confusion between the documented modularity and the current logging verbosity.

In the code, and logger calls, the pipeline is defined as:
Embeddings -> Dim. Red. -> Clustering -> Representation

However, in the documentation and paper, the pipeline is more detailed for the later steps, as:
Embeddings -> Dim. Red. -> Clustering -> Bag of Words (vectorizer + c-TF-IDF) -> Fine Tuning Representations

I could suggest a modification to the logging, if you would allow it, and put it here or in another PR. However, this requires some conceptual clarifications:

  1. When are topics "defined"? In my head, topics are defined during clustering, as documents are assigned a topic ID. However, it is also possible to argue that topics are defined at the end of the c-TF-IDF procedure, when we have a definition (based on the top-n keywords) for each topic. Then again, I can also imagine that c-TF-IDF is not defining the topic itself, but just its "representation", in which case the current logging of "Representation - Extracting topics from clusters" might make sense.

Also, the word "extract" is a bit ambiguous for me.

  2. What counts as a representation?
    Do you consider c-TF-IDF a representation model (in this case, the base representation, as it is always calculated)?
    Since it is a core part of the pipeline and is always calculated, I would assign this procedure to its own category and distinguish it from "representations" (as is also described in the docs). Currently, there is no logging or mention of c-TF-IDF or bag-of-words in the code when verbose=True; the log just skips from:
    BERTopic - Cluster - Completed ✓
    BERTopic - Representation - Extracting topics from clusters using representation models.
    BERTopic - Representation - Completed ✓

I would propose aligning the logging to the documented pipeline. As a rough example of a fit_transform run with verbosity (ignoring the "Completed ✓" lines for brevity):
    Embeddings - Transforming documents to embeddings
    Dim. Red. - Fitting the dimensionality reduction algorithm
    Clustering - Clustering the reduced embeddings
    Bag of Words - Calculating c-TF-IDF base representation of topics
    Fine Tuning Representation - Fine-tuning using the representation model
    Fine Tuning Representation - Calculating different aspects
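As a sketch only, the later stages could be logged from within _extract_topics along these lines (hypothetical messages; the surrounding code is simplified):

    # Hypothetical logging aligned with the documented pipeline, not existing
    # BERTopic output. The c-TF-IDF calculation and the representation check
    # follow the code discussed above.
    logger.info("Bag of Words - Calculating c-TF-IDF base representation of topics")
    self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
    if self.representation_model and calculate_representation:
        logger.info("Fine Tuning Representation - Fine-tuning using the representation model")
        # ... run the (possibly LLM-based) representation models here ...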

As a non-CS researcher, I find it very helpful to understand and adapt the different steps of the implementation through the logging; that is also why I added the extra verbosity in self._extract_topics(documents, mappings=mappings, verbose=self.verbose).

Long story short, if what I reflected above makes sense, I can present a proposal on how to align the logger to the documented pipeline. This will probably help resolve the other comments you made.

Otherwise, I can make small modifications for the specific comments you made and come back with my logging suggestions in a new issue or PR.

@PipaFlores closed this Oct 21, 2024
@PipaFlores (Contributor, Author) commented

Ok, I closed this PR because I found a much cleaner solution. I will make a new one; you will notice it has a minimal change that might not even require major review. I will open the logging discussion as a new issue, to distinguish the two issues and make my point better.

Development

Successfully merging this pull request may close these issues.

Representations from representation_models are generated twice when nr_topics="auto".