
Saving a trained model using pytorch and safetensor and then redownloading causes topics to be off #2198

Open
SkylarOconnell opened this issue Oct 25, 2024 · 12 comments
Labels: bug (Something isn't working)

SkylarOconnell commented Oct 25, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

After training, I tried saving the model using both pytorch and safetensors. When I re-download the model, load the files into BERTopic using BERTopic.load(), and run inference using transform(), all the topics come out differently from the original fit results. Below are some examples; in each pair, the first topic and probability are from the original training/fit of the model and the second is from running transform():

Topic: 2 Probability: 0.9999999985560923 vs. Topic: 3 Probability: 0.9999477863311768

Topic: 1 Probability: 0.9993163446248252 vs. Topic: 2 Probability: 0.04614641437377926

Topic: 2 Probability: 1.0 vs. Topic: 3 Probability: 0.9591490626335144

One thing to note is that running transform repeatedly produces the same results, which differ from the original training output. Also, when I run transform on the original model without saving it anywhere else, I get the same results as the original run. I was wondering if I am missing something with saving the model correctly. Below is the code I use to train, save, and run transform on the model. We also run reduce_outliers() before saving the model.

Reproduction

import numpy
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# `rows`, `self.docs`, `self.embeddings`, and `self.labels` are defined elsewhere in our class
self.model_params = {
    'min_topic_size': int((len(rows) / 160) - 1),
    'calculate_probabilities': True,
    'verbose': True,
    'umap_model': UMAP(
        n_neighbors=50,
        n_components=20,
        metric='cosine',
        low_memory=False,
        random_state=42,
    ),
}

self.model = BERTopic(**self.model_params)

# Fit on precomputed embeddings with labels
self.topics, self.probabilities = self.model.fit_transform(
    documents=self.docs,
    embeddings=numpy.array(self.embeddings),
    y=self.labels,
)

# Reassign outlier documents to their most probable topic
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)

self.model.update_topics(self.docs, topics=new_topics)

# Save with safetensors serialization
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
self.model.save(
    torch_file_path,
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)

# Reload the saved model and run inference
new_model = BERTopic.load(artifact_path)

new_model_temp_topics, new_model_temp_probabilities = new_model.transform(
    documents=self.docs,
    embeddings=numpy.array(self.embeddings),
)

BERTopic Version

0.16.0

SkylarOconnell added the bug (Something isn't working) label on Oct 25, 2024
MaartenGr (Owner) commented

You are using an older version of BERTopic and I remember that there were some fixes since then. Could you try it with the latest version instead? 0.16.4.
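For instance, pinning to that release with pip (a standard upgrade command, nothing BERTopic-specific):

pip install --upgrade bertopic==0.16.4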

SkylarOconnell (Author) commented

Got it, trying that now!

SkylarOconnell (Author) commented Oct 29, 2024

Just tried upgrading BERTopic to 0.16.4 and I still see the same issue.

Initial Training:
Item 1: Topic: 3 Probability: 0.9999968824322634
Item 2: Topic: 2 Probability: 0.883032787750728
Item 3: Topic: 4 Probability: 0.9902231709346468

Inference/Transform without saving:
Item 1: Topic: 3 Probability: 0.9999968824322634
Item 2: Topic: 2 Probability: 0.883032787750728
Item 3: Topic: 4 Probability: 0.9902231709346468

Inference/Transform after saving and redownloading using safetensors:
Item 1: Topic: 4 Probability: 0.9999788403511047
Item 2: Topic: 3 Probability: 0.9999911785125732
Item 3: Topic: 5 Probability: 0.999993085861206

All topics (except outliers) are coming out one higher than in the original run or in the original model without saving.
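A minimal sketch to quantify that shift, assuming original_topics and reloaded_topics (both hypothetical names) are the topic lists from the original fit and the reloaded model's transform(), aligned index-by-index:

import numpy as np

# original_topics: output of fit_transform() on the original model
# reloaded_topics: output of transform() on the reloaded model
orig = np.array(original_topics)
new = np.array(reloaded_topics)

# Ignore outlier documents (topic -1) and measure the +1 shift
mask = orig != -1
shifted = np.mean(new[mask] == orig[mask] + 1)
print(f"Non-outlier documents shifted by exactly +1: {shifted:.1%}")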

SkylarOconnell (Author) commented Oct 29, 2024

@MaartenGr I just tried saving with pytorch as well and got the same issue.

MaartenGr (Owner) commented

Hmmm, this is quite unexpected. I'm a bit baffled here, considering these probabilities are extremely high.

My guess would be that there is something going wrong with reducing outliers before updating and then saving the model. What would happen if you didn't reduce outliers?

SkylarOconnell (Author) commented

@MaartenGr Removing reduce outliers fixes the issue, and now I am getting the same results between the initial training and the inference run after downloading. Is there a way to keep reduce outliers, or is this a bug that would need to be fixed first?

MaartenGr (Owner) commented

@SkylarOconnell I'm not actually sure why this is happening. It could be that by reducing outliers so much, it distorts the newly created topic embeddings (topic_model.topic_embeddings_). You could choose to save the topic embeddings before outlier reduction, and then re-assign them after reducing outliers.

SkylarOconnell (Author) commented

@MaartenGr Could you provide an example for this? I'm not really sure how to do that.

MaartenGr (Owner) commented

@SkylarOconnell Sure!

# Track topic embeddings before reducing outliers
topic_embeddings = topic_model.topic_embeddings_

# Reduce outliers and update topics
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)
self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
topic_model.topic_embeddings_ = topic_embeddings

When doing this, check that the old topic embeddings are correctly assigned, as I'm not sure whether this creates a shallow or deep copy.
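One way to settle the shallow-vs-deep-copy question is to copy explicitly and compare afterwards; a minimal sketch, assuming topic_embeddings_ is a numpy array (not confirmed in this thread):

import numpy as np

# Deep-copy the topic embeddings up front so later calls cannot
# mutate the saved array in place
topic_embeddings = np.array(topic_model.topic_embeddings_, copy=True)

# ... reduce_outliers() and update_topics() as above ...

# Check whether update_topics() actually changed the embeddings,
# then restore the originals
changed = not np.array_equal(topic_model.topic_embeddings_, topic_embeddings)
print(f"update_topics() modified topic_embeddings_: {changed}")
topic_model.topic_embeddings_ = topic_embeddings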

SkylarOconnell (Author) commented Nov 11, 2024

@MaartenGr Sorry for the delayed response.

When I add in the code above (changing topic_model to self.model since we are using class variables), it goes back to the original issue. Could it be a bug in the interaction between reduce_outliers and pytorch/safetensors? Reduce outliers works and transform works until I save in either format and redownload.

# Track topic embeddings before reducing outliers
topic_embeddings = self.model.topic_embeddings_

new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)

self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
self.model.topic_embeddings_ = topic_embeddings

MaartenGr (Owner) commented

I'm not sure if I understand correctly. Just to make sure:

  • You double-checked that self.model.topic_embeddings_ now holds the old topic embeddings, right? That way we can be sure the old topic embeddings are kept.
  • If so, you get the same issue as before, right? The one where topics do not match up? Could you check how many do not match up (a quick sketch for this check follows below)? It is not uncommon that only 70% or so matches up, since transform() uses a different assignment procedure than fit_transform().
  • Lastly, do you have a fully reproducible example that I can use, along with data? Otherwise, it's difficult for me to debug this without more info.
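A minimal sketch for that count, assuming self.topics (original fit) and new_model_temp_topics (reloaded transform) from the reproduction above align index-by-index:

# Count how many documents keep the same topic after save + reload
matches = sum(a == b for a, b in zip(self.topics, new_model_temp_topics))
total = len(self.topics)
print(f"{matches}/{total} documents match ({matches / total:.1%})")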

SkylarOconnell (Author) commented

I will double-check the first bullet and let you know. If the topic embeddings are the same as the old embeddings, I will run a quick count to see how many are off. I'll respond here once I am able to do so.
