
Saving a trained model using pytorch and safetensor and then redownloading causes topics to be off #2198

Open
SkylarOconnell opened this issue Oct 25, 2024 · 12 comments
Labels: bug (Something isn't working)

SkylarOconnell commented Oct 25, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

After training, I tried saving the model using both pytorch and safetensors. When I re-download the model, load the files into BERTopic using BERTopic.load(), and run inference using transform(), all the topics come out differently from the original fit results. Below are some examples; in each pair, the first topic and probability are from the original training/fit of the model and the second is from running transform():

Topic: 2 Probability: 0.9999999985560923 vs. Topic: 3 Probability: 0.9999477863311768

Topic: 1 Probability: 0.9993163446248252 vs. Topic: 2 Probability: 0.04614641437377926

Topic: 2 Probability: 1.0 vs. Topic: 3 Probability: 0.9591490626335144

One thing to note is that running transform repeatedly produces the same results, which differ from the original training output. Also, when I run transform on the original model without saving it anywhere else, I get the same results as the original run. I was wondering if I am missing something with saving the model correctly. Below is the code I use to train, save, and run transform on the model. We also run reduce_outliers() before saving the model.

Reproduction

import numpy
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# `rows`, `self.docs`, `self.embeddings`, and `self.labels` are defined elsewhere in our class
self.model_params = {
    'min_topic_size': int((len(rows) / 160) - 1),
    'calculate_probabilities': True,
    'verbose': True,
    'umap_model': UMAP(
        n_neighbors=50,
        n_components=20,
        metric='cosine',
        low_memory=False,
        random_state=42,
    ),
}

self.model = BERTopic(**self.model_params)

# Fit on precomputed embeddings with labels
self.topics, self.probabilities = self.model.fit_transform(
    documents=self.docs,
    embeddings=numpy.array(self.embeddings),
    y=self.labels,
)

# Reassign outlier documents to their most probable topic
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)

self.model.update_topics(self.docs, topics=new_topics)

# Save with safetensors serialization
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
self.model.save(
    torch_file_path,
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)

# Reload the saved model and run inference
new_model = BERTopic.load(artifact_path)

new_model_temp_topics, new_model_temp_probabilities = new_model.transform(
    documents=self.docs,
    embeddings=numpy.array(self.embeddings),
)

BERTopic Version

0.16.0

SkylarOconnell added the bug (Something isn't working) label on Oct 25, 2024
MaartenGr (Owner) commented

You are using an older version of BERTopic and I remember that there were some fixes since then. Could you try it with the latest version instead? 0.16.4.
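For instance, pinning to that release with pip (a standard upgrade command, nothing BERTopic-specific):

pip install --upgrade bertopic==0.16.4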

SkylarOconnell (Author) commented

Got it, trying that now!

SkylarOconnell (Author) commented Oct 29, 2024

Just tried upgrading BERTopic to 0.16.4 and I still see the same issue.

Initial Training:
Item 1: Topic: 3 Probability: 0.9999968824322634
Item 2: Topic: 2 Probability: 0.883032787750728
Item 3: Topic: 4 Probability: 0.9902231709346468

Inference/Transform without saving:
Item 1: Topic: 3 Probability: 0.9999968824322634
Item 2: Topic: 2 Probability: 0.883032787750728
Item 3: Topic: 4 Probability: 0.9902231709346468

Inference/Transform after saving and redownloading using safetensors:
Item 1: Topic: 4 Probability: 0.9999788403511047
Item 2: Topic: 3 Probability: 0.9999911785125732
Item 3: Topic: 5 Probability: 0.999993085861206

All topics (except outliers) are coming out one higher than in the original run or in the original model without saving.
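A minimal sketch to quantify that shift, assuming original_topics and reloaded_topics (both hypothetical names) are the topic lists from the original fit and the reloaded model's transform(), aligned index-by-index:

import numpy as np

# original_topics: output of fit_transform() on the original model
# reloaded_topics: output of transform() on the reloaded model
orig = np.array(original_topics)
new = np.array(reloaded_topics)

# Ignore outlier documents (topic -1) and measure the +1 shift
mask = orig != -1
shifted = np.mean(new[mask] == orig[mask] + 1)
print(f"Non-outlier documents shifted by exactly +1: {shifted:.1%}")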

SkylarOconnell (Author) commented Oct 29, 2024

@MaartenGr I just tried saving with pytorch as well and got the same issue.

MaartenGr (Owner) commented

Hmmm, this is quite unexpected. I'm a bit baffled here, considering these probabilities are extremely high.

My guess would be that there is something going wrong with reducing outliers before updating and then saving the model. What would happen if you didn't reduce outliers?

SkylarOconnell (Author) commented

@MaartenGr Removing reduce outliers fixes the issue, and now I am getting the same results between the initial training and the inference run after downloading. Is there a way to keep reduce outliers, or is this a bug that would need to be fixed first?

MaartenGr (Owner) commented

@SkylarOconnell I'm not actually sure why this is happening. It could be that by reducing outliers so much, it distorts the newly created topic embeddings (topic_model.topic_embeddings_). You could choose to save the topic embeddings before outlier reduction, and then re-assign them after reducing outliers.

SkylarOconnell (Author) commented

@MaartenGr Could you provide an example for this? I'm not really sure how to do that.

MaartenGr (Owner) commented

@SkylarOconnell Sure!

# Track topic embeddings before reducing outliers
topic_embeddings = topic_model.topic_embeddings_

# Reduce outliers and update topics
new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)
self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
topic_model.topic_embeddings_ = topic_embeddings

When doing this, check that the old topic embeddings are correctly assigned, as I'm not sure whether this creates a shallow or deep copy.
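One way to settle the shallow-vs-deep-copy question is to copy explicitly and compare afterwards; a minimal sketch, assuming topic_embeddings_ is a numpy array (not confirmed in this thread):

import numpy as np

# Deep-copy the topic embeddings up front so later calls cannot
# mutate the saved array in place
topic_embeddings = np.array(topic_model.topic_embeddings_, copy=True)

# ... reduce_outliers() and update_topics() as above ...

# Check whether update_topics() actually changed the embeddings,
# then restore the originals
changed = not np.array_equal(topic_model.topic_embeddings_, topic_embeddings)
print(f"update_topics() modified topic_embeddings_: {changed}")
topic_model.topic_embeddings_ = topic_embeddings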

SkylarOconnell (Author) commented Nov 11, 2024

@MaartenGr Sorry for the delayed response.

When I add in the code above (changing topic_model to self.model since we are using class variables), it goes back to the original issue. Could it be a bug in the interaction between reduce_outliers and pytorch/safetensors? Reduce outliers works and transform works until I save in either format and redownload.

# Track topic embeddings before reducing outliers
topic_embeddings = self.model.topic_embeddings_

new_topics = self.model.reduce_outliers(
    self.docs,
    self.topics,
    probabilities=self.probabilities,
    strategy='probabilities',
)

self.model.update_topics(self.docs, topics=new_topics)

# Reassign old topic embeddings
self.model.topic_embeddings_ = topic_embeddings

MaartenGr (Owner) commented

I'm not sure if I understand correctly. Just to make sure:

  • You double-checked that self.model.topic_embeddings_ now holds the old topic embeddings, right? That way we can be sure the old topic embeddings are kept.
  • If so, you get the same issue as before, right? The one where topics do not match up? Could you check how many do not match up (a quick sketch for this check follows below)? It is not uncommon that only 70% or so matches up, since transform() uses a different assignment procedure than fit_transform().
  • Lastly, do you have a fully reproducible example that I can use, along with data? Otherwise, it's difficult for me to debug this without more info.
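A minimal sketch for that count, assuming self.topics (original fit) and new_model_temp_topics (reloaded transform) from the reproduction above align index-by-index:

# Count how many documents keep the same topic after save + reload
matches = sum(a == b for a, b in zip(self.topics, new_model_temp_topics))
total = len(self.topics)
print(f"{matches}/{total} documents match ({matches / total:.1%})")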

SkylarOconnell (Author) commented

I will double-check the first bullet and let you know. If the topic embeddings are the same as the old embeddings, I will run a quick count to see how many are off. I'll respond here once I am able to do so.
