
KeyError: 'topics_from' #2100

Closed
1 task done
KeeratKG opened this issue Jul 26, 2024 · 19 comments · Fixed by #2101
Labels
bug Something isn't working

Comments

@KeeratKG

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

When trying to run topics, probs = TM.fit_transform(docs), where docs is a list of strings (we want to cluster topics based on these strings), I run into the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 1
----> 1 topics, probs = TM.fit_transform(docs)

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File /usr/local/lib/python3.9/site-packages/bertopic/_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'
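
Reading the failing loop, _create_topic_vectors appears to expect each value in mappings to be a dict carrying "topics_from" and "topic_sizes" keys, roughly the shape below (the values are hypothetical, just to illustrate):

# Shape implied by the frames above; the numbers are made up.
mappings = {
    0: {"topics_from": [3, 7], "topic_sizes": [120, 45]},
    1: {"topics_from": [2], "topic_sizes": [80]},
}
# The KeyError suggests the inner dicts actually passed in are missing "topics_from".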

This happens after the following steps of training have already taken place:

2024-07-26 18:43:39,195 - BERTopic - Embedding - Transforming documents to embeddings.
Error displaying widget: model not found
2024-07-26 18:43:55,125 - BERTopic - Embedding - Completed ✓
2024-07-26 18:43:55,126 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-07-26 18:44:21,848 - BERTopic - Dimensionality - Completed ✓
2024-07-26 18:44:21,849 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-07-26 18:44:40,617 - BERTopic - Cluster - Completed ✓
2024-07-26 18:44:40,618 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-07-26 18:45:04,160 - BERTopic - Representation - Completed ✓
2024-07-26 18:45:04,171 - BERTopic - Topic reduction - Reducing number of topics

Reproduction

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from umap import UMAP
from hdbscan import HDBSCAN
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from collections import Counter

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

stopwords = list(stopwords.words('english'))


SENT_EMBEDDING = SentenceTransformer('all-MiniLM-L6-v2')
UMAP_MODEL = UMAP(n_neighbors=15, n_components=3, min_dist=0.05)
HDBSCAN_MODEL = HDBSCAN(min_cluster_size=15, prediction_data=True, gen_min_span_tree=True)
VECTORIZE_MODEL = CountVectorizer(ngram_range=(1,3), stop_words=stopwords, tokenizer=LemmaTokenizer())
ctfidf_model = ClassTfidfTransformer()
representation_model = MaximalMarginalRelevance(diversity=0.2)

TM = BERTopic(
    umap_model=UMAP_MODEL,
    hdbscan_model=HDBSCAN_MODEL,
    embedding_model=SENT_EMBEDDING,
    vectorizer_model=VECTORIZE_MODEL,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    language='english',
    calculate_probabilities=True,
    verbose=True,
    nr_topics='auto')

docs = ["The weather today is amazing", "It is quite unbearably hot today", "Oh this ice cream looks lovely", "Where are you?", "How are you?"] ## sample only 

topics, probs = TM.fit_transform(docs)

BERTopic Version

0.16.3

KeeratKG added the bug label Jul 26, 2024
@lichenzhen

I'm running into the same issue. The code was working three weeks ago.

MaartenGr added a commit that referenced this issue Jul 28, 2024
MaartenGr mentioned this issue Jul 28, 2024
@MaartenGr
Owner

I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.

@abhinavkulkarni

abhinavkulkarni commented Jul 28, 2024

This doesn't solve the problem for me. I did install from the branch: pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100.

I'm training the model the following way:

from pathlib import Path

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
topic_model = topic_model.fit(docs, embeds)
path = Path(f"{save_dir}/model.bin")
topic_model.save(path.as_posix(), serialization="pickle")

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 topic_model = train_model()

Cell In[10], line 30
     28 # Pass the above models to be used in BERTopic
     29 topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
---> 30 topic_model = topic_model.fit(docs, embeds)
     31 path = Path(f"{save_dir}/model.bin")
     32 topic_model.save(path.as_posix(), serialization="pickle")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330 
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File ~/miniconda3/envs/python=3.10/lib/python3.10/site-packages/bertopic/_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'

@ellenlnt

The fix did not work for me either, unfortunately!

@KlausikPL

I have the same problem when using nr_topics="auto".

@MaartenGr
Owner

Does anybody have a fully reproducible example (data included)? I ask because when I run the following after installing the fix from the related PR, I get no errors:

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# Extract abstracts to train on and corresponding titles
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:10_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(
    umap_model=umap_model, 
    hdbscan_model=hdbscan_model, 
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

@jlee9095

Dear MaartenGr, thank you for sharing the code. Unfortunately, it does not work in my case, where I use a pipeline to run BERTopic on non-English text data.

To be specific, I now have the same problem (KeyError: 'topics_from') whenever I try to use the BERTopic commands. The commands worked well several weeks ago, but I don't know why they do not work now.
Since my data is not written in English, I am using a pipeline with my pre-trained model, as shown below.

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

In this case, the suggested commands did not work. If I copy the suggested commands and run them as they are (in other words, if I do not use my original pipeline but use SentenceTransformer("all-MiniLM-L6-v2") instead), then the error below appears.


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [24], in <cell line: 7>()
      1 topic_model = BERTopic(
      2     umap_model=umap_model,
      3     hdbscan_model=hdbscan_model,
      4     nr_topics="auto",
      5     verbose=True
      6 )
----> 7 topic_model = topic_model.fit(documents, embeddings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:492, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    489     self._save_representative_docs(custom_documents)
    490 else:
    491     # Extract topics by calculating c-TF-IDF
--> 492     self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
    494 # Reduce topics
    495 if self.nr_topics:

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3983, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3981 logger.info("Representation - Extracting topics from clusters using representation models.")
   3982 documents_per_topic = documents.groupby(["Topic"], as_index=False).agg({"Document": " ".join})
-> 3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
   3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4194, in BERTopic._c_tf_idf(self, documents_per_topic, fit, partial_fit)
   4192     X = self.vectorizer_model.partial_fit(documents).update_bow(documents)
   4193 elif fit:
-> 4194     X = self.vectorizer_model.fit_transform(documents)
   4195 else:
   4196     X = self.vectorizer_model.transform(documents)

File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1330, in CountVectorizer.fit_transform(self, raw_documents, y)
   1322         warnings.warn(
   1323             "Upper case characters found in"
   1324             " vocabulary while 'lowercase'"
   1325             " is True. These entries will not"
   1326             " be matched with any documents"
   1327         )
   1328         break
-> 1330 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
   1332 if self.binary:
   1333     X.data.fill(1)

File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1220, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
   1218 vocabulary = dict(vocabulary)
   1219 if not vocabulary:
-> 1220     raise ValueError(
   1221         "empty vocabulary; perhaps the documents only contain stop words"
   1222     )
   1224 if indptr[-1] > np.iinfo(np.int32).max:  # == 2**31 - 1
   1225     if _IS_32BIT:

ValueError: empty vocabulary; perhaps the documents only contain stop words


What should I do to solve this problem? (Please understand that I cannot upload the data, but the KeyError still appears. Please help!)

@MaartenGr
Owner

@jlee9095 I'm a bit confused. Are you saying that you have two separate issues? You mentioned that running the code I provided did not work for you. Could you share your full code to showcase both issues? Also, I'm not able to reproduce the issue, so if you can reproduce it with dummy data (like the data I shared), I can more easily figure out what is wrong.

@KeeratKG
Author

KeeratKG commented Jul 30, 2024

@MaartenGr the fix #2101 works for me, thank you!
Happy to leave this issue open if y'all want to discuss more.

I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.

Yes please.

@jlee9095

@MaartenGr Thank you for your response. Yes, I have two separate issues. The errors that I uploaded above appear whenever I try to run your suggested commands as they are (that is, when using SentenceTransformer). As an alternative, if I try to use my original pipeline from Hugging Face, the error appears when running the embeddings = embedding_model.encode(documents, show_progress_bar=True) command. Below are the commands and the error for the second case.

(Commands for the case using the pipeline from Hugging Face)
import pandas as pd

docu = pd.read_csv('C:/Users/BERTopic/after_preprocessing.csv', engine='python')
len(docu)

documents = docu['text'].to_list()

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

embedding_model = pretrained_model
embeddings = embedding_model.encode(documents, show_progress_bar=True)

(Then, the error below appears:)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [20], in <cell line: 2>()
      1 embedding_model = pretrained_model
----> 2 embeddings = embedding_model.encode(documents, show_progress_bar=True)

AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'

I am sorry that I am struggling to find good example data, but I'll do my best to figure it out as well.

@jlee9095

jlee9095 commented Jul 30, 2024

@MaartenGr Hi, here are two cases that I tested using the example data.

[Case 1. Commands]

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue','sts')["train"]
abstracts = dataset['sentence1'][:1000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

Then, I got the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [7], in <cell line: 26>()
     19 # Pass the above models to be used in BERTopic
     20 topic_model = BERTopic(
     21     umap_model=umap_model,
     22     hdbscan_model=hdbscan_model,
     23     nr_topics="auto",
     24     verbose=True
     25 )
---> 26 topic_model = topic_model.fit(abstracts, embeddings)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    494 # Reduce topics
    495 if self.nr_topics:
--> 496     documents = self._reduce_topics(documents)
    498 # Save the top 3 most representative documents per topic
    499 self._save_representative_docs(documents)

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4345         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4346 elif isinstance(self.nr_topics, str):
-> 4347     documents = self._auto_reduce_topics(documents, use_ctfidf)
   4348 else:
   4349     raise ValueError("nr_topics needs to be an int or 'auto'! ")

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
   4500 self.topic_mapper_.add_mappings(mapped_topics)
   4501 documents = self._sort_mappings_by_frequency(documents)
-> 4502 self._extract_topics(documents, mappings=mappings)
   4503 self._update_topic_size(documents)
   4504 return documents

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
   3983 self.c_tf_idf_, words = self._c_tf_idf(documents_per_topic)
   3984 self.topic_representations_ = self._extract_words_per_topic(words, documents)
-> 3985 self._create_topic_vectors(documents=documents, embeddings=embeddings, mappings=mappings)
   3986 if verbose:
   3987     logger.info("Representation - Completed \u2713")

File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)
   4119 topic_embeddings_dict = {}
   4120 for topic_to, topics_from in mappings.items():
-> 4121     topic_ids = topics_from["topics_from"]
   4122     topic_sizes = topics_from["topic_sizes"]
   4123     if topic_ids:

KeyError: 'topics_from'


[Case 2. Commands]

from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

dataset = load_dataset('klue','sts')["train"]
abstracts = dataset['sentence1'][:1000]

from transformers.pipelines import pipeline

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

embedding_model = pretrained_model
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)

Then, I got the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [14], in <cell line: 17>()
     14 pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
     16 embedding_model = pretrained_model
---> 17 embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
     19 # Use sub-models
     20 umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)

AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'


How can I solve this problem? All your help will be greatly appreciated.

@MaartenGr
Owner

@jlee9095 The second example does not seem related to this particular issue. Generally, I would advise opening a new issue for that, but it seems that you are using the encode function, which is not supported for a Hugging Face pipeline. Please refer to the HF pipeline documentation on how to extract embeddings.
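
For reference, BERTopic also accepts a Hugging Face pipeline directly as embedding_model, in which case you never call encode yourself. A minimal sketch of both routes, assuming each document fits the model's maximum sequence length (the mean pooling in the second route is my own illustration, not an official recipe):

import numpy as np
from transformers.pipelines import pipeline
from bertopic import BERTopic

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")

# Route 1: let BERTopic call the pipeline internally; no .encode needed.
topic_model = BERTopic(embedding_model=pretrained_model, nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(documents)

# Route 2: pre-compute one vector per document by mean-pooling the token
# embeddings; the pipeline returns a nested list of shape [1, n_tokens, hidden].
embeddings = np.array([np.mean(pretrained_model(doc)[0], axis=0) for doc in documents])
topic_model = BERTopic(nr_topics="auto", verbose=True).fit(documents, embeddings)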

With respect to your first problem, it seems that the PR I linked resolves it. When you install that PR, make sure it is properly installed and that you are not using the official release.

@WJG100

WJG100 commented Jul 31, 2024

For the error KeyError: 'topics_from', I downgraded to version 0.16.0, which solved the problem for me.

@smbslt3

smbslt3 commented Aug 11, 2024

When I set the nr_topics="auto" parameter, I encounter the following error:

topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    # min_topic_size = 100,   # Split sentences "All"
    nr_topics="auto",  # Automatically detect the number of topics
    # nr_topics = 10, #40,   # Limit the total number of topics
    top_n_words=10,   # Use the top n words
    calculate_probabilities=True,
    umap_model=umap_model,  # Fix UMAP random state
    hdbscan_model=hdbscan_model  # Set HDBSCAN model
)

When I comment out the nr_topics="auto" line, the error does not occur. However, when I set this parameter to "auto", I get a KeyError: 'topics_from'. When I set nr_topics=10, the code runs properly.

@MaartenGr
Owner

@smbslt3 Have you tried the PR that I shared above? In my experience, it should fix the issue.

@Izaac-Thomas

@MaartenGr Hi Maarten! I can't speak on behalf of @smbslt3 but I was experiencing the same issue and the changes to bertopic.py in #2101 fixed the issue for me.

It may also be worth noting, for anybody still facing this issue: if you installed this library through pip and are trying to update with something along the lines of pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100 like @abhinavkulkarni was, this didn't actually update any code for me, and I had to manually change the few lines of code in my local site-packages folder in my Anaconda environment.

Once this change is included in an official release (0.16.4), I'd assume that simply running pip install bertopic==0.16.4 will fix the issue for anyone using pip who is still experiencing it.
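
A clean reinstall may also avoid the manual editing, something along these lines (untested on my end; adjust for your environment):

pip uninstall -y bertopic
pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100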

@Yif18

Yif18 commented Aug 15, 2024

I'm having the same issue, KeyError: 'topics_from'; my workaround is pip install bertopic==0.16.2.
It seems there is a problem with the new version 0.16.3, and I hope it is fixed in the next release.

@MaartenGr
Owner

To everyone facing this issue, make sure you do not have BERTopic installed before you run pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100. This should install the related PR (#2101) and solve the issue.

Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.
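
If you are unsure which build you are actually running, a quick sanity check (bertopic exposes both of these):

import bertopic
print(bertopic.__version__)  # the affected official release is 0.16.3
print(bertopic.__file__)     # shows which installation Python is importing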

MaartenGr added a commit that referenced this issue Aug 21, 2024
@kungmo

kungmo commented Sep 17, 2024

To everyone facing this issue, make sure you do not have BERTopic installed before you run pip install git+https://github.com/MaartenGr/BERTopic.git@fix_2100. This should install the related PR (#2101) and solve the issue.

Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.

I also had the same issue. Thanks to your help, I was able to fix it. Thank you. I hope this bug is solved in version 0.16.4.
