KeyError: 'topics_from' #2100
Comments
I'm running into the same issue. The code was working three weeks ago.
I just created a PR that should resolve this issue, could you test whether it works for you? If so, I will go ahead and create a new release (0.16.4) since this affects the core functionality of BERTopic.
This doesn't solve the problem for me. I did install from the branch. I'm training the model the following way:

```python
from pathlib import Path

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, nr_topics="auto")
topic_model = topic_model.fit(docs, embeds)

path = Path(f"{save_dir}/model.bin")
topic_model.save(path.as_posix(), serialization="pickle")
```

I get the following error:
The fix did not work for me either unfortunately!
I have the same problem when using `nr_topics="auto"`.
Does anybody have a fully reproducible example (data included)? I ask because when I run the following after installing the fix from the related PR, I get no errors:

```python
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# Extract abstracts to train on and corresponding titles
dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:10_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
    verbose=True
)
topic_model = topic_model.fit(abstracts, embeddings)
```
Dear MaartenGr, thank you for sharing the code. Unfortunately, it does not work when using a pipeline to run BERTopic on non-English text data. To be specific, I now get the same `KeyError: 'topics_from'` whenever I try to use the BERTopic commands. The commands worked well several weeks ago, but I don't know why they fail now.

```python
from transformers.pipelines import pipeline
pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
```

In this case, the suggested commands did not work. If I copy the suggested commands and run them in my Python as they are (in other words, if I do not use my original pipeline but use `SentenceTransformer("all-MiniLM-L6-v2")` instead), then the error appears like below:

```
ValueError                                Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:492, in BERTopic.fit_transform(self, documents, embeddings, images, y)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3983, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4194, in BERTopic._c_tf_idf(self, documents_per_topic, fit, partial_fit)
File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1330, in CountVectorizer.fit_transform(self, raw_documents, y)
File ~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:1220, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)

ValueError: empty vocabulary; perhaps the documents only contain stop words
```

What should I do to solve this problem? (Please understand that I cannot upload the data, but the KeyError still appears. Please help!)
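As context for the comment above: the `empty vocabulary` error is a separate problem from the `KeyError`. It means scikit-learn's `CountVectorizer` found no usable tokens after tokenization (its default token pattern keeps only tokens of two or more word characters, which can empty out some corpora). A minimal sketch of the failure and one possible remedy, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a b c", "x y z"]  # documents containing only single-character tokens

# The default token pattern requires tokens of 2+ word characters,
# so every token here is discarded and the vocabulary ends up empty.
try:
    CountVectorizer().fit_transform(docs)
except ValueError as err:
    print(err)  # empty vocabulary; perhaps the documents only contain stop words

# A looser token pattern keeps single-character tokens as well.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, 6): two documents, six distinct tokens
```

In BERTopic, such a customized vectorizer can be supplied via `BERTopic(vectorizer_model=vectorizer)`.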
@jlee9095 I'm a bit confused. Are you saying that you have two separate issues? Because you mentioned that running the code I provided did not work for you. Could you share your full code to showcase both issues? Also, I'm not able to reproduce the issue, so if you can reproduce it with dummy data (like the data I shared), I can more easily figure out what is wrong.
@MaartenGr the fix #2101 works for me, thank you!
Yes please. |
@MaartenGr Thank you for your response. Yes, I have two separate issues. The errors that I uploaded above appear whenever I run your suggested commands as they are (that is, when using `SentenceTransformer`). As an alternative, if I use my original pipeline from Hugging Face, the error appears when running the `embeddings = embedding_model.encode(documents, show_progress_bar=True)` command. Below are the commands and the error for the second case.

Commands for the case using the pipeline from Hugging Face:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from transformers.pipelines import pipeline

docu = pd.read_csv('C:/Users/BERTopic/after_preprocessing.csv', engine='python')
documents = docu['text'].to_list()

pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
embedding_model = pretrained_model
```

Then, the below error appears:

```
AttributeError                            Traceback (most recent call last)
AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'
```

I am sorry that I am struggling to find a good example dataset, but I'll do my best to figure it out as well.
@MaartenGr Hi, here are two cases that I tested using the example data.

Case 1. Commands:

```python
from sentence_transformers import SentenceTransformer

dataset = load_dataset('klue', 'sts')["train"]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
topic_model = BERTopic(
```

Then, I got the error like below:

```
KeyError                                  Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:496, in BERTopic.fit_transform(self, documents, embeddings, images, y)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4347, in BERTopic._reduce_topics(self, documents, use_ctfidf)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4502, in BERTopic._auto_reduce_topics(self, documents, use_ctfidf)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:3985, in BERTopic._extract_topics(self, documents, embeddings, mappings, verbose)
File ~\anaconda3\lib\site-packages\bertopic\_bertopic.py:4121, in BERTopic._create_topic_vectors(self, documents, embeddings, mappings)

KeyError: 'topics_from'
```

Case 2. Commands:

```python
from sentence_transformers import SentenceTransformer
from transformers.pipelines import pipeline

dataset = load_dataset('klue', 'sts')["train"]
pretrained_model = pipeline("feature-extraction", model="beomi/kcbert-base")
embedding_model = pretrained_model
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
topic_model = BERTopic(
```

Then, I got the error like below:

```
AttributeError                            Traceback (most recent call last)
AttributeError: 'FeatureExtractionPipeline' object has no attribute 'encode'
```

How can I solve this problem? All your help will be greatly appreciated.
@jlee9095 The second example does not seem related to this particular issue. Generally, I would advise opening up a new issue for that, but it seems that you are using the

With respect to your first problem, it seems that the PR I linked resolves it. When you install that PR, make sure that it is properly installed and that you are not using the official release.
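On the second error above: a `transformers` feature-extraction pipeline is callable but has no `.encode()` method, so calling `embedding_model.encode(...)` on it fails. BERTopic's documentation says a pipeline can be passed directly as `embedding_model`, in which case no manual `.encode()` call is needed. If precomputed embeddings are wanted anyway, one workaround is a small adapter that mean-pools the pipeline's per-token vectors into one vector per document. This is a sketch; `EncoderAdapter` is a hypothetical helper, not part of BERTopic:

```python
class EncoderAdapter:
    """Hypothetical wrapper that gives a callable feature-extraction
    pipeline an .encode() method by mean-pooling per-token vectors."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def encode(self, docs, show_progress_bar=False):
        vectors = []
        for doc in docs:
            # A feature-extraction pipeline returns [[tok_vec, tok_vec, ...]]
            tokens = self.pipeline(doc)[0]
            dim = len(tokens[0])
            # Mean-pool the token vectors into a single document vector
            vectors.append([sum(tok[i] for tok in tokens) / len(tokens)
                            for i in range(dim)])
        return vectors
```

With the thread's setup this would be used as `EncoderAdapter(pipeline("feature-extraction", model="beomi/kcbert-base")).encode(documents)`, though passing the pipeline straight to `BERTopic(embedding_model=...)` is the simpler route when it works.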
For the error `KeyError: 'topics_from'`, I downgraded to version 0.16.0, which solved the problem.
When I set

```python
topic_model = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    # min_topic_size = 100,       # Split sentences "All"
    nr_topics="auto",             # Automatically detect the number of topics
    # nr_topics = 10, #40,        # Limit the total number of topics
    top_n_words=10,               # Use the top n words
    calculate_probabilities=True,
    umap_model=umap_model,        # Fix UMAP random state
    hdbscan_model=hdbscan_model   # Set HDBSCAN model
)
```

When I comment out the line
@smbslt3 Have you tried the PR that I shared above? In my experience, it should fix the issue. |
@MaartenGr Hi Maarten! I can't speak on behalf of @smbslt3, but I was experiencing the same issue, and the changes to `_bertopic.py` in #2101 fixed it for me.

It may also be worth noting, for anybody still facing this issue, that if you installed this library through pip and are trying to update by doing something along the lines of

Once this change is included in an official release (0.16.4), I'd assume that simply running
I'm having the same issue (`KeyError: 'topics_from'`); my workaround is `pip install bertopic==0.16.2`.
To everyone facing this issue: make sure you do not have BERTopic installed before you run

Based on this thread, I can confirm that if the PR is correctly installed, it should solve the issue. I intend to release a new version whenever #2105 is also merged into the main branch.
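A possible sequence for a clean install of the fix before the official release is sketched below. The `refs/pull/2101/head` ref is an assumption about how the PR branch can be fetched with pip's git support; verify the exact branch or ref on the PR page:

```shell
# Remove any existing BERTopic first, so pip does not keep the old version
pip uninstall -y bertopic

# Install directly from the PR's head ref on GitHub (ref syntax is an assumption)
pip install "git+https://github.com/MaartenGr/BERTopic.git@refs/pull/2101/head"

# Confirm which version is now active
python -c "import bertopic; print(bertopic.__version__)"
```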
I also had the same issue. Thanks to your help, I was able to fix it. Thank you. I hope this bug is fixed in the 0.16.4 release.
Have you searched existing issues? 🔎
Describe the bug
When trying to run

```python
topics, probs = TM.fit_transform(docs)
```

where `docs` is a list of strings (we want to cluster topics based on these strings), I run into the following error:

This happens after the following steps of training have already taken place:
Reproduction
BERTopic Version
0.16.13
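For readers unfamiliar with the traceback's final line: `KeyError: 'topics_from'` is an ordinary dictionary lookup failing inside BERTopic's topic-reduction step. A generic illustration of the failure mode follows; the `mappings` dict here is illustrative only, not BERTopic's actual data structure:

```python
# One step builds a mapping; a later step assumes a key that was never set.
mappings = {"topics_to": [0, 1, 2]}  # 'topics_from' is missing

try:
    sources = mappings["topics_from"]
except KeyError as err:
    print(f"KeyError: {err}")  # KeyError: 'topics_from'

# A defensive lookup avoids the crash, though the proper fix is presumably
# to ensure the key is populated in the first place (as the PR does).
sources = mappings.get("topics_from", [])
print(sources)  # []
```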