blank labels issue with 2d documents Visualization #1920

Open
mahmawad opened this issue Apr 10, 2024 · 5 comments

@mahmawad

image

When I run topic modeling in a .py script, I get this issue.

@MaartenGr
Owner

Thanks for sharing but I am not familiar with your .py script. I will need a bit more information to understand what is happening here. Could you share your full code along with the version of BERTopic you are using?
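
For anyone answering the version question from inside the same environment that runs the script, a minimal sketch using only the standard library is shown here. The helper name `installed_versions` and the package list are illustrative assumptions, not part of BERTopic; running it in both the notebook kernel and the script's interpreter would show whether the two environments differ.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Assumed list of distributions relevant to this thread; adjust as needed.
    for name, ver in installed_versions(
        ["bertopic", "plotly", "umap-learn", "hdbscan", "sentence-transformers"]
    ).items():
        print(f"{name}: {ver or 'not installed'}")
```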

@mahmawad
Author

Thank you for replying.

The Llama 2 import and setup are the standard ones (not shown), and then I save the visualization using the write_html function:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(df_articles['PreprocessedText'].tolist(), show_progress_bar=True)


# ft = api.load('fasttext-wiki-news-subwords-300')
# 

# In[18]:


from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)


# In[19]:


reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)


# In[20]:


from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True,min_cluster_size=15)


# In[20]:


from sklearn.cluster import KMeans

#cluster_model = KMeans(n_clusters=6, random_state=42)
cluster_model = KMeans(random_state=42,n_clusters=11)


# In[21]:


from sklearn.feature_extraction.text import CountVectorizer

# Custom list of words to exclude
custom_exclude_words = ["world", "automotive", "post",'first','new','car','cars','vehicle','vehicles','say']
# Use only the custom words as stop words (the standard English stop words are not merged in here)
vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)


# In[22]:


from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration

# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)

# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert,
    "Llama2": llama2,
    "MMR": mmr,
}


# In[2]:

"""
import torch
print(torch.cuda.memory_summary(device=None, abbreviated=False))
torch.cuda.empty_cache()

"""
# In[23]:


topics_inp=df_articles['PreprocessedText'].tolist()


# In[24]:


from bertopic import BERTopic

topic_model = BERTopic(
  # Pipeline models
  embedding_model=embedding_model,
  vectorizer_model=vectorizer_model,
  umap_model=umap_model,
  hdbscan_model=cluster_model,  # the KMeans model is passed in place of HDBSCAN
  representation_model=representation_model,
  # ctfidf_model=ctfidf_model,
  # Hyperparameters
  top_n_words=10,
  verbose=True,
)

topics, probs = topic_model.fit_transform(topics_inp,embeddings)


# In[27]:


#topic_model.merge_topics(df_articles['PreprocessedText'].tolist(),[5,2])


# In[ ]:


# use one of the other topic representations, like KeyBERTInspired
#keybert_topic_labels = {topic: " | ".join(list(zip(*values))[0][:4]) for topic, values in topic_model.topic_aspects_["Llama2"].items()}
#topic_model.set_topic_labels(keybert_topic_labels)


# In[28]:


llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)


# In[29]:


topic_model.get_topic_info()


# In[30]:


# Visualize the documents in 2-dimensional space; the preprocessed texts are shown on hover
# NOTE: You can hide the hover with `hide_document_hover=True` which is especially helpful if you have a large dataset
viss = topic_model.visualize_documents(topics_inp, custom_labels=True, hide_annotations=False, hide_document_hover=False)
path_file = r"/home/amahmoud/workspace/vis_two_week_visul.html"
viss.write_html(path_file)

@MaartenGr
Owner

Could you check which labels you set in llama2_labels? It is possible that Llama 2 did not create all of the labels.
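
One way to run that check is sketched below, assuming `llama2_labels` is a plain list of strings as in the script above. The helper names `find_blank_labels` and `first_nonempty_line` and the sample generations are hypothetical. The sketch also shows one possible (unconfirmed) way blank labels can arise: taking `split("\n")[0]` of a generation that starts with a newline yields an empty string.

```python
def first_nonempty_line(text):
    """Return the first non-blank line of text, stripped, or '' if there is none."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def find_blank_labels(labels):
    """Return the positions of labels that are empty or whitespace-only."""
    return [i for i, label in enumerate(labels) if not label.strip()]


# Hypothetical generations standing in for the Llama 2 output of three topics.
generated = ["Electric vehicle sales", "\nBattery supply chain", "   "]

# Same extraction as in the script: keep everything before the first newline.
naive = [text.split("\n")[0] for text in generated]
# Alternative: skip leading blank lines before picking the label.
robust = [first_nonempty_line(text) for text in generated]

print(find_blank_labels(naive))   # [1, 2]
print(find_blank_labels(robust))  # [2]
```

If `find_blank_labels(llama2_labels)` returns any positions, those topics would be drawn with empty labels regardless of whether the code runs in a notebook or a script.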

@mahmawad
Author

I checked them, but I think the problem occurs when I run it in a .py script. It works well when I run it in a Jupyter Notebook, but I need it in a .py file so that it can be automated.

@MaartenGr
Owner

That's strange, as I believe the output is actually HTML and should not render differently in a Jupyter Notebook compared to a .py script.
