Zero shot topic modelling #2168

Open
1 task done
ankitkr3 opened this issue Oct 4, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@ankitkr3

ankitkr3 commented Oct 4, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I am getting a ValueError for no apparent reason:
ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.

I have checked that the embeddings work fine on test inputs.

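import openai
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend  # needed for OpenAIBackend below
from bertopic.representation import OpenAI
from sklearn.feature_extraction.text import CountVectorizer  # needed for CountVectorizer below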

# Then use the following
my_key = "12323m2em2rm,2lr,2.f,."
client = openai.OpenAI(api_key = my_key)

embedding_model = OpenAIBackend(client, "text-embedding-ada-002")


summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in a one statement in the following format:
topic: <description>
"""


# embedding_model = OpenAIBackend(client, "text-embedding-ada-002")

representation_model = OpenAI(client = client, model="gpt-4o", chat=True, prompt=summarization_prompt, 
                             nr_docs=5, delay_in_seconds=3)

vectorizer_model = CountVectorizer(min_df=1)
topic_model = BERTopic(
   embedding_model=embedding_model, 
   min_topic_size=25,
   zeroshot_topic_list=zeroshot_topic_list,
   zeroshot_min_similarity=0,
   representation_model=representation_model
)

topics, probs = topic_model.fit_transform(df['title'].values)

Reproduction

from bertopic import BERTopic

BERTopic Version

0.16.3

@ankitkr3 ankitkr3 added the bug Something isn't working label Oct 4, 2024
@MaartenGr
Owner

Thanks for sharing. Could you add the full error log? Without it, it is difficult for me to say where exactly it is going wrong. Also, did you make sure that the documents are a list and not a pandas Series?

@ankitkr3
Author

ankitkr3 commented Oct 7, 2024

Full Error:

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 topics, probs = topic_model.fit_transform(docs)

File ~/Library/Python/3.12/lib/python/site-packages/bertopic/_bertopic.py:457, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    453     documents, embeddings, assigned_documents, assigned_embeddings = self._zeroshot_topic_modeling(
    454         documents, embeddings
    455     )
    456     # Filter UMAP embeddings to only non-assigned embeddings to be used for clustering
--> 457     umap_embeddings = self.umap_model.transform(embeddings)
    459 if len(documents) > 0:  # No zero-shot topics matched
    460     # Cluster reduced embeddings
    461     documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File ~/Library/Python/3.12/lib/python/site-packages/umap/umap_.py:2935, in UMAP.transform(self, X, force_all_finite)
   2933     X = check_array(X, dtype=np.uint8, order="C", force_all_finite=force_all_finite)
   2934 else:
-> 2935     X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C", force_all_finite=force_all_finite)
   2936 x_hash = joblib.hash(X)
   2937 if x_hash == self._input_hash:

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/utils/validation.py:1087, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   1085     n_samples = _num_samples(array)
   1086     if n_samples < ensure_min_samples:
-> 1087         raise ValueError(
   1088             "Found array with %d sample(s) (shape=%s) while a"
   1089             " minimum of %d is required%s."
   1090             % (n_samples, array.shape, ensure_min_samples, context)
   1091         )
   1093 if ensure_min_features > 0 and array.ndim == 2:
   1094     n_features = array.shape[1]

ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.

The data is a list of strings, for example:
[
'Mahesh babu home tour | Mahesh babu dattata village |#maheshbabu #superstar',
'ROSHAN LATEST HIMACHALI SONG 2021 VINAY SAGAR ATUL SHARMA',
]
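
A possible reading of the traceback above, offered as an assumption and not a confirmed diagnosis: the frame at line 457 calls umap_model.transform on the embeddings left over after zero-shot assignment, and the shape (0, 1536) says that nothing was left over. With zeroshot_min_similarity=0, every document whose best cosine similarity to a zero-shot topic is non-negative would be assigned, which could empty that array. The plain-Python sketch below only illustrates that thresholding effect; split_by_similarity and the toy vectors are hypothetical and are not BERTopic's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def split_by_similarity(doc_vecs, topic_vecs, min_similarity):
    """Hypothetical helper: return (assigned, unassigned) document indices.

    A document counts as assigned when its best similarity to any
    predefined topic reaches min_similarity.
    """
    assigned, unassigned = [], []
    for i, doc in enumerate(doc_vecs):
        best = max(cosine(doc, topic) for topic in topic_vecs)
        (assigned if best >= min_similarity else unassigned).append(i)
    return assigned, unassigned


# Toy 2-D "embeddings"; real ones would have 1536 dimensions.
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]

# Threshold 0: every document is assigned, nothing is left to cluster.
assigned_all, leftover_none = split_by_similarity(doc_vecs, topic_vecs, 0.0)
# A stricter threshold leaves some documents for the clustering step.
assigned_some, leftover_some = split_by_similarity(doc_vecs, topic_vecs, 0.85)

print(len(leftover_none), len(leftover_some))
```

If this reading holds, raising zeroshot_min_similarity so that some documents stay unassigned would be one thing to try, alongside the upgrade suggested later in the thread.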

@ankitkr3
Author

ankitkr3 commented Oct 9, 2024

@MaartenGr ??

@MaartenGr
Owner

@ankitkr3 I want to help everyone out as much as possible on this repository but I should mention that I am just a single developer providing all this work for free. This means it might take me a couple of days to respond since I work on this in the evenings and weekends.

Replying with ?? feels like my work and effort on this are not appreciated, so I ask you to be patient in the future.

Regarding the issue, it seems that the structure of the embeddings is the main problem, which might be a result of either the format of the documents or the embedding model. Can you try it again without using embedding_model? This helps me understand whether the OpenAI backend is the issue.

Also, I see that you use df['title'].values which gives back a numpy array and not a list if I'm not mistaken. If you indeed passed a list of strings, then I wonder whether you indeed used df['title'].values and not something else. Either way, perhaps using df['title'].values.tolist() solves the issue.
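
A minimal sketch of the check suggested above, assuming the goal is simply to hand fit_transform a plain list of non-empty strings. The helper name to_document_list is hypothetical and not part of BERTopic; it relies only on the fact that NumPy arrays and pandas Series both expose .tolist().

```python
def to_document_list(docs):
    """Hypothetical helper: coerce an array-like of documents to a list of strings."""
    # NumPy arrays and pandas Series both provide .tolist().
    if hasattr(docs, "tolist"):
        docs = docs.tolist()
    docs = list(docs)

    # Report entries that are not usable as documents (None, NaN, numbers, blanks).
    bad = [i for i, d in enumerate(docs) if not isinstance(d, str) or not d.strip()]
    if bad:
        raise ValueError(f"Non-string or empty documents at positions: {bad[:10]}")
    return docs


# A tuple stands in for df['title'].values here, so the sketch runs without pandas.
titles = ("Mahesh babu home tour", "ROSHAN LATEST HIMACHALI SONG 2021")
docs = to_document_list(titles)
print(type(docs).__name__, len(docs))
```

Running the same helper on df['title'].values would also surface missing titles, which a bare .tolist() passes through silently.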

@ankitkr3
Author

ankitkr3 commented Oct 9, 2024 via email

@ankitkr3
Author

ankitkr3 commented Oct 9, 2024

@MaartenGr I am still getting the same error after trying df['title'].values.tolist(). Could anything else be causing it?

@MaartenGr
Owner

I can think of two other things.

First, have you tried it with a non-OpenAI embedding model? So simply without using the embedding_model at all for example.

Second, there might be an issue with zero-shot topic modeling for which I just pushed a new release that includes a fix. Using BERTopic v0.16.4 might help.

@yanivc-jfrog

I'm on BERTopic v0.16.4 and the error persists.

@MaartenGr
Owner

@yanivc-jfrog Did you try it with a non-OpenAI embedding model? So simply without using the embedding_model at all, for example.

Also, do you perhaps have a reproducible example that I can test locally?

@yanivc-jfrog

yanivc-jfrog commented Nov 8, 2024

Yes, I did it with this embedding model "avsolatorio/GIST-small-Embedding-v0" using SentenceTransformer, but even "sentence-transformers/all-mpnet-base-v2" returned the same problem.
I also tried removing all parameters and just called it with all the defaults.

topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform(['I am going home', 'I am going to the store'])

The above still returns that warning - and also never finishes (stopped manually after 3 minutes)

@yanivc-jfrog

topic_model = BERTopic()
topic, topic_proba = topic_model.fit_transform([
'I am going home',
'I am going to the supermarket',
'I am going to the gym',
'I am going to the store',
'I am going to the court for a legal issue',
'I am going to the groceries store',
'I am going to the football court',
'I am going to the basketball court',
'I am going to the supermarket',
'I am going to the gym',
])
The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4

@MaartenGr
Owner

@yanivc-jfrog

> The above still returns that warning - and also never finishes (stopped manually after 3 minutes)

What warning? The OP mentions an error, not a warning. And how could the model never finish if it encounters an error?

Can you please create a full example, including the entire code and error log?

> The above works for me in Google Colab (without a warning), but not in my Jupyter notebook locally (warning + stuck forever), though both (Colab & my local notebook) have v0.16.4

Have you tried installing BERTopic from a completely fresh environment? Based on your description, it seems there are issues with your environment. Starting new typically helps.
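
Before rebuilding the environment, it may help to compare installed versions between Colab and the local notebook, since the same BERTopic release can sit on top of different dependency versions. The sketch below uses only the standard library; report_versions is a hypothetical helper, and the package list is an assumption about which distributions are worth comparing.

```python
from importlib import metadata


def report_versions(packages):
    """Hypothetical helper: map each distribution name to its installed version."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Assumed list of relevant distributions; adjust as needed.
PACKAGES = [
    "bertopic",
    "umap-learn",
    "hdbscan",
    "scikit-learn",
    "sentence-transformers",
    "tokenizers",
    "numpy",
]

for name, version in report_versions(PACKAGES).items():
    print(f"{name}=={version}")
```

Running this in both environments and diffing the output gives a concrete starting point if a fresh install does not resolve the hang.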

@MalikRumi

Dear Maarten,

I had a little trouble installing, but I got it to work. I am posting in case my experience helps you or other users. This is all with just out of the box defaults. I am using:
Poetry (version 1.8.2)
pip 24.3.1
Python 3.12
PyCharm 2024.3
bertopic-0.16.4
Apple Silicon M1
I have mlx but it is not installed in this repo.

I got a warning message almost identical to other posters here, but there was no traceback and there was no error:

"Process finished with exit code 0"

I named the repo after you because PyCharm rejected Bertopic because the repo can't have the same name as a dependency.

/Projects/maarten/quickstart.py:

At the beginning, I got the same message as everyone else:

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
I tracked this down to documentation from 2018!!:

https://www.openmp.org/spec-html/5.0/openmpsu125.html
OPENMP API Specification: Version 5.0 November 2018
3.2.16 omp_set_max_active_levels
Summary
The omp_set_max_active_levels routine limits the number of nested active parallel regions on the device, by setting the max-active-levels-var ICV

C/C++: void omp_set_max_active_levels(int max_levels);

FORTRAN: subroutine omp_set_max_active_levels(max_levels)
integer max_levels

Copyright ©1997-2018 OpenMP Architecture Review Board.

Here is a complete explanation, courtesy of Google Gemini:

OpenMP, or Open Multi-Processing, is an Application Programming Interface (API) that allows developers to write parallel programs in C, C++, and Fortran:
What it does
OpenMP is a set of compiler directives, library routines, and environment variables that support shared-memory parallel programming. It's designed to be portable and scalable, and can be used on many platforms, including Linux, macOS, Windows, and Solaris.
How it works
OpenMP uses the fork-join model of parallel execution. A single master thread executes sequentially until it encounters a parallel region, at which point it creates a team of parallel threads. When the parallel region is complete, the team threads synchronize and terminate, leaving only the master thread to execute sequentially.
How it's managed
The OpenMP Architecture Review Board (OpenMP ARB) is a nonprofit technology consortium that manages OpenMP. The OpenMP ARB is made up of representatives from many major computer hardware and software vendors, including AMD, IBM, Intel, and Nvidia.
Benefits
OpenMP provides a simple and flexible interface for developing parallel applications. It's also portable, so you can use the same code with different compilers without changing the source code.
OpenMP 6.0 is scheduled for release in November 2024.

Then I got the same message others have shared, but I got it 5 times, which suggests the triggering condition occurs 5 times, but that's just a guess:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:

  • Avoid using tokenizers before the fork if possible
  • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

The thing is, I knew that the environment variable line was not Python syntax, but it turned out OK once I found this solution on Stack Overflow:

https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning
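
The linked answer is not reproduced in this thread, so the following is a sketch of the usual way to set that variable from Python, assuming the fix is to define TOKENIZERS_PARALLELISM in the process environment. Placing it ahead of the BERTopic import is a common convention, not something confirmed here as required.

```python
import os

# Set the variable before any library that uses huggingface/tokenizers is imported.
# "false" disables tokenizer parallelism and silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Import BERTopic (or sentence_transformers) only after the line above, e.g.:
# from bertopic import BERTopic

print(os.environ["TOKENIZERS_PARALLELISM"])
```

The same effect can be had by exporting the variable in the shell or in the IDE's run configuration before launching the script.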

All the warnings went away, and I got the expected result, but the OMP part of the message is still there. So, Maarten, I think this is coming from somewhere deep inside huggingface's tokenizer. Whatever it is, it might still work but it is wildly out of date.

Hope this helps.

@MaartenGr
Owner

@MalikRumi Thank you for sharing this and for taking the time to track this down to tokenizers. I had seen that environment variable before (which needed to be set in Kaggle notebooks I believe), but as you mentioned that was many years ago. Strange that it pops up now.

Either way, thanks for sharing your solution! It is greatly appreciated 😄

4 participants