Zero shot topic modelling #2168
Comments
Thanks for sharing. Could you add the full error log? Without it, it is difficult for me to say where exactly it is going wrong. Also, did you make sure that the documents are a list and not a pandas Series? |
Full Error:
Data is in list of strings format like '[ |
@MaartenGr ?? |
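@ankitkr3 I want to help everyone out as much as possible on this repository but I should mention that I am just a single developer providing all this work for free. This means it might take me a couple of days to respond since I work on this in the evenings and weekends. Replying with ?? feels like my work and effort on this are not appreciated, so I ask you to be patient in the future. Regarding the issue, it seems that the structure of the embeddings is the main problem, which might be a result of either the format of the documents or the embedding model. Can you try it again without using embedding_model? This helps me understand whether it is the OpenAI backend that is the issue. Also, I see that you use df['title'].values, which gives back a numpy array and not a list if I'm not mistaken. If you indeed passed a list of strings, then I wonder whether you indeed used df['title'].values and not something else. Either way, perhaps using df['title'].values.tolist() solves the issue. |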
Extremely sorry if you felt that way, let me try it in that way and let you
know.
…On Wed, 9 Oct 2024 at 4:39 PM, Maarten Grootendorst < ***@***.***> wrote:
@ankitkr3 <https://github.com/ankitkr3> I want to help everyone out as
much as possible on this repository but I should mention that I am just a
single developer providing all this work for free. This means it might take
me a couple of days to respond since I work on this in the evenings and
weekends.
Replying with ?? feels like my work and effort on this are not
appreciated, so I ask you to be patient in the future.
Regarding the issue, it seems that the structure of the embeddings is the
main problem which might be a result of either the format of the documents
or the embedding model. Can you try it again without using embedding_model?
This helps me understand whether it is the openAI backend that is the issue.
Also, I see that you use df['title'].values which gives back a numpy
array and not a list if I'm not mistaken. If you indeed passed a list of
strings, then I wonder whether you indeed used df['title'].values and not
something else. Either way, perhaps using df['title'].values.tolist()
solves the issue.
|
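The suggestion above, passing a plain list of strings rather than a numpy array or pandas Series, can be sketched as a small pre-check. This is only an illustration: to_document_list is a hypothetical helper and not part of BERTopic. It relies on the fact that numpy arrays and pandas Series both expose a tolist() method.

```python
def to_document_list(docs):
    """Return docs as a plain list of strings (hypothetical helper, not a BERTopic API)."""
    if isinstance(docs, str):
        raise TypeError("expected a collection of documents, got a single string")
    # numpy arrays and pandas Series both provide tolist(); plain lists and tuples do not.
    if hasattr(docs, "tolist"):
        docs = docs.tolist()
    docs = list(docs)
    # Report the first few positions that hold something other than a string (e.g. None or NaN).
    bad = [i for i, doc in enumerate(docs) if not isinstance(doc, str)]
    if bad:
        raise TypeError(f"non-string documents at positions {bad[:5]}")
    return docs


# Usage with plain Python data; with pandas, df['title'].values would be passed instead.
titles = ("first title", "second title")
docs = to_document_list(titles)
print(type(docs).__name__, len(docs))
```

Running the check before fitting makes it easier to rule out the document format as the cause, which leaves the embedding model as the remaining suspect.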
@MaartenGr still getting the same error after trying with df['title'].values.tolist(), can there be anything else |
I can think of two other things. First, have you tried it with a non-OpenAI embedding model? So simply without using the embedding_model at all, for example. Second, there might be an issue with zero-shot topic modeling for which I just pushed a new release that includes a fix. Using BERTopic v0.16.4 might help. |
I'm with BERTopic v0.16.4 and the error persists |
@yanivc-jfrog Did you try it with a non-OpenAI embedding model? So simply without using the embedding_model at all, for example. Also, do you perhaps have a reproducible example that I can test locally? |
Yes, I did it with this embedding model "avsolatorio/GIST-small-Embedding-v0" using SentenceTransformer, but even "sentence-transformers/all-mpnet-base-v2" returned the same problem.
The above still returns that warning - and also never finishes (stopped manually after 3 minutes) |
topic_model = BERTopic() |
What warning? The OP mentions an error and not a warning. So how could the model then never finish if it encounters an error? Can you please create a full example, including the entire code and error log?
Have you tried installing BERTopic from a completely fresh environment? Based on your description, it seems there are issues with your environment. Starting new typically helps. |
Dear Maarten, I had a little trouble installing, but I got it to work. I am posting in case my experience helps you or other users. This is all with just out of the box defaults. I am using: I got a warning message almost identical to other posters here, but there was no traceback and there was no error:
I named the repo after you because PyCharm rejected Bertopic because the repo can't have the same name as a dependency.
At the beginning, I got the same message as everyone else:
Here is a complete explanation, courtesy of Google Gemini:
Then I got the same message others have shared, but I got it 5 times, which suggests the triggering condition occurs 5 times, but that's just a guess:
The thing is, I knew that environmental variable was not Python syntax, but it turned out ok once I found this solution on Stack Overflow:
All the warnings went away, and I got the expected result, but the OMP part of the message is still there. So, Maarten, I think this is coming from somewhere deep inside huggingface's tokenizer. Whatever it is, it might still work but it is wildly out of date. Hope this helps. |
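For readers wondering how a shell-style environment variable is set from Python, here is a minimal sketch. The thread does not preserve the variable name or the Stack Overflow snippet, so the name used below is an assumption: TOKENIZERS_PARALLELISM is the variable that the Hugging Face tokenizers library reads, and it may not be the one the comment above refers to.

```python
import os

# Assumption: the variable in question is TOKENIZERS_PARALLELISM (not confirmed by the thread).
# Set it before importing any library that loads the tokenizers, so the value is
# already in place when the tokenizers are first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Values in os.environ must be strings, which is why "false" is quoted rather than written as the Python literal False.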
@MalikRumi Thank you for sharing this and for taking the time to track this down to Either way, thanks for sharing your solution! It is greatly appreciated 😄 |
Have you searched existing issues? 🔎
Describe the bug
Getting a ValueError for no apparent reason:
ValueError: Found array with 0 sample(s) (shape=(0, 1536)) while a minimum of 1 is required.
I have checked that the embeddings are working fine for test results.
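The shape in the message, (0, 1536), describes an array with zero rows and 1536 columns reaching a step that needs at least one sample. Below is a minimal sketch of a sanity check that could be run on precomputed embeddings before fitting. check_embeddings is a hypothetical helper, not a BERTopic function, and it only inspects the embeddings passed to it, not any array the library builds internally.

```python
def check_embeddings(docs, embeddings):
    """Validate that embeddings hold one equally sized row per document.

    Hypothetical helper for illustration. Works on nested lists and on any
    array-like object that supports len() and iteration over rows.
    Returns (number of rows, number of dimensions).
    """
    n_docs = len(docs)
    n_rows = len(embeddings)
    if n_rows == 0:
        raise ValueError("embeddings are empty: 0 rows")
    if n_rows != n_docs:
        raise ValueError(f"{n_docs} documents but {n_rows} embedding rows")
    # Every row should have the same dimensionality (1536 in the error above).
    dims = {len(row) for row in embeddings}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding sizes: {sorted(dims)}")
    return n_rows, dims.pop()


# Usage with toy data: two documents, two 3-dimensional vectors.
docs = ["first title", "second title"]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embeddings(docs, embeddings))
```

If this check passes and the error still appears, the empty array is probably produced later in the pipeline rather than by the embedding step itself.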
Reproduction
BERTopic Version
0.16.3