"doc_length" doesn't work with llama3.1 #2185

Open
1 task done
mjin990 opened this issue Oct 15, 2024 · 2 comments
Labels
bug Something isn't working

Comments

mjin990 commented Oct 15, 2024

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I am using BERTopic with llama3.1 for topic modelling. My texts are long, so I set doc_length in TextGeneration().

Error:

```
File "/home/bert/lib/python3.11/site-packages/bertopic/representation/_utils.py", line 57, in truncate_document
    return truncated_document
           ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'truncated_document' where it is not associated with a value
```
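For context, the traceback points at a function where the truncated result is only assigned inside tokenizer-specific branches. A minimal sketch of that failure pattern (a simplified stand-in, not BERTopic's actual code):

```python
def truncate_document(document, doc_length=None, tokenizer=None):
    """Simplified stand-in for BERTopic's truncation helper."""
    if doc_length is not None:
        if tokenizer == "char":
            truncated_document = document[:doc_length]
        elif tokenizer == "whitespace":
            truncated_document = " ".join(document.split()[:doc_length])
        # When doc_length is set but tokenizer is None, no branch above
        # runs, so `truncated_document` is never assigned ...
        return truncated_document  # ... and this line raises UnboundLocalError
    return document

try:
    truncate_document("some long text", doc_length=3000, tokenizer=None)
except UnboundLocalError as e:
    print(e)
```

This is why passing `doc_length` without a `tokenizer` fails immediately, regardless of the model used.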

Reproduction

Here is my code (`generator`, `prompt`, `embedding_model`, `umap_model`, `hdbscan_model`, and `nr_topics` are defined earlier in my script):

```python
from bertopic import BERTopic
from bertopic.representation import TextGeneration

llama3 = TextGeneration(generator, prompt=prompt, nr_docs=4, doc_length=3000)
representation_model = {"Llama3": llama3}
topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics=nr_topics,
    min_topic_size=10,
    verbose=True,
)
```

BERTopic Version

0.16.4

@MaartenGr
Owner

Thanks for sharing! I believe you also need to specify the tokenizer for it to work. There's also an open PR with a fix that I will check out later this week. That said, it should work once you specify the tokenizer.


mjin990 commented Oct 15, 2024

> Thanks for sharing! I believe you also need to specify the tokenizer for it to work. There's also a PR open for a fix that I will check out later this week. That said, should work by specifying tokenizer.

Thanks for your quick reply!
Yes, it works after adding tokenizer.
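For anyone landing here later, the fix amounts to passing `tokenizer` alongside `doc_length`. A sketch, reusing the `generator` and `prompt` from the reproduction above (the exact set of accepted tokenizer values is per BERTopic's docs; verify against your installed version):

```python
# Sketch of the fixed call (commented out since `generator`/`prompt`
# come from the reproduction script above):
#
#   llama3 = TextGeneration(
#       generator,
#       prompt=prompt,
#       nr_docs=4,
#       doc_length=3000,
#       tokenizer="whitespace",  # or "char", or a tokenizer object
#   )
#
# With tokenizer="whitespace", doc_length counts whitespace-separated
# tokens, so each document is truncated roughly like this:
def whitespace_truncate(document: str, doc_length: int) -> str:
    return " ".join(document.split()[:doc_length])

print(whitespace_truncate("a b c d e", 3))  # a b c
```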
