Alibaba-NLP/gte-Qwen2-1.5B-instruct #8
Comments
Having the same issue on my end.
Same issue here.
Thanks for reporting this. Do you have any idea why this error could arise, @michaelfeil?
@TimPietrusky Because flash-attn is not installed. Solution: you can use vLLM for the 7B model embeddings, since Qwen is a decoder model and not built for high-throughput embeddings; it's actually pretty annoying to support. You can always build your own Docker image, `pip install flash-attn` in it, and use that. I recommend installing it from Tri Dao's prebuilt wheels so you don't have to deal with nvcc.
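For illustration, here is a minimal offline sketch of the vLLM route suggested above. It assumes a vLLM version that recognizes Qwen2-based checkpoints as embedding (pooling) models; the exact constructor arguments and method names (`encode` vs. `embed`, the optional `task="embed"` flag) differ between vLLM releases, so treat this as a sketch rather than the worker's actual code:

```python
# Sketch only: assumes vLLM is installed and treats this checkpoint as an
# embedding/pooling model; API details vary across vLLM versions.
from vllm import LLM

llm = LLM(model="Alibaba-NLP/gte-Qwen2-1.5B-instruct", enforce_eager=True)

prompts = [
    "What is the capital of France?",
    "Paris is the capital of France.",
]

# For embedding models, encode() returns one embedding vector per prompt.
outputs = llm.encode(prompts)

for prompt, output in zip(prompts, outputs):
    embedding = output.outputs.embedding  # list of floats, hidden-size long
    print(f"{prompt!r} -> {len(embedding)}-dim embedding")
```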
@michaelfeil thank you, that makes total sense! @axeloh can you please take a look at the comment above? @pandyamarut we should update our README to mention that we don't support Qwen, so that users know to use our vllm-worker or something similar.
Hi 😄
I am trying to run the Alibaba-NLP/gte-Qwen2-1.5B-instruct model (https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) on RunPod serverless, using the Docker image runpod/worker-infinity-embedding:dev-cuda11.8.0. Upon incoming requests, the pod logs show this error:
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
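For context on where this comes from: the ImportError is raised by transformers when it checks the imports declared by the model's remote modeling code. A rough reproduction outside the worker, assuming `transformers` is installed and `flash_attn` is not, would look something like this:

```python
# Rough reproduction (assumption: flash_attn is absent from the environment).
# transformers verifies the imports of the model's remote modeling file and
# raises the ImportError shown above when flash_attn is missing.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    trust_remote_code=True,  # pulls in the custom modeling code
)
```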