[Bug]: Run LLMs with OpenVINO GenAI Flavor on NPU #1216
Comments
Did you encounter this issue when using the latest version of OpenVINO™ GenAI? You can use the latest OpenVINO™ GenAI and run on NPU by following the steps in Run LLMs with OpenVINO GenAI Flavor on NPU.
In addition to using the latest OpenVINO GenAI, if you haven't exported the model with
Also note that for NPU, you should add
Unfortunately, there is an issue with using NPU for LLMs on Ubuntu. The NPU team is working on it; the issue is not with OpenVINO, but on the kernel/driver level. I am sorry you're running into this. We will keep you informed.
This is a 32GB MTL. The ticket was opened for TinyLlama, but the logs mention a group-quantized Qwen2-7B, which is in a completely different league.
@taikai-zz a new NPU driver was released today with a fix for LLMs on LNL: https://github.com/intel/linux-npu-driver/releases/tag/v1.10.1 Could you check whether that fixes the issue for you? We also had a new openvino-genai release this week, 2024.6, with performance improvements on NPU, so please upgrade with

Also note that for running larger LLMs (>4B) you should use per-channel quantization. This note will be added to the docs too; I'm mentioning it here because I see you're using a 7B model. Instead of group-size 128, you should specify group-size -1 (note the minus sign). This is an example from the docs for Llama-2-7b:
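A minimal sketch of a per-channel export, assuming the `optimum-cli export openvino` command from optimum-intel and its `--weight-format`, `--sym`, `--ratio` and `--group-size` options. The helper name, model ID and output directory are placeholders chosen for illustration, and this is not necessarily the exact command from the docs.

```python
import shlex


def build_export_command(model_id: str, output_dir: str, group_size: int = -1) -> list[str]:
    """Build an optimum-cli argument list for a symmetric INT4 export.

    group_size=-1 requests per-channel quantization, as recommended in
    this thread for models larger than 4B parameters on NPU.
    """
    return [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--weight-format", "int4",
        "--sym",                      # assumption: symmetric quantization
        "--ratio", "1.0",             # assumption: quantize all layers to INT4
        "--group-size", str(group_size),
        output_dir,
    ]


# Placeholder model ID and output directory, for illustration only.
command = build_export_command("meta-llama/Llama-2-7b-chat-hf", "Llama-2-7b-chat-hf")
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)` in an environment where optimum-intel is installed.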
I'm glad to hear the issue is fixed! To speed up model loading, please see the section on model caching in the document (the same one you screenshotted). Since model loading happens only once, it's also useful to measure inference time, by adding The
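One way to separate load time from inference time is a small timing wrapper. The sketch below is an illustration, not the snippet originally posted in this comment: `timed` and `run_on_npu` are hypothetical helpers, `.npucache` is a placeholder directory, and passing `CACHE_DIR` in the pipeline config is assumed to enable the model caching described in the NPU guide.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_on_npu(model_path: str, prompt: str) -> None:
    # Imported here so that `timed` stays usable without openvino_genai installed.
    import openvino_genai as ov_genai

    # Assumption: CACHE_DIR turns on model caching, so later loads are faster.
    config = {"CACHE_DIR": ".npucache"}
    pipe, load_s = timed(ov_genai.LLMPipeline, model_path, "NPU", config)
    text, gen_s = timed(pipe.generate, prompt, max_new_tokens=100)
    print(f"load: {load_s:.2f} s, generate: {gen_s:.2f} s")
    print(text)
```

Running `run_on_npu` twice with the same model directory should show whether the second load benefits from the cache, while the generate time reflects inference alone.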
I am trying to run the chat sample with NPU on Windows.
Hi @kmaki565! Could you please clarify the following:
Thanks!
Thank you for your support.
OpenVINO Version
Name: openvino
Version: 2024.4.0
Summary: OpenVINO(TM) Runtime
Home-page: https://docs.openvino.ai/2023.0/index.html
Author: Intel(R) Corporation
Author-email: [email protected]
License: OSI Approved :: Apache Software License
Location: /root/openvino_env/lib/python3.12/site-packages
Requires: numpy, openvino-telemetry, packaging
Required-by: openvino-tokenizers
Operating System
Ubuntu 24.04 LTS Linux ubuntu 6.8.0-48-generic
Device used for inference
NPU
Framework
None
Model used
TinyLlama
Issue description
Refer to official documentation:
https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html
This is my hardware information
import openvino_genai as ov_genai
help(ov_genai.LLMPipeline)
As the figure above shows, the help output does not list NPU as a supported device. I followed the instructions in the documentation and changed the device to NPU, but the result was empty. Switching to CPU or GPU works normally. Where did I go wrong?
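A quick way to narrow this down is to ask the OpenVINO runtime which devices it actually sees before building the pipeline. The following is a minimal sketch under stated assumptions: `find_device` and `main` are hypothetical helpers, the prompt is a placeholder, and `model_path` is expected to point at a model already exported for OpenVINO.

```python
from typing import Iterable, Optional


def find_device(available: Iterable[str], wanted: str = "NPU") -> Optional[str]:
    """Return the first device named `wanted` or `wanted.N`, else None."""
    for name in available:
        if name == wanted or name.startswith(wanted + "."):
            return name
    return None


def main(model_path: str) -> None:
    # Imported here so that `find_device` can be used without OpenVINO installed.
    import openvino as ov
    import openvino_genai as ov_genai

    devices = ov.Core().available_devices
    print("Available devices:", devices)

    device = find_device(devices, "NPU")
    if device is None:
        raise SystemExit("NPU is not reported by OpenVINO; check the NPU driver.")

    pipe = ov_genai.LLMPipeline(model_path, device)
    print(pipe.generate("What is OpenVINO?", max_new_tokens=64))
```

If NPU is missing from the printed list, the problem sits below OpenVINO GenAI (driver or kernel); if it is listed but generation returns nothing, the model export settings are the next thing to check.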
One more question: how can I check NPU utilization on Ubuntu, with a tool similar to nvidia-smi?