OpenVINO™ GenAI is a flavor of OpenVINO™, aiming to simplify running inference of generative AI models. It hides the complexity of the generation process and minimizes the amount of code required.
NOTE: Make sure that you follow the version compatibility rules. Refer to the OpenVINO™ GenAI Dependencies for more information.
The OpenVINO™ GenAI flavor is available for installation via Archive and PyPI distributions. To install OpenVINO™ GenAI, refer to the Install Guide.
To build the OpenVINO™ GenAI library from source, refer to the Build Instructions.
OpenVINO™ GenAI depends on OpenVINO and OpenVINO Tokenizers.
When installing OpenVINO™ GenAI from PyPI, matching versions of OpenVINO and OpenVINO Tokenizers are installed with it (e.g. openvino==2024.3.0 and openvino-tokenizers==2024.3.0.0 are installed for openvino-genai==2024.3.0).
If you update one of the dependency packages (e.g. pip install openvino --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly), the versions might become incompatible because of ABI differences, and running OpenVINO GenAI can result in errors (e.g. ImportError: libopenvino.so.2430: cannot open shared object file: No such file or directory).
Package versions have the format <MAJOR>.<MINOR>.<PATCH>.<REVISION>. Only the <REVISION> part can vary while preserving ABI compatibility. Changing the <MAJOR>, <MINOR> or <PATCH> part might break the ABI.
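The rule above can be sketched as a small check. This is a minimal illustration, not part of OpenVINO™ GenAI: the helper names `abi_key`, `abi_compatible` and `installed_version` are hypothetical, and the check only compares version strings, it does not inspect the binaries themselves.

```python
from importlib import metadata


def abi_key(version):
    """Return the <MAJOR>.<MINOR>.<PATCH> part of a version string as a tuple."""
    return tuple(version.split(".")[:3])


def abi_compatible(version_a, version_b):
    """True when two versions differ at most in the <REVISION> part."""
    return abi_key(version_a) == abi_key(version_b)


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Versions taken from the example above.
print(abi_compatible("2024.3.0", "2024.3.0.0"))   # same MAJOR.MINOR.PATCH
print(abi_compatible("2024.3.0", "2024.4.0.0"))   # MINOR differs
```

In practice the version strings could come from `installed_version("openvino")` and `installed_version("openvino-genai")`, assuming both packages are installed in the current environment.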
GenAI, Tokenizers, and OpenVINO wheels for Linux on PyPI are compiled with _GLIBCXX_USE_CXX11_ABI=0 to cover a wider range of platforms. In contrast, C++ archive distributions for Ubuntu are compiled with _GLIBCXX_USE_CXX11_ABI=1. It is not possible to mix different Application Binary Interfaces (ABIs) because doing so results in a link error. This incompatibility prevents the use of, for example, OpenVINO from C++ archive distributions alongside GenAI from PyPI.
If you want to try OpenVINO GenAI with different dependency versions (that is, not the prebuilt archives or Python wheels), build the OpenVINO GenAI library from source.
- Installed OpenVINO™ GenAI
To use OpenVINO GenAI with models that are already in OpenVINO format, no additional Python dependencies are needed. To convert models with optimum-cli and to run the examples, install the dependencies in ./samples/requirements.txt:
# (Optional) Clone OpenVINO GenAI repository if it does not exist
git clone --recursive https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai
# Install Python dependencies
python -m pip install ./thirdparty/openvino_tokenizers/[transformers] --pre --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
python -m pip install --upgrade-strategy eager -r ./samples/requirements.txt
- A model in OpenVINO IR format
Download and convert a model with optimum-cli:
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
LLMPipeline is the main object used for decoding. You can construct it directly from the folder with the converted model. It automatically loads the main model, tokenizer, detokenizer and default generation configuration.
A simple example:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))
Calling generate with custom generation config parameters, e.g. config for grouped beam search:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
result = pipe.generate("The Sun is yellow because", max_new_tokens=100, num_beam_groups=3, num_beams=15, diversity_penalty=1.5)
print(result)
output:
'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in'
A simple chat in Python:
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path)
config = {'max_new_tokens': 100, 'num_beam_groups': 3, 'num_beams': 15, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)
pipe.start_chat()
while True:
    print('question:')
    prompt = input()
    # Type 'Stop!' to leave the chat loop.
    if prompt == 'Stop!':
        break
    print(pipe(prompt, max_new_tokens=200))
pipe.finish_chat()
Test to compare with Huggingface outputs
A simple example:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    std::cout << pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(256));
}
Using grouped beam search decoding:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 256;
    config.num_beam_groups = 3;
    config.num_beams = 15;
    config.diversity_penalty = 1.0f;
    std::cout << pipe.generate("The Sun is yellow because", config);
}
A simple chat in C++ using grouped beam search decoding:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string prompt;
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_beam_groups = 3;
    config.num_beams = 15;
    config.diversity_penalty = 1.0f;
    pipe.start_chat();
    // Loop until the user types "Stop!".
    for (;;) {
        std::cout << "question:\n";
        std::getline(std::cin, prompt);
        if (prompt == "Stop!")
            break;
        std::cout << "answer:\n";
        auto answer = pipe(prompt, config);
        std::cout << answer << std::endl;
    }
    pipe.finish_chat();
}
Streaming example with lambda function:
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    auto streamer = [](std::string word) {
        std::cout << word << std::flush;
        // The return flag indicates whether generation should be stopped.
        // false means continue generation.
        return false;
    };
    std::cout << pipe.generate("The Sun is yellow because", ov::genai::streamer(streamer), ov::genai::max_new_tokens(200));
}
Streaming with a custom class:
#include "openvino/genai/streamer_base.hpp"
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
class CustomStreamer: public ov::genai::StreamerBase {
public:
    bool put(int64_t token) {
        bool stop_flag = false;
        /*
        custom decoding/tokens processing code
        tokens_cache.push_back(token);
        std::string text = m_tokenizer.decode(tokens_cache);
        ...
        */
        return stop_flag;  // flag indicating whether generation should be stopped; if true, generation stops.
    }
    void end() {
        /* custom finalization */
    }
};
int main(int argc, char* argv[]) {
    CustomStreamer custom_streamer;
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    std::cout << pipe.generate("The Sun is yellow because", ov::genai::streamer(custom_streamer), ov::genai::max_new_tokens(200));
}
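The stop-flag contract used by both streaming examples (the callback receives each new piece of text and returns true to stop, false to continue) can be illustrated without a model. The sketch below is a toy: `fake_generate` and `stop_after_sentence` are hypothetical names and stand in for the pipeline and a user callback; they are not part of the OpenVINO™ GenAI API.

```python
def fake_generate(pieces, streamer):
    """Hypothetical stand-in for a generation loop; not part of OpenVINO GenAI."""
    emitted = []
    for piece in pieces:
        emitted.append(piece)
        # The callback sees each piece as soon as it is produced.
        if streamer(piece):
            break  # True means "stop generation"
    return "".join(emitted)


collected = []


def stop_after_sentence(piece):
    """Record the piece and request a stop once a sentence has ended."""
    collected.append(piece)
    return piece.endswith(".")


text = fake_generate(["The ", "Sun ", "is ", "hot.", " Very", " hot."], stop_after_sentence)
print(text)
```

The callback stops the loop after the first piece that ends a sentence, so the remaining pieces are never produced. A callback that always returns False lets generation run until the other stopping criteria are met.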
openvino_genai.PerfMetrics (referred to as PerfMetrics for simplicity) is a structure that holds performance metrics for each generate call. PerfMetrics holds fields with mean and standard deviation values for the following metrics:
- Time To the First Token (TTFT), ms
- Time per Output Token (TPOT), ms/token
- Generate total duration, ms
- Tokenization duration, ms
- Detokenization duration, ms
- Throughput, tokens/s
and:
- Load time, ms
- Number of generated tokens
- Number of tokens in the input prompt
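To make the first metrics in the list concrete, here is a minimal sketch that derives TTFT, TPOT and throughput from token timestamps. It rests on common definitions that are assumptions here: TTFT is the time from the start of the request to the first token, TPOT is the mean gap between subsequent tokens, and throughput is 1000 / TPOT. The function name `latency_metrics` and the timestamps are hypothetical, and the library may compute its values differently.

```python
import statistics


def latency_metrics(start_ms, token_times_ms):
    """Compute (TTFT, TPOT, throughput) from timestamps in milliseconds.

    Assumes at least two generated tokens so that a per-token gap exists.
    """
    ttft = token_times_ms[0] - start_ms
    # Gaps between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    tpot = statistics.fmean(gaps)
    throughput = 1000.0 / tpot  # tokens per second
    return ttft, tpot, throughput


# Illustrative timestamps only: first token at 40 ms, then one token every 4 ms.
ttft, tpot, throughput = latency_metrics(0.0, [40.0, 44.0, 48.0, 52.0])
print(f"TTFT: {ttft:.2f} ms, TPOT: {tpot:.2f} ms/token, Throughput: {throughput:.2f} tokens/s")
```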
Performance metrics are stored in the perf_metrics field of either DecodedResults or EncodedResults. In addition to the fields mentioned above, PerfMetrics has a member raw_metrics of type openvino_genai.RawPerfMetrics (referred to as RawPerfMetrics for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as median or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on PerfMetrics.
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
perf_metrics = result.perf_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
#include "openvino/genai/llm_pipeline.hpp"
#include <iomanip>
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    auto result = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    auto perf_metrics = result.perf_metrics;
    // Print all values with two digits after the decimal point.
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
    std::cout << "TTFT: " << perf_metrics.get_ttft().mean << " ms" << std::endl;
    std::cout << "TPOT: " << perf_metrics.get_tpot().mean << " ms/token " << std::endl;
    std::cout << "Throughput: " << perf_metrics.get_throughput().mean << " tokens/s" << std::endl;
}
output:
mean_generate_duration: 76.28
mean_ttft: 42.58
mean_tpot 3.80
Note: If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
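If mean and standard deviation are not enough, statistics such as the median or a percentile can be computed from raw duration values. The sketch below uses only the Python standard library and a hypothetical list of per-token durations. How such a list is obtained from raw_metrics is not shown, and no RawPerfMetrics field names are assumed.

```python
import statistics


def summarize(durations_ms):
    """Return the median and 90th percentile of a list of durations in milliseconds."""
    ordered = sorted(durations_ms)
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10, method="inclusive")[-1]
    return {"median": statistics.median(ordered), "p90": p90}


# Hypothetical per-token durations in milliseconds (illustrative values only).
token_durations_ms = [3.0, 4.0, 5.0, 6.0, 7.0]
summary = summarize(token_durations_ms)
print(f"median: {summary['median']:.2f} ms, p90: {summary['p90']:.2f} ms")
```

The "inclusive" method treats the data as the full population, so the percentile always lies between the smallest and largest observed values.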
Several perf_metrics objects can be added to each other. In that case, their raw_metrics are concatenated and the mean/std values are recalculated. This accumulates statistics from several generate() calls:
#include "openvino/genai/llm_pipeline.hpp"
#include <iomanip>
#include <iostream>
int main(int argc, char* argv[]) {
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");
    auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
    // Adding metrics accumulates the statistics of both generate() calls.
    auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics;
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
    std::cout << "TTFT: " << perf_metrics.get_ttft().mean << " ms" << std::endl;
    std::cout << "TPOT: " << perf_metrics.get_tpot().mean << " ms/token " << std::endl;
    std::cout << "Throughput: " << perf_metrics.get_throughput().mean << " tokens/s" << std::endl;
}
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
res_2 = pipe.generate(["Why Sky is blue because"], max_new_tokens=20)
perf_metrics = res_1.perf_metrics + res_2.perf_metrics
print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
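The accumulation described above (raw values concatenated, mean and standard deviation recalculated) can be modeled with a toy class. This is a sketch of the idea only: `ToyMetrics` is a hypothetical name, it tracks a single list of durations, and the use of the population standard deviation is an assumption rather than a statement about how PerfMetrics computes std.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class ToyMetrics:
    """Toy model of accumulating metrics; not the OpenVINO GenAI implementation."""
    raw_durations_ms: list = field(default_factory=list)

    def __add__(self, other):
        # Concatenate raw values; statistics are derived from them on demand.
        return ToyMetrics(self.raw_durations_ms + other.raw_durations_ms)

    @property
    def mean(self):
        return statistics.fmean(self.raw_durations_ms)

    @property
    def std(self):
        # Population standard deviation (an assumption for this sketch).
        return statistics.pstdev(self.raw_durations_ms)


first = ToyMetrics([2.0, 4.0])
second = ToyMetrics([6.0, 8.0])
combined = first + second
print(f"mean: {combined.mean:.2f} ms, std: {combined.std:.2f} ms")
```

Recomputing from the concatenated raw values matters for the standard deviation: each input here has a std of 1.0, while the combined data has a std of about 2.24, which could not be obtained by averaging the two.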
For more examples of how metrics are used, please refer to the Python benchmark_genai.py and C++ benchmark_genai samples.
For information on how OpenVINO™ GenAI works, refer to the How It Works Section.
For a list of supported models, refer to the Supported Models Section.