Check beam search batch > 1 in ci #1606

Closed
wants to merge 3 commits into from
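For orientation, the scenario this PR wires into CI — beam search over a batch of more than one prompt — can be reproduced locally with a rough Python analogue of what the C++ beam-search sample exercises. The model directory and beam settings below are illustrative placeholders, not values taken from the workflow.

```python
import openvino_genai

# Placeholder model directory: export any supported chat LLM with optimum-cli first.
pipe = openvino_genai.LLMPipeline("./TinyLlama-1.1B-Chat-v1.0", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 30
config.num_beams = 4              # enable beam search
config.num_return_sequences = 2   # keep two beams per prompt

# A batch of more than one prompt, which is what the new check exercises.
prompts = ["Why is the Sun yellow?", "What is OpenVINO?"]
results = pipe.generate(prompts, config)
for text in results.texts:
    print(text)
```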
1,970 changes: 1,000 additions & 970 deletions .github/workflows/causal_lm_cpp.yml

Large diffs are not rendered by default.

58 changes: 29 additions & 29 deletions .github/workflows/job_vlm_sample_llava.yml
@@ -14,32 +14,32 @@ env:
l_u22_ov_link: https://storage.openvinotoolkit.org/repositories/openvino/packages/nightly/2025.1.0-17911-83c047443de/l_openvino_toolkit_ubuntu22_2025.1.0.dev20250116_x86_64.tgz

jobs:
visual_language_chat_sample-ubuntu-llava:
runs-on: ubuntu-22.04-16-cores
steps:
- uses: actions/checkout@v4
with:
submodules: recursive
- uses: actions/setup-python@v4
with:
python-version: 3.11
- uses: ./.github/actions/install_openvino
with:
ov_link: ${{ env.l_u22_ov_link }}
- uses: ./.github/actions/build_app
with:
build_target: 'visual_language_chat py_openvino_genai'
- uses: ./.github/actions/install_python_deps
- name: Download and convert model
run: |
source ./ov/setupvars.sh
optimum-cli export openvino --model ${{ inputs.model_id }} ./${{ inputs.model_dir }}
- name: Download images
run: |
wget https://llava-vl.github.io/static/images/monalisa.jpg
- name: Run visual_language_chat C++ sample
run: >
source ./ov/setupvars.sh
&& ./build/samples/cpp/visual_language_chat/visual_language_chat ./${{ inputs.model_dir }} monalisa.jpg
<<< $'Who drew this painting?\nWhen did the painter live?'
timeout-minutes: 4
# visual_language_chat_sample-ubuntu-llava:
# runs-on: ubuntu-22.04-16-cores
# steps:
# - uses: actions/checkout@v4
# with:
# submodules: recursive
# - uses: actions/setup-python@v4
# with:
# python-version: 3.11
# - uses: ./.github/actions/install_openvino
# with:
# ov_link: ${{ env.l_u22_ov_link }}
# - uses: ./.github/actions/build_app
# with:
# build_target: 'visual_language_chat py_openvino_genai'
# - uses: ./.github/actions/install_python_deps
# - name: Download and convert model
# run: |
# source ./ov/setupvars.sh
# optimum-cli export openvino --model ${{ inputs.model_id }} ./${{ inputs.model_dir }}
# - name: Download images
# run: |
# wget https://llava-vl.github.io/static/images/monalisa.jpg
# - name: Run visual_language_chat C++ sample
# run: >
# source ./ov/setupvars.sh
# && ./build/samples/cpp/visual_language_chat/visual_language_chat ./${{ inputs.model_dir }} monalisa.jpg
# <<< $'Who drew this painting?\nWhen did the painter live?'
# timeout-minutes: 4
10 changes: 5 additions & 5 deletions .github/workflows/linux.yml
@@ -267,12 +267,12 @@ jobs:
fail-fast: false
matrix:
test:
- name: 'Whisper'
cmd: 'tests/python_tests/test_whisper_pipeline.py tests/python_tests/test_whisper_pipeline_static.py'
- name: 'Cacheopt E2E'
cmd: 'tests/python_tests/test_kv_cache_eviction.py'
# - name: 'Whisper'
# cmd: 'tests/python_tests/test_whisper_pipeline.py tests/python_tests/test_whisper_pipeline_static.py'
# - name: 'Cacheopt E2E'
# cmd: 'tests/python_tests/test_kv_cache_eviction.py'
- name: 'LLM & VLM'
cmd: 'tests/python_tests --ignore tests/python_tests/test_whisper_pipeline.py --ignore tests/python_tests/test_kv_cache_eviction.py --ignore tests/python_tests/test_whisper_pipeline_static.py'
cmd: 'tests/python_tests --ignore tests/python_tests/test_whisper_pipeline.py --ignore tests/python_tests/test_kv_cache_eviction.py --ignore tests/python_tests/test_whisper_pipeline_static.py --ignore tests/python_tests/test_continuous_batching.py --ignore tests/python_tests/test_generation_config.py --ignore tests/python_tests/test_tokenizer.py --ignore tests/python_tests/test_vlm_pipeline.py'
defaults:
run:
shell: bash
1 change: 0 additions & 1 deletion README.md
@@ -133,7 +133,6 @@ from PIL import Image

# Choose GPU instead of CPU in the line below to run the model on Intel integrated or discrete GPU
pipe = openvino_genai.VLMPipeline("./InternVL2-1B", "CPU")
pipe.start_chat()

image = Image.open("dog.jpg")
image_data = np.array(image.getdata()).reshape(1, image.size[1], image.size[0], 3).astype(np.uint8)
2 changes: 1 addition & 1 deletion samples/cpp/text_generation/README.md
@@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat
./chat_sample <MODEL_DIR>
```
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add the chat template to tokenizer_config.json of your model.
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add the chat template to tokenizer_config.json of your model or update it by calling `pipe.get_tokenizer().set_chat_template(new_chat_template)`.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
2 changes: 1 addition & 1 deletion samples/python/text_generation/README.md
@@ -48,7 +48,7 @@ Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat
python chat_sample.py model_dir
```
#### Missing chat template
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add the chat template to tokenizer_config.json of your model.
If you encounter an exception indicating a missing "chat template" when launching the `ov::genai::LLMPipeline` in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add the chat template to tokenizer_config.json of your model or update it by calling `pipe.get_tokenizer().set_chat_template(new_chat_template)`.
The following template can be used as a default, but it may not work properly with every model:
```
"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",
7 changes: 4 additions & 3 deletions samples/python/text_generation/chat_sample.py
@@ -24,15 +24,16 @@ def main():
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()
# pipe.start_chat()
while True:
try:
prompt = input('question:\n')
except EOFError:
break
pipe.generate(prompt, config, streamer)
res = pipe.generate(prompt, max_new_tokens=30, apply_chat_template=False)
print(res)
print('\n----------')
pipe.finish_chat()
# pipe.finish_chat()


if '__main__' == __name__:
2 changes: 2 additions & 0 deletions src/README.md
@@ -73,6 +73,8 @@ output:
'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in'
```

>**Note**: The chat_template from tokenizer_config.json or from the tokenizer/detokenizer model is applied to the prompt automatically at the generation stage. To disable it, call `pipe.get_tokenizer().set_chat_template("")`.
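As a rough Python illustration of that note — the model path is a placeholder, and the `apply_chat_template` keyword follows the spelling used in the updated chat_sample.py:

```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("./TinyLlama-1.1B-Chat-v1.0", "CPU")  # placeholder path

# Default: the chat_template from tokenizer_config.json is applied to the prompt.
print(pipe.generate("Why is the Sun yellow?", max_new_tokens=30))

# Option 1: clear the template so prompts are tokenized as plain text from now on.
pipe.get_tokenizer().set_chat_template("")

# Option 2: keep the template but skip it for a single call.
print(pipe.generate("Why is the Sun yellow?", max_new_tokens=30, apply_chat_template=False))
```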

A simple chat in Python:
```python
import openvino_genai as ov_genai
4 changes: 4 additions & 0 deletions src/cpp/include/openvino/genai/generation_config.hpp
@@ -128,6 +128,8 @@ class OPENVINO_GENAI_EXPORTS GenerationConfig {

std::optional<AdapterConfig> adapters;

bool apply_chat_template = true;

/** @brief sets eos_token_id to tokenizer_eos_token_id if eos_token_id is less than 0.
* Otherwise verifies eos_token_id == tokenizer_eos_token_id.
*/
@@ -189,6 +191,8 @@ extern OPENVINO_GENAI_EXPORTS ov::Property<size_t> rng_seed;
static constexpr ov::Property<float> assistant_confidence_threshold{"assistant_confidence_threshold"};
static constexpr ov::Property<size_t> num_assistant_tokens{"num_assistant_tokens"};

static constexpr ov::Property<bool> apply_chat_template{"apply_chat_template"};

// Predefined Configs

OPENVINO_DEPRECATED("Please, use individual parameters instead of predefined configs. This method will be removed in 2026.0.0 release")
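A sketch of how the new field is expected to surface on the config object — the field name comes from this diff, while the Python attribute spelling is an assumption:

```python
import openvino_genai

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 30
# Field added by this PR; it defaults to True, set it to False to send the raw prompt.
config.apply_chat_template = False

pipe = openvino_genai.LLMPipeline("./TinyLlama-1.1B-Chat-v1.0", "CPU")  # placeholder path
print(pipe.generate("Why is the Sun yellow?", config))
```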
4 changes: 4 additions & 0 deletions src/cpp/include/openvino/genai/llm_pipeline.hpp
@@ -177,6 +177,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline {
* @param generation_config optional GenerationConfig
* @param streamer optional streamer
* @return DecodedResults decoded resulting text
* The chat_template will be applied to the prompt; run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it.
* To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
*/
DecodedResults generate(
StringInputs inputs,
@@ -191,6 +193,8 @@ class OPENVINO_GENAI_EXPORTS LLMPipeline {
* @param inputs input prompt or a vector of prompts
* @param properties properties
* @return DecodedResults decoded resulting text
* The chat_template will be applied to the prompt; run pipe.get_tokenizer().set_chat_template(custom_chat_template) to update it.
* To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
*/
template <typename... Properties>
util::EnableIfAllStringAny<DecodedResults, Properties...> generate(
3 changes: 3 additions & 0 deletions src/cpp/include/openvino/genai/tokenizer.hpp
@@ -221,6 +221,9 @@ class OPENVINO_GENAI_EXPORTS Tokenizer {
/// @param chat_template The new template to override with.
void set_chat_template(const std::string& chat_template);

/// Returns the current chat template, e.g. to check whether it is empty.
std::string get_chat_template() const;

// information about <bos>, <eos> tokens should be public,
// they are used at least in StreamerBase descendants
int64_t get_bos_token_id() const;
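A brief sketch of how the new accessor can be used to check the template status, assuming the Python Tokenizer binding exposes it under the same name:

```python
import openvino_genai

tokenizer = openvino_genai.Tokenizer("./TinyLlama-1.1B-Chat-v1.0")  # placeholder path

# get_chat_template() is the accessor added here; an empty string means no
# chat template was found in tokenizer_config.json or it was cleared.
if not tokenizer.get_chat_template():
    print("No chat template set; prompts will be tokenized as plain text.")
else:
    history = [{"role": "user", "content": "Why is the Sun yellow?"}]
    print(tokenizer.apply_chat_template(history, add_generation_prompt=True))
```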
8 changes: 8 additions & 0 deletions src/cpp/include/openvino/genai/visual_language/pipeline.hpp
@@ -98,6 +98,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/// @param generation_config A config to follow for text generation.
/// @param streamer A streamer to acquire intermediate result.
/// @return A string generated by a model.
/// The chat_template will be applied to the prompt; run pipe.set_chat_template(custom_chat_template) to update it.
/// To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
VLMDecodedResults generate(
const std::string& prompt,
const std::vector<ov::Tensor>& rgbs,
@@ -111,6 +113,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/// @param generation_config A config to follow for text generation.
/// @param streamer A streamer to acquire intermediate result.
/// @return A string generated by a model.
/// The chat_template will be applied to the prompt; run pipe.set_chat_template(custom_chat_template) to update it.
/// To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
VLMDecodedResults generate(
const std::string& prompt,
const ov::Tensor& rgb,
@@ -124,6 +128,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/// for its members, StreamerVariant a single image or multiple
/// images.
/// @return A string generated by a model.
/// The chat_template will be applied to the prompt; run pipe.set_chat_template(custom_chat_template) to update it.
/// To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
VLMDecodedResults generate(
const std::string& prompt,
const ov::AnyMap& config_map
@@ -137,6 +143,8 @@ class OPENVINO_GENAI_EXPORTS VLMPipeline {
/// @param ...properties ov::Property instances to be combined into
/// ov::AnyMap.
/// @return A string generated by a model.
/// The chat_template will be applied to the prompt; run pipe.set_chat_template(custom_chat_template) to update it.
/// To disable it in non-chat mode, set the custom chat template to "" or set generation_config.apply_chat_template to false.
template <typename... Properties>
util::EnableIfAllStringAny<VLMDecodedResults, Properties...> generate(
const std::string& prompt,
2 changes: 2 additions & 0 deletions src/cpp/include/openvino/genai/whisper_generation_config.hpp
@@ -97,6 +97,8 @@ class OPENVINO_GENAI_EXPORTS WhisperGenerationConfig : public GenerationConfig {
// A list containing the non-speech tokens that will be suppressed during generation.
std::vector<int64_t> suppress_tokens;

bool apply_chat_template = false;

void update_generation_config(const ov::AnyMap& config_map = {});

template <typename... Properties>
2 changes: 1 addition & 1 deletion src/cpp/src/debug_utils.hpp
@@ -12,7 +12,7 @@
template <typename T>
void print_array(T * array, size_t size) {
std::cout << " => [ ";
for (size_t i = 0; i < std::min(size, size_t(10)); ++i) {
for (size_t i = 0; i < size; ++i) {
std::cout << array[i] << " ";
}
std::cout << " ] " << std::endl;
1 change: 1 addition & 0 deletions src/cpp/src/generation_config.cpp
@@ -128,6 +128,7 @@ void GenerationConfig::update_generation_config(const ov::AnyMap& properties) {
read_anymap_param(properties, "logprobs", logprobs);
read_anymap_param(properties, "num_return_sequences", num_return_sequences);
read_anymap_param(properties, "adapters", adapters);
read_anymap_param(properties, "apply_chat_template", apply_chat_template);

// penalties
read_anymap_param(properties, "frequency_penalty", frequency_penalty);
21 changes: 19 additions & 2 deletions src/cpp/src/icontinuous_batching.cpp
@@ -2,6 +2,8 @@
// SPDX-License-Identifier: Apache-2.0

#include "icontinuous_batching.hpp"
#include "debug_utils.hpp"
#include "openvino/genai/tokenizer.hpp"

namespace ov::genai {

@@ -53,9 +55,22 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
} else {
input_ids.reserve(prompts.size());
timer.start();
for (const std::string& prompt : prompts) {
for (size_t i = 0; i < prompts.size(); i++) {
const std::string& prompt = prompts.at(i);
const auto encode_start = std::chrono::steady_clock::now();
input_ids.push_back(m_tokenizer.encode(prompt).input_ids);
ov::Tensor encoded_inputs;
if (sampling_params.at(i).apply_chat_template && !m_tokenizer.get_chat_template().empty()) {
ChatHistory history({{{"role", "user"}, {"content", prompt}}});
constexpr bool add_generation_prompt = true;
auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt);
encoded_inputs = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false)).input_ids;
} else {
// used when no chat_template was found in tokenizer_config.json or none was set
std::string str_input(prompt);
encoded_inputs = m_tokenizer.encode(str_input, ov::genai::add_special_tokens(true)).input_ids;
}
print_tensor("encoded_inputs", encoded_inputs);
input_ids.push_back(encoded_inputs);
tokenization_durations.emplace_back(PerfMetrics::get_microsec(std::chrono::steady_clock::now() - encode_start));
}
timer.end();
@@ -71,6 +86,8 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
auto& raw_counters = perf_metrics.raw_metrics;
raw_counters.tokenization_durations.emplace_back(tokenization_durations[i]);

print_array(res.m_generation_ids.at(0).data(), res.m_generation_ids.at(0).size());

std::vector<std::string> generated;
generated.reserve(res.m_generation_ids.size());
for (size_t idx = 0; idx < res.m_generation_ids.size(); ++idx) {
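The per-prompt decision added above boils down to the following; a rough Python rendering, where keyword spellings such as `add_special_tokens` are assumptions about the bindings rather than the exact C++ calls:

```python
def encode_prompt(tokenizer, prompt, apply_chat_template):
    # Mirrors the branch above: template only when requested and a template exists.
    if apply_chat_template and tokenizer.get_chat_template():
        history = [{"role": "user", "content": prompt}]
        templated = tokenizer.apply_chat_template(history, add_generation_prompt=True)
        # The template already inserts special tokens, so they are not added again.
        return tokenizer.encode(templated, add_special_tokens=False).input_ids
    # No template found in tokenizer_config.json (or templating disabled):
    # tokenize the raw prompt with special tokens.
    return tokenizer.encode(prompt, add_special_tokens=True).input_ids
```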
34 changes: 31 additions & 3 deletions src/cpp/src/llm_pipeline_stateful.cpp
@@ -9,6 +9,8 @@
#include "text_callback_streamer.hpp"
#include "utils.hpp"

#include "debug_utils.hpp"

namespace ov::genai {

StatefulLLMPipeline::StatefulLLMPipeline(
@@ -88,7 +90,21 @@ DecodedResults StatefulLLMPipeline::generate(

if (auto input_vector = std::get_if<std::vector<std::string>>(&inputs)) {
OPENVINO_ASSERT(!is_chat_conversation, "Can't chat with multiple prompts");
encoded_input = m_tokenizer.encode(*input_vector);
if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) {
std::cout << " input_vector apply_chat_template true " << std::endl;
std::vector<std::string> templated_input_vector;
for (auto& input : *input_vector) {
ChatHistory history({{{"role", "user"}, {"content", input}}});
constexpr bool add_generation_prompt = true;
auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt);
templated_input_vector.push_back(templated_prompt);
}
encoded_input = m_tokenizer.encode(templated_input_vector, ov::genai::add_special_tokens(false));
} else {
std::cout << " input_vector apply_chat_template false " << std::endl;
encoded_input = m_tokenizer.encode(*input_vector, ov::genai::add_special_tokens(true));
}
print_tensor("encoded_input", encoded_input.input_ids);
} else if (auto input_prompt = std::get_if<std::string>(&inputs)) {
std::string& prompt = *input_prompt;

@@ -104,7 +120,7 @@

m_history.push_back({{"role", "user"}, {"content", prompt}});
constexpr bool add_generation_prompt = true;
auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt);
auto new_templated_chat_history = m_tokenizer.apply_chat_template(m_history, add_generation_prompt);
// Do not add special tokens in chat scenario to be aligned with HF.
auto new_chat_tokens = m_tokenizer.encode(new_templated_chat_history, ov::genai::add_special_tokens(false));
auto prev_chat_tokens = m_tokenizer.encode(m_templated_chat_history, ov::genai::add_special_tokens(false));
@@ -157,7 +173,19 @@ DecodedResults StatefulLLMPipeline::generate(

// TODO: Forbid LoRA config change if we are in the chat mode, because it requires regenerating the history with LoRA applied
} else {
encoded_input = m_tokenizer.encode(prompt);
std::string& prompt = *input_prompt;
if (config.apply_chat_template && !m_tokenizer.get_chat_template().empty()) {
std::cout << " apply_chat_template true " << std::endl;
ChatHistory history({{{"role", "user"}, {"content", prompt}}});
constexpr bool add_generation_prompt = true;
auto templated_prompt = m_tokenizer.apply_chat_template(history, add_generation_prompt);
encoded_input = m_tokenizer.encode(templated_prompt, ov::genai::add_special_tokens(false));
} else {
// used when no chat_template was found in tokenizer_config.json or none was set
std::cout << " apply_chat_template false 1" << std::endl;
encoded_input = m_tokenizer.encode(prompt, ov::genai::add_special_tokens(true));
}
print_tensor("encoded_input", encoded_input.input_ids);
}
}
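The multi-prompt branch earlier in this file follows the same pattern, templating each prompt independently and then encoding the whole batch at once; roughly:

```python
def encode_batch(tokenizer, prompts, apply_chat_template):
    # Illustrative mirror of the std::vector<std::string> branch above;
    # keyword spellings are assumptions about the Python bindings.
    if apply_chat_template and tokenizer.get_chat_template():
        templated = [
            tokenizer.apply_chat_template(
                [{"role": "user", "content": prompt}], add_generation_prompt=True
            )
            for prompt in prompts
        ]
        return tokenizer.encode(templated, add_special_tokens=False)
    return tokenizer.encode(prompts, add_special_tokens=True)
```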
