Add multi prompt support for beam search #349

Conversation

as-suvorov
Contributor

No description provided.

for (int prompt_id = 0; prompt_id < promts_size; prompt_id++) {
    const std::vector<int64_t> prompt = parameters.prompts[prompt_id];
    std::vector<Group>& groups = prompts_groups[prompt_id];
    auto [prompt_next_tokens, prompt_next_beams] = select_prompt_next_tokens(logits, prompt, groups);
Contributor

Should we explicitly handle the case where some prompts finish earlier than others? For example, prompt_next_tokens can be empty for some prompts.

Contributor Author

Good point, attention_mask should be handled differently. I updated the implementation; attention_mask is now set based on beam_id.
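
To illustrate the point discussed above, here is a minimal sketch of building the next step's batch when some prompts have already finished. It is not the code from this PR: `PromptSelection`, `NextBatch`, and `build_next_batch` are hypothetical names, plain `std::vector` stands in for `ov::Tensor`, and it assumes that each beam index refers to a row of the current batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-prompt selection result: the tokens chosen for the next step
// and, for each token, the index of the beam (row of the current batch) it extends.
struct PromptSelection {
    std::vector<int64_t> next_tokens;
    std::vector<int32_t> next_beams;
};

// Hypothetical container for the inputs of the next inference step.
struct NextBatch {
    std::vector<int64_t> input_ids;                    // one token per surviving beam
    std::vector<std::vector<int64_t>> attention_mask;  // one mask row per surviving beam
};

// Builds the next batch. A prompt whose selection is empty (it finished earlier)
// contributes no rows. Every new mask row is a copy of the mask row of the beam
// it extends, with a 1 appended for the newly generated token.
NextBatch build_next_batch(const std::vector<PromptSelection>& selections,
                           const std::vector<std::vector<int64_t>>& current_mask) {
    NextBatch next;
    for (const PromptSelection& selection : selections) {
        assert(selection.next_tokens.size() == selection.next_beams.size());
        for (std::size_t i = 0; i < selection.next_tokens.size(); ++i) {
            const auto beam = static_cast<std::size_t>(selection.next_beams[i]);
            assert(beam < current_mask.size());
            std::vector<int64_t> row = current_mask[beam];
            row.push_back(1);
            next.input_ids.push_back(selection.next_tokens[i]);
            next.attention_mask.push_back(std::move(row));
        }
    }
    return next;
}
```

Indexing the mask by beam rather than by prompt keeps each row consistent with the sequence it extends, which matters once beams from the same prompt are reordered or dropped.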

@as-suvorov as-suvorov marked this pull request as ready for review April 10, 2024 11:17
@pavel-esir pavel-esir requested review from pavel-esir and olpipi April 10, 2024 12:13
@pavel-esir pavel-esir mentioned this pull request Apr 11, 2024
7 tasks
}

std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::vector<std::string> prompts) {
    tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {prompts.size()}, prompts.data()});
Contributor

Side note: currently we run batched inference with the tokenizer, which creates an attention mask for us that we need to "parse" later. Alternatively, we could tokenize the prompts one by one, which would return raw (unpadded) data that we could use more directly to fill position_ids, etc.

I'm not sure which solution is better, so let's stick with the current one because it's already implemented.
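
The two options in the side note can be contrasted with a small sketch. This is an illustration only, not code from the PR: the function names are hypothetical, plain `std::vector` replaces `ov::Tensor`, and it assumes that masked (padded) positions are ignored by the model, so their position value is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Batched (padded) case: "parse" one attention mask row. Real tokens (mask == 1)
// are numbered 0, 1, 2, ... in order; padded positions (mask == 0) keep a
// placeholder value of 0.
std::vector<int64_t> position_ids_from_mask(const std::vector<int64_t>& attention_mask_row) {
    std::vector<int64_t> position_ids(attention_mask_row.size(), 0);
    int64_t next_position = 0;
    for (std::size_t i = 0; i < attention_mask_row.size(); ++i) {
        assert(attention_mask_row[i] == 0 || attention_mask_row[i] == 1);
        if (attention_mask_row[i] == 1) {
            position_ids[i] = next_position++;
        }
    }
    return position_ids;
}

// One-by-one (unpadded) case: no mask to parse, positions are simply 0..length-1.
std::vector<int64_t> position_ids_unpadded(std::size_t length) {
    std::vector<int64_t> position_ids(length);
    std::iota(position_ids.begin(), position_ids.end(), int64_t{0});
    return position_ids;
}
```

The unpadded variant is simpler per prompt, while the batched variant needs only one tokenizer call, which is the trade-off the comment leaves open.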

@ilya-lavrenov ilya-lavrenov merged commit e4238b7 into openvinotoolkit:master Apr 11, 2024
10 checks passed
Wovchena added a commit that referenced this pull request Jun 7, 2024
LLMs return logits with the probabilities of each token; these
probabilities can be converted to tokens/words with different techniques:
greedy decoding, beam search decoding, random sampling, etc.

This requires writing user-unfriendly post-processing even for the
simplest scenario of greedy decoding. To make life easier, we
combined all decoding scenarios into a single function call, where the
decoding method and parameters are specified by arguments.
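
For context, the greedy post-processing mentioned above boils down to picking the highest-scoring token at the last position. A minimal sketch, assuming the logits for that position are already available as a plain `std::vector<float>` (the function name is hypothetical and this is not the LLMPipeline API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Greedy decoding step: returns the index (token id) of the largest logit.
// On ties, std::max_element returns the first maximum, so the lowest id wins.
int64_t greedy_next_token(const std::vector<float>& last_position_logits) {
    assert(!last_position_logits.empty());
    const auto best = std::max_element(last_position_logits.begin(), last_position_logits.end());
    return static_cast<int64_t>(std::distance(last_position_logits.begin(), best));
}
```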

In this PR we provide a user-friendly API for text generation inspired
by the `generate` method from the HuggingFace transformers library.

- [x] enable calling tokenizers/detokenizers from LLMPipeline
- [ ] add callback for streaming mode - done partially, need to improve
- [x] rewritten samples with the current approach:
[causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83](https://github.com/pavel-esir/openvino.genai/blob/generate_pipeline/text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83)
- [x] Multibatch greedy decoding
- [ ] Speculative decoding
- [ ] Grouped Beam Search decoding: ready for batch 1, need to rebase
multibatch support after merging
#349
- [x] Random sampling

Example 1: Greedy search generation
```
LLMPipeline pipe(model_path, device);

// Tries to load the config from generation_config.json;
// if it is not found, default values for greedy search are used.
GenerationConfig config = pipe.generation_config();

cout << pipe(prompt, config.max_new_tokens(20));
```

Example 2: TextStreaming mode
```
LLMPipeline pipe(model_path, device);

GenerationConfig config = pipe.generation_config();

auto text_streamer = TextStreamer{pipe};
auto text_streamer_callback = [&text_streamer](std::vector<int64_t>&& tokens, LLMPipeline& pipe){
    text_streamer.put(tokens[0]);
};

pipe(prompt, config.max_new_tokens(20).set_callback(text_streamer_callback));
text_streamer.end();
```

CVS-132907 CVS-137920

---------

Co-authored-by: Wovchena <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Alexander Suvorov <[email protected]>
Co-authored-by: Yaroslav Tarkan <[email protected]>
Co-authored-by: Xiake Sun <[email protected]>
Co-authored-by: wenyi5608 <[email protected]>
Co-authored-by: Ekaterina Aidova <[email protected]>
Co-authored-by: guozhong wang <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Wovchena added a commit to Wovchena/openvino.genai-public that referenced this pull request Jun 7, 2024
@Wovchena Wovchena mentioned this pull request Jun 7, 2024
7 tasks
ilya-lavrenov added a commit that referenced this pull request Jun 10, 2024
iefode added a commit to iefode/openvino.genai that referenced this pull request Jun 11, 2024
commit adec0e0
Author: Irina Efode <[email protected]>
Date:   Tue Jun 11 14:32:45 2024 +0400

    Remove extra token desc

commit a64f30a
Author: Irina Efode <[email protected]>
Date:   Tue Jun 11 13:36:01 2024 +0400

    Working sampler

commit 05048ff
Author: Irina Efode <[email protected]>
Date:   Tue Jun 11 13:23:43 2024 +0400

    check

commit e349418
Merge: bfaa55a 0b1ce98
Author: Irina Efode <[email protected]>
Date:   Mon Jun 10 23:11:58 2024 +0400

    Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into penalties

commit 0b1ce98
Merge: 16d857e 2da1556
Author: Ilya Lavrenov <[email protected]>
Date:   Mon Jun 10 18:52:20 2024 +0400

    Merge pull request openvinotoolkit#21 from iefode/n_support

    Support num_return_seq for multinomial case

commit bfaa55a
Author: Irina Efode <[email protected]>
Date:   Mon Jun 10 17:42:01 2024 +0400

    Fix tests

commit fa0efb6
Author: Irina Efode <[email protected]>
Date:   Mon Jun 10 16:41:04 2024 +0400

    Config tests

commit 7551303
Author: Irina Efode <[email protected]>
Date:   Mon Jun 10 15:34:14 2024 +0400

    Implement LogitTransformers. todo config check

commit 16d857e
Merge: 76148c5 1ee4f38
Author: Ilya Lavrenov <[email protected]>
Date:   Mon Jun 10 10:41:27 2024 +0200

    Merge remote-tracking branch 'upstream/master' into ct-beam-search

commit 1ee4f38
Author: guozhong wang <[email protected]>
Date:   Sun Jun 9 18:26:57 2024 +0800

    Add option --prompt_index (openvinotoolkit#481)

    Run the corresponding prompt according to the option prompt index

commit 9902928
Author: Pavel Esir <[email protected]>
Date:   Fri Jun 7 20:57:47 2024 +0200

    Generate pipeline (openvinotoolkit#334)

commit 2da1556
Author: Irina Efode <[email protected]>
Date:   Thu Jun 6 19:24:45 2024 +0400

    library/src/continuous_batching_pipeline.cpp

commit 7b48fa4
Author: Irina Efode <[email protected]>
Date:   Thu Jun 6 15:03:05 2024 +0400

    enable streaming for greedy

commit 5c601e0
Author: Irina Efode <[email protected]>
Date:   Thu Jun 6 13:29:47 2024 +0400

    Comments

commit 4f73d36
Author: Irina Efode <[email protected]>
Date:   Wed Jun 5 22:46:04 2024 +0400

    Enable frequency and presence penalties

commit 5e49c46
Author: Irina Efode <[email protected]>
Date:   Wed Jun 5 11:56:31 2024 +0400

    Fix python tests

commit eb4a219
Author: Irina Efode <[email protected]>
Date:   Tue Jun 4 22:38:43 2024 +0400

    fix assert place

commit f4d8461
Author: Irina Efode <[email protected]>
Date:   Tue Jun 4 22:22:37 2024 +0400

    Correct accumulation

commit 55448a1
Merge: 1128792 76148c5
Author: Irina Efode <[email protected]>
Date:   Tue Jun 4 18:56:42 2024 +0400

    Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support

commit 1128792
Author: Irina Efode <[email protected]>
Date:   Tue Jun 4 18:52:38 2024 +0400

    test

commit e245041
Author: Irina Efode <[email protected]>
Date:   Tue Jun 4 18:52:03 2024 +0400

    Apply comments

commit 561cde0
Author: guozhong wang <[email protected]>
Date:   Tue Jun 4 16:27:08 2024 +0800

    using sdpa for statble diffusion (openvinotoolkit#458)

    Co-authored-by: Chen Peter <[email protected]>

commit 04510d4
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Jun 3 17:37:41 2024 +0000

    Bump optimum[openvino] from 1.19.2 to 1.20.0 in /text_generation/causal_lm/cpp (openvinotoolkit#467)

commit db4a88f
Merge: e5d33f5 b63bda2
Author: Irina Efode <[email protected]>
Date:   Mon Jun 3 13:17:32 2024 +0400

    Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support

commit e5d33f5
Merge: fe29df9 bcdcefc
Author: Irina Efode <[email protected]>
Date:   Fri May 31 14:11:13 2024 +0400

    Merge remote-tracking branch 'ilavrenov_upstream/ct-beam-search' into n_support

commit fe29df9
Author: Irina Efode <[email protected]>
Date:   Fri May 31 14:06:51 2024 +0400

    Tests + Readme

commit 7af72aa
Author: Irina Efode <[email protected]>
Date:   Wed May 29 15:16:23 2024 +0400

    Squashed commit of the following:

    commit 28af66d
    Author: Anastasiia Pnevskaia <[email protected]>
    Date:   Tue May 28 15:40:15 2024 +0200

        Added cache_size to python binding of scheduler config.

    commit 65a793a
    Author: Anastasiia Pnevskaia <[email protected]>
    Date:   Tue May 28 15:12:16 2024 +0200

        Fixed tests.

commit 033558e
Author: Irina Efode <[email protected]>
Date:   Wed May 29 00:40:48 2024 +0400

    One more change

commit dbae0bf
Merge: f992591 2c2799f
Author: Irina Efode <[email protected]>
Date:   Wed May 29 00:38:52 2024 +0400

    Merge master, without py tests

commit a5b14c7
Author: Lyalyushkin Nikolay <[email protected]>
Date:   Tue May 28 16:15:42 2024 +0200

    grammar corrector support in WWB (openvinotoolkit#462)

    This PR introduces support for `AutoForSeq2SeqLM` models in WWB.
    Previously, WWB only supported `AutoForCausalLM`, assuming that the
    `generate` method copies the prompt to the output.
    But AutoForSeq2SeqLM generates output differently: there is no copy of
    the prompt, and it directly generates output.

    The fix was checked on the
    [example](https://gist.github.com/ljaljushkin/5a489a27cd0020ddbd42ea7ae54be688).
    It evaluates grammar correction with Seq2Seq model using WWB.

commit f992591
Author: Irina Efode <[email protected]>
Date:   Tue May 28 17:39:17 2024 +0400

    tmp

commit 7e771f1
Author: Liwenke <[email protected]>
Date:   Tue May 28 15:24:15 2024 +0800

    Note for wikitext data set connection issue (openvinotoolkit#452)

    Co-authored-by: Chen Peter <[email protected]>

commit 24ef06e
Author: guozhong wang <[email protected]>
Date:   Tue May 28 14:23:19 2024 +0800

    Force to generate more tokens (openvinotoolkit#457)

commit 1ed7539
Author: guozhong wang <[email protected]>
Date:   Tue May 28 09:44:45 2024 +0800

    Correct flan-t5 output size (openvinotoolkit#451)

    openvinotoolkit#358

    ---------

    Co-authored-by: Chen Peter <[email protected]>

commit b5a9f28
Author: Irina Efode <[email protected]>
Date:   Mon May 27 23:48:03 2024 +0400

    Extend in beam support

commit edc53e5
Author: Irina Efode <[email protected]>
Date:   Fri May 24 17:59:48 2024 +0400

    remove extra

commit 9038308
Author: Irina Efode <[email protected]>
Date:   Fri May 24 16:20:13 2024 +0400

    Improve multinomial

commit c453e3e
Author: Irina Efode <[email protected]>
Date:   Fri May 24 15:42:48 2024 +0400

    Support num_return_seq for multinomial case

commit e6f05c6
Author: guozhong wang <[email protected]>
Date:   Thu May 23 11:36:50 2024 +0800

    Output median min and avg values to csv (openvinotoolkit#450)

    Co-authored-by: Chen Peter <[email protected]>

commit 25909cc
Author: guozhong wang <[email protected]>
Date:   Thu May 23 11:12:27 2024 +0800

    verify beam search 1st token optimization (openvinotoolkit#426)

    The minimum version of transformers to get 1st and 2nd tokens latency is
    v4.40-release.

commit 03e78fe
Author: Chen Peter <[email protected]>
Date:   Wed May 22 13:06:11 2024 +0800

    Revert "Force to generate "inference count" tokens" (openvinotoolkit#455)

    Reverts openvinotoolkit#289 to unblock the release, since it causes a
    performance regression on some models (investigation of the cause is
    in progress).

commit 05a0f36
Author: Ekaterina Aidova <[email protected]>
Date:   Tue May 21 20:33:26 2024 +0400

    fix path based configuration (openvinotoolkit#456)

commit 41b07d3
Author: Ekaterina Aidova <[email protected]>
Date:   Fri May 17 06:20:18 2024 +0400

    Fix md5 hash for env that does not support usedforsecurity arg (openvinotoolkit#445)

    I got an error running benchmarking on my working machine (python3.8,
    ubuntu20) due to unsupported args for hashlib.
    ```
    [ ERROR ] An exception occurred
    [ INFO ] Traceback (most recent call last):
      File "benchmark.py", line 532, in main
        iter_data_list, pretrain_time = CASE_TO_BENCH[model_args['use_case']](model_path, framework, args.device, model_args, args.num_iters)
      File "benchmark.py", line 194, in run_text_generation_benchmark
        run_text_generation(input_text, num, model, tokenizer, args, iter_data_list, warmup_md5, prompt_idx, bench_hook, model_precision, proc_id)
      File "benchmark.py", line 131, in run_text_generation
        result_md5_list.append(hashlib.md5(result_text.encode(), usedforsecurity=False).hexdigest())
    TypeError: openssl_md5() takes at most 1 argument (2 given)
    ```
    Based on this [StackOverflow
    issue](https://stackoverflow.com/questions/54717862/how-do-i-know-if-the-usedforsecurity-flag-is-supported-by-hashlib-md5),
    not all clients support this argument, and using hashlib.new("md5")
    instead of hashlib.md5 should be safe in both cases.

commit d473e96
Author: guozhong wang <[email protected]>
Date:   Fri May 17 10:09:27 2024 +0800

    output no hook data warning when it is text gen model (openvinotoolkit#449)

commit cad3abb
Author: guozhong wang <[email protected]>
Date:   Thu May 16 17:28:49 2024 +0800

    Fix an attempt to add a string value to a numerical value (openvinotoolkit#447)

commit 93f7670
Author: Ekaterina Aidova <[email protected]>
Date:   Thu May 16 11:49:08 2024 +0400

    update optimum intel commit in llm bench (openvinotoolkit#444)

commit d73346c
Author: Yaroslav Tarkan <[email protected]>
Date:   Wed May 15 14:24:30 2024 +0300

    Fix noise images generated for '--num' > 1 in Stable Diffusion sample (openvinotoolkit#441)

    Fixes openvinotoolkit#405
ScottZhang812 pushed a commit to ScottZhang812/_openvino.genai that referenced this pull request Dec 23, 2024