
Add prompt lookup decoding #379

Merged: 15 commits into openvinotoolkit:master, Apr 29, 2024

Conversation

@as-suvorov (Contributor) commented Apr 23, 2024

Ticket: 138549

@as-suvorov as-suvorov requested a review from Wovchena April 23, 2024 17:16
@as-suvorov as-suvorov marked this pull request as draft April 24, 2024 11:48
@as-suvorov as-suvorov requested a review from Wovchena April 25, 2024 10:22
@as-suvorov as-suvorov marked this pull request as ready for review April 25, 2024 10:22
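For background on the feature itself: prompt lookup decoding drafts candidate tokens by searching the already-seen context for its most recent n-gram and reusing the tokens that followed the match, then verifies all candidates with the main model in a single batched inference. Below is a minimal illustrative sketch of that matching step, not this PR's actual code; `generate_candidates`, `ngram_size`, and `num_candidates` are assumed names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch only: find the latest ngram_size tokens of the
// sequence earlier in the context and propose the tokens that followed
// that match as draft candidates for the main model to verify.
std::vector<int64_t> generate_candidates(const std::vector<int64_t>& tokens,
                                         size_t ngram_size,
                                         size_t num_candidates) {
    if (tokens.size() <= ngram_size)
        return {};
    const size_t suffix = tokens.size() - ngram_size;  // start of the query n-gram
    for (size_t i = 0; i + ngram_size < tokens.size(); ++i) {
        if (std::equal(tokens.begin() + i, tokens.begin() + i + ngram_size,
                       tokens.begin() + suffix)) {
            const size_t begin = i + ngram_size;
            const size_t end = std::min(begin + num_candidates, tokens.size());
            return {tokens.begin() + begin, tokens.begin() + end};
        }
    }
    return {};  // no match: fall back to plain autoregressive decoding
}
```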
@as-suvorov (Contributor, Author) commented:

@Wovchena, we have a proposal from @sammysun0711 for optimizing the KV cache trim (sammysun0711@d7a24e5), based on parallel_for. It could give a 3x speedup for the cache update. Should we apply it as well?
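The linked commit is not reproduced in the thread; as a hedged illustration only, a parallel_for based trim along the sequence axis could look roughly like this, assuming ov::parallel_for from `<openvino/core/parallel.hpp>` and a `{batch, heads, seq_len, head_dim}` KV layout:

```cpp
#include <cstdint>
#include <cstring>
#include <openvino/core/parallel.hpp>
#include <openvino/openvino.hpp>

// Hedged sketch, not the linked commit: trim a KV cache tensor laid out as
// {batch, heads, seq_len, head_dim} down to new_seq_len, copying each
// (batch, head) slice as one parallel_for task.
ov::Tensor trim_kv_tensor(const ov::Tensor& src, uint64_t new_seq_len) {
    ov::Shape shape = src.get_shape();
    const size_t old_seq_len = shape[2];
    const size_t head_dim = shape[3];
    shape[2] = new_seq_len;
    ov::Tensor dst(src.get_element_type(), shape);

    const size_t elem_size = src.get_element_type().size();
    const size_t slice_src = old_seq_len * head_dim * elem_size;  // bytes per source slice
    const size_t slice_dst = new_seq_len * head_dim * elem_size;  // bytes per trimmed slice
    const auto* src_bytes = static_cast<const uint8_t*>(src.data());
    auto* dst_bytes = static_cast<uint8_t*>(dst.data());

    // One memcpy per (batch, head) pair, distributed across threads.
    ov::parallel_for(shape[0] * shape[1], [&](size_t i) {
        std::memcpy(dst_bytes + i * slice_dst, src_bytes + i * slice_src, slice_dst);
    });
    return dst;
}
```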


Review thread on:

```cpp
// cut redundant candidates on last iteration
size_t tokens_to_generate = max_sequence_length - seq_len;
if (candidates.size() > tokens_to_generate - 1) {
```
Contributor (reviewer) commented:

Looks like we can unconditionally call resize, because in the case of the same size it does nothing.

@as-suvorov (Contributor, Author) replied:

Do you mean something like `candidates.resize(std::min(candidates.size(), tokens_to_generate - 1));`?
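For clarity, the two variants discussed in this thread side by side (the guarded body is inferred from the condition shown above, so treat it as a sketch):

```cpp
// Variant under review: a branch guards the resize.
if (candidates.size() > tokens_to_generate - 1) {
    candidates.resize(tokens_to_generate - 1);
}

// Suggested fold: resize is a no-op when the size already matches,
// so the condition can be absorbed into a single call.
candidates.resize(std::min(candidates.size(), tokens_to_generate - 1));
```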

Review thread on:

```cpp
    return new_tensor;
}

void update_kv_cache(ov::InferRequest request, uint64_t seq_len_axis, uint64_t new_seq_len) {
```
Collaborator commented:

> @Wovchena, we have a proposal from @sammysun0711 for optimizing the KV cache trim (sammysun0711@d7a24e5), based on parallel_for. It could give a 3x speedup for the cache update. Should we apply it as well?

Is there a way to link with TBB from the openvino package? @ilya-lavrenov, do you know a way? If yes, feel free to apply.
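For context, a hedged sketch of what a trim-based update_kv_cache can reduce to with OpenVINO's stateful API (query_state, get_state, and set_state are standard ov::InferRequest / ov::VariableState calls; trim_kv_tensor is the illustrative helper from the sketch above, which hard-codes the sequence axis):

```cpp
// Hedged sketch: shrink every KV state tensor held by the request down to
// new_seq_len. Assumes the stateful-model API (query_state / get_state /
// set_state) and the illustrative trim_kv_tensor helper sketched earlier;
// seq_len_axis is unused here because that helper assumes axis 2.
void update_kv_cache(ov::InferRequest request, uint64_t seq_len_axis, uint64_t new_seq_len) {
    for (auto&& state : request.query_state()) {
        ov::Tensor trimmed = trim_kv_tensor(state.get_state(), new_seq_len);
        state.set_state(trimmed);
    }
}
```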

@as-suvorov (Contributor, Author) commented:

@ilya-lavrenov, @Wovchena, do you mind if I address the remaining comments in follow-up PRs?

  1. Apply the parallel_for optimization to trim_tensor
  2. Apply the optimized trim_tensor implementation to speculative_decoding
  3. Investigate candidates_size + 1 inference for speculative_decoding

(A review thread on text_generation/causal_lm/cpp/README.md was marked outdated and resolved.)
@Wovchena merged commit 27083bd into openvinotoolkit:master on Apr 29, 2024. 11 checks passed.