Add prompt lookup decoding #379
…as/prompt_lookup_decoding
// cut redundant candidates on last iteration
size_t tokens_to_generate = max_sequence_length - seq_len;
if (candidates.size() > tokens_to_generate - 1) {
Looks like we can unconditionally call `resize`, because when the size is unchanged it does nothing.
Do you mean something like `candidates.resize(std::min(candidates.size(), tokens_to_generate - 1));`?
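For illustration, the clamp discussed above could be sketched as a standalone helper. The function name `clamp_candidates` and the guards against `size_t` underflow are assumptions added here; they are not part of the PR, which computes `max_sequence_length - seq_len` directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): keep at most tokens_to_generate - 1 candidates.
void clamp_candidates(std::vector<int64_t>& candidates,
                      size_t max_sequence_length,
                      size_t seq_len) {
    // Assumption: guard both subtractions, since size_t arithmetic wraps around on underflow.
    size_t tokens_to_generate = max_sequence_length > seq_len ? max_sequence_length - seq_len : 0;
    size_t limit = tokens_to_generate > 0 ? tokens_to_generate - 1 : 0;
    // std::min keeps the call unconditional: resize never grows the vector here,
    // and resizing to the current size is a no-op.
    candidates.resize(std::min(candidates.size(), limit));
    assert(candidates.size() <= limit);
}
```

Without the `std::min`, an unconditional `resize(tokens_to_generate - 1)` would also grow a shorter candidate list by appending zero-valued tokens, which is presumably not what is wanted.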
return new_tensor;
}

void update_kv_cache(ov::InferRequest request, uint64_t seq_len_axis, uint64_t new_seq_len) {
@Wovchena, we have a proposal for optimizing KV cache trimming from @sammysun0711: sammysun0711@d7a24e5, based on a parallel for. It could give a 3x speedup for the cache update. Should we apply it as well?
Is there a way to link with TBB from the OpenVINO package? @ilya-lavrenov, do you know a way? If yes, feel free to apply.
See custom operations as an example: https://github.com/openvinotoolkit/openvino_contrib/blob/master/modules/custom_operations/user_ie_extensions/CMakeLists.txt#L20. `ov::parallel_for` is used there.
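As a rough sketch of why the trim parallelizes well, the copy can be written over a plain dense buffer. Everything here is an assumption for illustration: the helper name `trim_seq_axis`, the row-major `[outer, seq_len, inner]` layout (for a `[batch, heads, seq_len, head_dim]` cache, `outer = batch * heads` and `inner = head_dim`), and the serial loop. It is not the code from the linked commit and does not use the OpenVINO API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a KV cache trim: src is a dense row-major buffer of
// shape [outer, seq_len, inner]; the result has shape [outer, new_seq_len, inner].
std::vector<float> trim_seq_axis(const std::vector<float>& src,
                                 size_t outer, size_t seq_len, size_t inner,
                                 size_t new_seq_len) {
    assert(new_seq_len <= seq_len);
    assert(src.size() == outer * seq_len * inner);
    std::vector<float> dst(outer * new_seq_len * inner);
    const size_t src_stride = seq_len * inner;      // elements per outer slice in src
    const size_t dst_stride = new_seq_len * inner;  // elements per outer slice in dst
    // Each iteration reads and writes a disjoint contiguous slice, so this is the
    // loop a parallel-for could distribute across threads.
    for (size_t o = 0; o < outer; ++o) {
        std::copy_n(src.data() + o * src_stride, dst_stride, dst.data() + o * dst_stride);
    }
    return dst;
}
```

Because the leading `new_seq_len` positions of every outer slice are contiguous, each slice needs only one bulk copy, and the slices share no data.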
@ilya-lavrenov, @Wovchena, do you mind if I address the remaining comments in follow-up PRs?
- Apply the `parallel_for` optimization for trim tensor
- Apply the optimized trim tensor implementation for `speculative_decoding`
- Investigate `candidates_size + 1` inference for `speculative_decoding`
Ticket: 138549