why not use the last token for kv cache compression #25

Open
Arist12 opened this issue Dec 1, 2024 · 0 comments
Arist12 commented Dec 1, 2024

Thanks for this interesting work. I have the following question after reading it:

In the observation experiment, the hit rate is computed by:

the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation.

Here, I believe the actual ones used by generation are derived from the last input token, which also lies in the last window. So one very straightforward thought is: why not directly use the attention scores from the last token for KV cache compression?
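To make the comparison concrete, here is a minimal sketch (my own toy setup, not the paper's code) contrasting the two selection strategies on random causal attention weights: scoring prefix KV entries by the average attention from an observation window of the last few query tokens versus by the attention of the final token alone, then measuring how much the two top-k selections overlap. All names (`window_size`, `keep_window`, `keep_last`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, window_size, k = 128, 8, 16

# Toy causal attention for one head: each row is a query's softmax
# distribution over its own and earlier positions (lower triangular).
logits = rng.normal(size=(seq_len, seq_len))
logits[~np.tril(np.ones((seq_len, seq_len), dtype=bool))] = -np.inf
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

prefix = seq_len - window_size  # score only the prefix KV entries

# Strategy A: average attention from the last `window_size` query tokens.
window_scores = attn[-window_size:, :prefix].mean(axis=0)
# Strategy B: attention from the final query token only.
last_scores = attn[-1, :prefix]

keep_window = set(np.argsort(window_scores)[-k:].tolist())
keep_last = set(np.argsort(last_scores)[-k:].tolist())

overlap = len(keep_window & keep_last) / k
print(f"overlap between window-based and last-token selection: {overlap:.2f}")
```

With real model attention instead of random weights, running this kind of check would show directly how much the last token's selection diverges from the window-averaged one, which is essentially what the question is asking about.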
