why not use the last token for kv cache compression #25

Arist12 · 2024-12-01T06:12:31Z

Thanks for this interesting work. I have the following question after reading it:

In the observation experiment, the hit rate is defined as:

the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation.

Here, I believe the "actual ones used by generation" are derived from the last input token, which also lies in the last window. So my straightforward question is: why not use the attention scores of the last token directly for KV cache compression?
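To make the comparison concrete, here is a minimal sketch of the two selection rules being contrasted: picking important prefix positions from the attention averaged over an observation window versus from the last token's attention row alone. The attention values are invented toy numbers, and the function names (`top_k_indices`, `window_average`, `hit_rate`) are hypothetical helpers for illustration, not the repository's code.

```python
def top_k_indices(scores, k):
    """Return the set of the k positions with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def window_average(attn_rows):
    """Average attention over all query rows in the observation window."""
    n_keys = len(attn_rows[0])
    return [sum(row[j] for row in attn_rows) / len(attn_rows) for j in range(n_keys)]


def hit_rate(selected, reference):
    """Fraction of reference positions that also appear in the selected set."""
    return len(selected & reference) / len(reference)


# Toy attention weights (assumed data): 3 window queries over 6 prefix positions.
# The final row plays the role of the last input token.
window_attn = [
    [0.40, 0.05, 0.30, 0.05, 0.10, 0.10],
    [0.35, 0.10, 0.25, 0.05, 0.15, 0.10],
    [0.10, 0.05, 0.45, 0.05, 0.30, 0.05],
]

k = 2
by_window = top_k_indices(window_average(window_attn), k)
by_last_token = top_k_indices(window_attn[-1], k)
overlap = hit_rate(by_last_token, by_window)

print(sorted(by_window), sorted(by_last_token), overlap)
```

In this toy case the two rules agree on only one of the two selected positions, which shows that a single token's attention row and a window average can select different features. Whether that difference matters in practice is exactly what the question asks.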
