Thanks for this interesting work. I have the following question after reading it:
In the observation experiment, the hit rate is computed by:
"the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation."
Here, I believe the "actual ones used by generation" are derived from the last input token, which also lies in the last window. So one straightforward question is: why not directly use the attention scores of the last token for KV cache compression?
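To make the question concrete, here is a minimal sketch of the two selection rules I am comparing. It is a toy in plain Python, not the repository's code: the function names, the `attn[q][k]` layout (one row of attention weights over the prefix per query token), and the random attention values are all my own assumptions for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def window_scores(attn, window):
    """Average attention per prefix position over the last `window` query rows."""
    rows = attn[-window:]
    n_keys = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_keys)]


def last_token_scores(attn):
    """Attention of the last input token only (a window of size 1)."""
    return window_scores(attn, 1)


def top_k_indices(scores, k):
    """Set of the k prefix positions with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap_rate(selected, reference):
    """Fraction of the reference positions that were also selected."""
    return len(selected & reference) / len(reference)


# Toy data (assumption): 16 query tokens attending to 64 prefix positions.
random.seed(0)
attn = [softmax([random.gauss(0.0, 1.0) for _ in range(64)]) for _ in range(16)]

k = 8
kept_by_window = top_k_indices(window_scores(attn, window=16), k)
kept_by_last = top_k_indices(last_token_scores(attn), k)

# How much the two rules agree on which KV positions to keep.
print(overlap_rate(kept_by_last, kept_by_window))
```

With real attention maps, `reference` in `overlap_rate` would be the positions actually attended to during generation, and the question is whether `kept_by_last` scores a noticeably lower hit rate than `kept_by_window`.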