Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dynamic cache for sliding window attention #34352

Open
wants to merge 60 commits into
base: main
Choose a base branch
from

Conversation

Cyrilvallez
Copy link
Member

@Cyrilvallez Cyrilvallez commented Oct 23, 2024

What does this PR do?

This supersedes #33619 to introduce a new DynamicSlidingWindowCache.

This cache behaves the exact same way as DynamicCache, and contrarily to SlidingWindowCache (static variant), we can correctly continue generation from an existing cache instance (fully filled or not), with more than 1 new token added to the sequence (e.g. for prefix caching).

In order for this to work I had to do the following main modifications:

  • I introduced get_past_seen_tokens(), which should replace get_seq_length() (almost) everywhere. This is because once the cache is filled, the 2 will not longer provide the same information, so we need to correctly differentiate to recreate the cache_position
  • for every cache class except DynamicSlidingWindowCache, both get_past_seen_tokens and get_seq_length will return the same value, so no issue there
  • For every generative model using sliding window, I modified the 4d causal mask creation for the new cache class for both sdpa and eager, and correctly sliced the 2d mask for FA2

Once generate no longer returns legacy cache by default (tuples), we can make this class the default for generative models with sliding window. We cannot do it before because we would lose the information of the past seen tokens if the cache is full.

It also makes a nice precedent to support the same features with the static variant in a subsequent PR.

@Cyrilvallez Cyrilvallez force-pushed the sliding-window branch 4 times, most recently from 456eda6 to 21b472c Compare October 24, 2024 17:23
@Cyrilvallez Cyrilvallez marked this pull request as ready for review October 24, 2024 18:54
@Cyrilvallez Cyrilvallez changed the title Sliding window New dynamic cache for sliding window attention Oct 24, 2024
@Cyrilvallez
Copy link
Member Author

Ok @ArthurZucker it has been a long time on the side but it's now ready for final review!

@Cyrilvallez
Copy link
Member Author

Slow tests for Mistral are all good (4 failing but similar on main)

@ArthurZucker
Copy link
Collaborator

Hey! As we discussed offline, not super super sure we need this. The big blocker for me is that you are changing a lot of model forward, which is something we try to avoid !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants