[OLD] New PR: #35029. [Universal Speculative Decoding CandidateGenerator] #34760
Conversation
https://arxiv.org/pdf/2404.09492 Found this paper that attempts to align different vocabularies across different LLM families; it creates a projection matrix that projects the different LLMs' outputs into a shared embedding domain.
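A conceptual sketch of that projection idea (the shapes and the use of a plain learned linear map are my assumptions, not the paper's actual method):

```python
import torch

# Conceptual sketch of the projection idea as I read it; dimensions and the
# plain linear map are assumptions, not the paper's method.
draft_vocab_size, shared_dim = 32_000, 4_096
projection = torch.nn.Linear(draft_vocab_size, shared_dim, bias=False)  # learned offline

draft_logits = torch.randn(1, draft_vocab_size)  # one decoding step from the draft model
shared_repr = projection(draft_logits)           # mapped into the common embedding domain
```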
Thanks for sharing, Gaurav. My takeaways from the paper:
I have run a similar analysis on the model pairs used in the Universal Assisted Generation blog. The last two columns in the table below show, respectively, the overlap percentage of the draft vocab w.r.t. the target vocab and the overlap percentage of the target vocab w.r.t. the draft vocab.
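For reference, such overlap percentages can be computed along these lines (the checkpoint names are placeholders, not necessarily the exact pairs used in the analysis):

```python
from transformers import AutoTokenizer

# Placeholder checkpoints; substitute the draft/target pair being compared.
draft_tok = AutoTokenizer.from_pretrained("double7/vicuna-68m")
target_tok = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")

draft_vocab = set(draft_tok.get_vocab())
target_vocab = set(target_tok.get_vocab())
overlap = draft_vocab & target_vocab

print(f"draft vocab covered by target: {len(overlap) / len(draft_vocab):.1%}")
print(f"target vocab covered by draft: {len(overlap) / len(target_vocab):.1%}")
```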
```python
candidate_ids = assistant_output.sequences
device = candidate_ids.device
candidate_ids = candidate_ids.cpu()
candidate_ids.apply_(lambda x: self._assistant_to_target_input_ids[x])
```
Need to fix a bug here: some x values are missing from the _assistant_to_target_input_ids dict.
It might indicate a deeper problem, because the drafter shouldn't generate tokens that are not in self._assistant_to_target_input_ids (which represents the intersection between the vocabularies). I expect the suppress processor to zero out the probability of generating such tokens, but I might have missed something.
The bug occurs when some prompt draft token ids are missing from _assistant_to_target_input_ids. I believe I already addressed this bug in the get_target_ids function.
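For illustration, a minimal sketch of the kind of guarded lookup that avoids the KeyError (the helper name and fallback id are assumptions, not the PR's actual get_target_ids):

```python
import torch

def map_to_target_ids(candidate_ids: torch.Tensor,
                      assistant_to_target: dict[int, int],
                      fallback_id: int = 0) -> torch.Tensor:
    # Hypothetical helper, not the PR's get_target_ids. Draft ids missing from
    # the mapping (e.g. prompt tokens outside the vocab intersection) fall back
    # to a placeholder id instead of raising a KeyError.
    mapped = candidate_ids.cpu().clone()  # Tensor.apply_ only works on CPU tensors
    mapped.apply_(lambda x: assistant_to_target.get(x, fallback_id))
    return mapped.to(candidate_ids.device)
```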
Seeing this error on the latest commit. This is when using the following models:
Update: I added caching and some tests. The current issue is the dimensions of the output logits.
Update: I pulled all the recent commits and ran the same example as above, using the same models. This looks expected because of the vocab size difference; some debugging tells me this is a valid error in that case.
Thanks @gauravjain14.
To make this easier to review, I've split off a smaller PR (#35009) that focuses purely on refactoring the existing code, without introducing the new Universal SD features. The refactor aims to:
@ArthurZucker, @gante, I’d love your feedback and review when you have a moment. Thanks so much!
This branch has diverged. Thanks @gauravjain14 for spotting it.
Please see the new PR: #35029
This PR is ready for initial review, though some aspects are still a work in progress.

### What does this PR do?

This PR introduces the `UniversalSpeculativeDecodingGenerator` class, enabling speculative decoding for assistants with slightly different tokenizers. The key addition is two logits processors (`LogitsProcessor`) that ensure the assistant generates tokens exclusively from the target vocabulary, maintaining alignment and preserving the target distribution without altering the verification method. Theoretically, it is agnostic to the `do_sample` choice. This avoids issues like #32867 and #33534 and sets the stage for advanced universal speculative decoding techniques (which we are currently researching and have not yet published).

### Motivation and Context

This update resolves prior inconsistencies in speculative decoding caused by misaligned vocabularies. Key benefits include:

- Ensuring the assistant generates only tokens present in the target vocabulary.
- Lossless preservation of the target distribution.
- Compatibility with future speculative decoding advancements.

This PR is a step toward advancements in Universal Assisted Generation, in collaboration with @danielkorat, @orenpereg, @mosheber, @jmamou, @gante, @lewtun, and @MosheWasserb.
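To make the suppression idea concrete, here is a minimal sketch (an illustration under my assumptions, not the PR's actual processors) of a `LogitsProcessor` that masks assistant tokens outside the target vocabulary:

```python
import torch
from transformers import LogitsProcessor

class SuppressNonTargetTokens(LogitsProcessor):
    """Illustrative sketch, not the PR's implementation: force the assistant
    to sample only tokens that also exist in the target vocabulary."""

    def __init__(self, shared_assistant_ids: list[int], assistant_vocab_size: int):
        # -inf logits become zero probability after softmax.
        mask = torch.full((assistant_vocab_size,), float("-inf"))
        mask[shared_assistant_ids] = 0.0  # keep tokens in the vocab intersection
        self._mask = mask

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Additive mask: shared tokens are unchanged, all others are suppressed,
        # so the drafter only proposes candidates the target model can verify.
        return scores + self._mask.to(scores.device)
```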
### Related

Issues:

PRs:
- `test_generated_length_assisted_generation` #34935 - please review and merge

### Dependencies

No additional dependencies.

### Before Submitting Checklist

- Followed the [contributor guidelines](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request).
- Add functionality tests.
- Verified adherence to target distribution accuracy.
- Please merge "Speculative decoding: Test the target distribution (to prevent issues like #32867)" #34553.
- Add more tests.
- Measure speedups and add documentation.

### Who can review?

- Speculative decoding and generation: @gante
- Tokenizer alignment: @ArthurZucker