You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We want to enable speculative decoding for non-greedy contexts. Doing this properly involves multiple inconvenient changes to the speculator and speculative_generate code. These changes include having the speculator return logit distributions for each candidate in addition to raw tokens (and using these for acceptance/rejection decisions in speculative_generate), and introducing sampling into the speculator's candidate tree formation process (with the support for temperature, top-k, etc. that that entails).
We desire a simple v0 speculative sampling scheme that does not involve introducing any changes to the speculator itself. A naive approach is: for a given candidate, sample a ground-truth output from the base model (instead of using the greedy output as ground-truth), and then continue to use exact-matching as the acceptance criterion. However, this approach has three main problems:
As temperature increases, acceptance rates go down proportionately. We are willing to bite the bullet on this one for v0 - simply decrease top-k to limit the impact of increased temperature.
In multi-candidate cases, this does not match the base model's output distribution. Consider a simple example of 4 candidates of length 2:
a b
a c
a d
e f
If the model accepts tokens a and e with equal 50% probability, then the odds of accepting a candidate that starts with e is 50% (ignoring the second token). However, the odds of accepting a candidate that starts with a (again ignoring the second token) is 1-.5^3=87.5%. Thus the output distribution is affected by speculator behavior - an undesirable property. Note that this issue also affects the canonical speculative (rejection) sampling approach we are considering for v1.
We can fix the above problem by taking only the candidate with the longest expected length based on base model scores, and then performing per-token comparisons. However, this leads to inefficiency. Consider the following candidates:
a b
c d
e f
g h
If the base model accepts tokens a, c, e, g with equal 25% probability, then the above procedure will accept the first token of the chosen candidate 25% of the time. But this is a problem, because we know that one of our candidates had to have gotten the first token correct - there are only 4 options! So by ignoring "suboptimal" candidates, we've reduced our acceptance rate for the first token by a factor of 4.
The fundamental tension between issues 2 and 3 is that there does not exist a single "ground truth" sampled output from the base model, as base model outputs depend on the speculated input tokens from each candidate. Issue 2 arises because each candidate is comparing against a different sampled output from the base model, while issue 3 arises because we are forcing a single, suboptimal comparison against a single sampled output.
A solution is to implement a version of multinomial sampling that guarantees the same output for the same probability distribution at a given time step. We can imagine an output probability distribution as a number line partitioned into segments, with the size of segment i corresponding to the predicted likelihood of token i. At a given time step we sample a single location on this number line, and for any given probability distribution, the segment that we land inside of is the chosen token. This sampling method guarantees that for two candidates with identical prefixes (and thus identical output distributions), the sampled behavior will be consistent across the candidates, because we are only sampling a single location.
Implemented with this sampling method, we can fix the naive approach (exact matching against sampled output for each candidate) without incurring either issue 2 or 3. It also mitigates issue 1 somewhat, by solving issue 3 (we can compensate for reduced acceptance rates by increasing candidates). Sampled outputs will be different when candidates are different, and the same when candidates are the same. Then we can continue to pick the longest accepted candidate for maximum throughput, while matching base model behavior.
The text was updated successfully, but these errors were encountered:
We want to enable speculative decoding for non-greedy contexts. Doing this properly involves multiple inconvenient changes to the speculator and speculative_generate code. These changes include having the speculator return logit distributions for each candidate in addition to raw tokens (and using these for acceptance/rejection decisions in speculative_generate), and introducing sampling into the speculator's candidate tree formation process (with the support for temperature, top-k, etc. that that entails).
We desire a simple v0 speculative sampling scheme that does not involve introducing any changes to the speculator itself. A naive approach is: for a given candidate, sample a ground-truth output from the base model (instead of using the greedy output as ground-truth), and then continue to use exact-matching as the acceptance criterion. However, this approach has three main problems:
As temperature increases, acceptance rates go down proportionately. We are willing to bite the bullet on this one for v0 - simply decrease top-k to limit the impact of increased temperature.
In multi-candidate cases, this does not match the base model's output distribution. Consider a simple example of 4 candidates of length 2:
If the model accepts tokens
a
ande
with equal 50% probability, then the odds of accepting a candidate that starts withe
is 50% (ignoring the second token). However, the odds of accepting a candidate that starts witha
(again ignoring the second token) is1-.5^3=87.5%
. Thus the output distribution is affected by speculator behavior - an undesirable property. Note that this issue also affects the canonical speculative (rejection) sampling approach we are considering for v1.If the base model accepts tokens
a, c, e, g
with equal 25% probability, then the above procedure will accept the first token of the chosen candidate 25% of the time. But this is a problem, because we know that one of our candidates had to have gotten the first token correct - there are only 4 options! So by ignoring "suboptimal" candidates, we've reduced our acceptance rate for the first token by a factor of 4.The fundamental tension between issues 2 and 3 is that there does not exist a single "ground truth" sampled output from the base model, as base model outputs depend on the speculated input tokens from each candidate. Issue 2 arises because each candidate is comparing against a different sampled output from the base model, while issue 3 arises because we are forcing a single, suboptimal comparison against a single sampled output.
A solution is to implement a version of multinomial sampling that guarantees the same output for the same probability distribution at a given time step. We can imagine an output probability distribution as a number line partitioned into segments, with the size of segment
i
corresponding to the predicted likelihood of tokeni
. At a given time step we sample a single location on this number line, and for any given probability distribution, the segment that we land inside of is the chosen token. This sampling method guarantees that for two candidates with identical prefixes (and thus identical output distributions), the sampled behavior will be consistent across the candidates, because we are only sampling a single location.Implemented with this sampling method, we can fix the naive approach (exact matching against sampled output for each candidate) without incurring either issue 2 or 3. It also mitigates issue 1 somewhat, by solving issue 3 (we can compensate for reduced acceptance rates by increasing candidates). Sampled outputs will be different when candidates are different, and the same when candidates are the same. Then we can continue to pick the longest accepted candidate for maximum throughput, while matching base model behavior.
The text was updated successfully, but these errors were encountered: