For our eval gauntlet, modify the prompt to place "thinking placeholder tokens" ("." or something) at the beginning and end of the prompt:

Control: prompt: <prompt>, response: <response>
Treatment A (prepend "." characters): .prompt: <prompt>, response: <response>
Treatment B (insert "." characters): prompt: <prompt>, .response: <response>
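A minimal sketch of how the three conditions could be constructed. The pad token "." and the count of 500 are assumptions here, not fixed choices from the experiment:

```python
PAD = "." * 500  # placeholder "thinking" tokens; token choice and count are arbitrary

def make_variants(prompt: str) -> dict[str, str]:
    """Build the control and both treatment prompts for one eval item."""
    control = f"prompt: {prompt}, response:"
    return {
        "control": control,                                  # no padding
        "treatment_a": PAD + control,                        # decompression tokens (before the prompt)
        "treatment_b": f"prompt: {prompt}, {PAD}response:",  # planning tokens (after the prompt)
    }

variants = make_variants("What is 2+2?")
```

The model then generates a completion for each variant, and the three completions are scored with the same eval metric.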
Measure: eval gauntlet performance
What I expect could happen (pre-registering)
The thinking tokens at the beginning of the sequence are used to "decompress" the model into the KV cache, which normally happens in the background over a large number of tokens, and wouldn't happen at all if the prompt is too small. Since bigger context = more flops, adding thinking tokens to the beginning of the context should increase performance. Adding thinking tokens to the middle of the context, however, where they benefit from seeing the prompt, lets the model inflate its KV metamodel with more relevant data. I'll call the before-prompt thinking tokens decompression tokens and the after-prompt thinking tokens planning tokens.
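A rough way to see the "bigger context = more flops" point: in a standard decoder, the attention cost of generating one new token grows linearly with the number of cached tokens, so padding the context with placeholder tokens buys extra compute per output token. A toy estimate, with hypothetical model dimensions and ignoring MLP and constant factors:

```python
def attn_flops_per_output_token(context_len: int, n_layers: int = 24, d_model: int = 1024) -> int:
    """Rough attention FLOPs spent on the KV cache for ONE new token:
    each layer does ~2 * context_len * d_model multiply-adds for the
    query-key scores, plus the same again for mixing the values."""
    return n_layers * 2 * (2 * context_len * d_model)

base = attn_flops_per_output_token(context_len=50)          # short prompt alone
padded = attn_flops_per_output_token(context_len=50 + 500)  # plus 500 "." tokens
print(padded / base)  # -> 11.0, i.e. 11x the attention compute per output token
```

Whether that extra compute is usable (the "." positions carry no information, only KV slots the model can write into) is exactly what the experiment is meant to test.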
1. The effect often attributed to planning tokens is actually due to decompression tokens; flops per output token is the dominant factor in determining output-token quality. Predicted ordering: Treatment A > control > Treatment B.
2. In particular, separating the question and the answer with many dividing tokens will cause the positional embeddings (or ALiBi, or whatever) to mess up.
The consequences of 1 and 2 would be that you can "grow" a small model (with no data at all) to make it more capable.
Experiment YAML
Results