LLM tech report sections 3.1, 3.4, 3.5 #15110
Conversation
tech_reports/LLMs/llms.md
Outdated
Total throughput (tokens per second) tells us the total number of tokens that the model can generate per second. `total throughput = batch size / decode step latency`. Total throughput is important for cost-sensitive systems or offline processing, where interactivity is less important than throughput. Generally, increasing batch size will increase total throughput.
User throughput (tokens per second per user) is calculate as `user throughput = 1 / decode step latency`. User throughput tells us how interactive the model is, and tells us how fast the generation is for a single user. Generally, decreasing batch size will increase user throughput.
"is calculate" -> "is calculated". Another minor / optional comment, it might help with readability to have each metric be a separate bullet point.
tech_reports/LLMs/llms.md
Outdated
Tenstorrent maintains a [fork of vLLM](https://github.com/tenstorrent/vllm/tree/dev) for serving models on Tenstorrent hardware. The [README](https://github.com/tenstorrent/vllm/tree/dev/tt_metal) has instructions for setting up the environment.
#### Implementation Requirements
In order to add vLLM support to a new model, the model must conform to a certain interface. An example of the interface is the [Llama2-70b generation code](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_generation.py), which implements `prefill_forward`, `decode_forward`, and `initialize_vllm_model`. Beyond implementing the functionality needed for continuous batching, a model must also implement paged attention. For an example, see [Llama2-70b attention](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_attention_optimized.py).
We also require the cache_path property and different forward calls for trace (although the forward call APIs may change in the near future).
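To make the required interface concrete, here is a hedged skeleton of what such a model class might expose. The names `prefill_forward`, `decode_forward`, and `initialize_vllm_model` come from the linked Llama2-70b generation code, and `cache_path` from the review comment above; the signatures and bodies are assumptions, not the actual API.

```python
# Hedged sketch of the model interface expected by the vLLM fork.
# Method names come from the linked Llama2-70b generation code and the review
# comment above; signatures and bodies are illustrative assumptions.

class TtExampleModelForGeneration:
    @classmethod
    def initialize_vllm_model(cls, hf_config, mesh_device, max_batch_size):
        # Construct the TT model: load weights, set up devices, allocate the
        # paged KV cache, etc.
        ...

    @property
    def cache_path(self):
        # Location of cached / pre-processed weights (noted as required above).
        ...

    def prefill_forward(self, tokens, page_table, kv_cache, prompt_lens):
        # Process the full prompt for each new sequence and fill its pages of
        # the paged KV cache.
        ...

    def decode_forward(self, tokens, start_pos, page_table, kv_cache):
        # Generate one token per user per call (continuous batching, decode mode).
        ...
```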
tech_reports/LLMs/llms.md
Outdated
On the vLLM side there may be additional changes needed to support the new model.
Modify the [`tt_loader.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/model_executor/model_loader/tt_loader.py) if the model requires a different initialization. Modify [`tt_model_runner.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/worker/tt_model_runner.py) if it is missing functionality for the new model.
side note - we should be a little careful about external contributions since we don't yet have branch protections. We should perhaps later add a contributing note/guide to vLLM as well.
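As a purely illustrative sketch of the kind of change meant here, the loader might map a model architecture to the TT class that implements the interface above. The registry, loader function, and class below are invented for illustration; the actual mechanism in `tt_loader.py` may be organized differently.

```python
# Hypothetical sketch of a model-registration hook in tt_loader.py. The
# registry, loader function, and model class are invented for illustration;
# the actual loader in the vLLM fork may organize this differently.

class TtExampleModelForGeneration:
    """Stand-in for a TT model class implementing the interface sketched above."""
    @classmethod
    def initialize_vllm_model(cls, hf_config, mesh_device, max_batch_size):
        return cls()

# Map HF architecture names to the TT classes that implement the interface.
_TT_MODEL_REGISTRY = {
    "ExampleForCausalLM": TtExampleModelForGeneration,
}

def load_tt_model(hf_config, mesh_device, max_batch_size):
    """Look up and construct the TT model for the requested architecture."""
    model_cls = _TT_MODEL_REGISTRY[hf_config.architectures[0]]
    return model_cls.initialize_vllm_model(hf_config, mesh_device, max_batch_size)
```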
Force-pushed from 69b787b to d141ec6.
Please let me know if my review commit diff introduced anything wrong.
I've changed links to tt-metal to point to internal files instead of web links.
I'm not opposed to the basic-level explanations in these sections. They are short, so they don't detract much compared to other sections of the doc. When we have the full doc ready, we should give it a fresh read and update it if that still makes sense.
One other thing here is that you link a lot to other source code. While I agree that this makes sense and keeps the section less busy, the other sections I've reviewed so far opt for more short snippets in the doc. I think both work, so I'm leaving it like this.
tech_reports/LLMs/llms.md
Outdated
- position ids: the position of the tokens in the sequence
- KV cache: an inference optimization that caches intermediate values
In the model, tokens are embedded from the vocabulary space to the embedding space. Position ids are necessary for updating the KV cache and for positional embeddings like RoPE [TODO: Refer to the RoPE section].
After the RoPE section is in, we need to update this link.
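A minimal sketch of what these inputs look like in plain PyTorch, with names and sizes chosen for illustration rather than taken from the tt-metal implementation:

```python
# Minimal PyTorch sketch of the model inputs described above; names and sizes
# are illustrative, not the tt-metal implementation.
import torch

vocab_size, hidden_dim = 32_000, 128
embedding = torch.nn.Embedding(vocab_size, hidden_dim)

# Prefill: the whole prompt at once, position ids 0..seq_len-1.
prompt_tokens = torch.randint(0, vocab_size, (1, 7))            # (batch, seq_len)
prompt_pos = torch.arange(prompt_tokens.shape[1]).unsqueeze(0)  # (batch, seq_len)
prompt_embeds = embedding(prompt_tokens)                        # (batch, seq_len, hidden_dim)

# Decode: one new token per user. Its position id determines which slot of the
# KV cache is updated and which rotation RoPE applies to its query/key.
next_token = torch.randint(0, vocab_size, (1, 1))
next_pos = torch.tensor([[prompt_tokens.shape[1]]])  # position 7 for this user
next_embeds = embedding(next_token)
```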
Force-pushed from c84b9b9 to e43bd86.