LLM tech report sections 3.1, 3.4, 3.5 #15110

Merged: 9 commits into main on Dec 25, 2024

Conversation

cglagovichTT (Contributor)

No description provided.


Total throughput (tokens per second) tells us the total number of tokens that the model can generate per second. `total throughput = batch size / decode step latency`. Total throughput is important for cost-sensitive systems or offline processing, where interactivity is less important than throughput. Generally, increasing batch size will increase total throughput.

User throughput (tokens per second per user) is calculate as `user throughput = 1 / decode step latency`. User throughput tells us how interactive the model is, and tells us how fast the generation is for a single user. Generally, decreasing batch size will increase user throughput.
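As a quick illustration of how these two formulas interact (a minimal sketch, not from the report; the function and variable names are made up for this example):

```python
def throughput_metrics(batch_size: int, decode_step_latency_s: float):
    """Derive both throughput metrics from one measured decode-step latency.

    decode_step_latency_s: wall-clock seconds for one decode step, which
    produces one new token for every user in the batch.
    """
    user_throughput = 1.0 / decode_step_latency_s           # tokens/s per user
    total_throughput = batch_size / decode_step_latency_s   # tokens/s across the batch
    return total_throughput, user_throughput


# Example: 32 users in a batch, 25 ms per decode step.
total_tps, user_tps = throughput_metrics(batch_size=32, decode_step_latency_s=0.025)
print(f"total: {total_tps:.0f} tok/s, per user: {user_tps:.0f} tok/s")
# total: 1280 tok/s, per user: 40 tok/s
```

If doubling the batch size raises the decode-step latency by less than 2x, total throughput still goes up while per-user throughput goes down, which is the trade-off described above.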
Reviewer (Contributor):
"is calculate" -> "is calculated". Another minor / optional comment, it might help with readability to have each metric be a separate bullet point.

Tenstorrent maintains a [fork of vLLM](https://github.com/tenstorrent/vllm/tree/dev) for serving models on Tenstorrent hardware. The [README](https://github.com/tenstorrent/vllm/tree/dev/tt_metal) has instructions for setting up the environment.

#### Implementation Requirements
In order to add vLLM support to a new model, the model must conform to a certain interface. An example of the interface is the [Llama2-70b generation code](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_generation.py), which implements `prefill_forward`, `decode_forward`, and `initialize_vllm_model`. Beyond implementing the functionality needed for continuous batching, a model must also implement paged attention. For an example, see [Llama2-70b attention](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_attention_optimized.py).
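For orientation, a skeleton of that interface might look like the following. The method names come from the linked Llama2-70b generation code; the class name and argument lists below are illustrative assumptions, not the actual tt-metal signatures.

```python
import torch


class TTModelForVLLM:
    """Hypothetical skeleton of the interface vLLM expects from a model.

    Method names follow the Llama2-70b generation code linked above; the
    arguments shown here are assumptions for illustration only.
    """

    @classmethod
    def initialize_vllm_model(cls, hf_config, mesh_device, max_batch_size: int):
        # Open devices, load and shard weights, allocate the paged KV cache.
        return cls()

    def prefill_forward(self, tokens: torch.Tensor, page_table, kv_cache, prompt_lens):
        # Run full prompts through the model, filling the paged KV cache,
        # and return logits for the last position of each prompt.
        raise NotImplementedError

    def decode_forward(self, tokens: torch.Tensor, start_pos, page_table, kv_cache):
        # Produce one token per user, reading and updating the paged KV cache
        # through the page table (this is what enables continuous batching).
        raise NotImplementedError
```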
Reviewer (Contributor):
We also require the cache_path property and different forward calls for trace (although the forward call APIs may change in the near future).

Comment on lines 136 to 138
On the vLLM side there may be additional changes needed to support the new model.

Modify the [`tt_loader.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/model_executor/model_loader/tt_loader.py) if the model requires a different initialization. Modify [`tt_model_runner.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/worker/tt_model_runner.py) if it is missing functionality for the new model.
Reviewer (Contributor):
side note - we should be a little careful about external contributions since we don't yet have branch protections. We should perhaps later add a contributing note/guide to vLLM as well.

@mtairum mtairum assigned mtairum and cglagovichTT and unassigned mtairum Nov 18, 2024
@mtairum mtairum self-requested a review November 18, 2024 18:28
@mtairum mtairum force-pushed the cglagovich/tech_report_3_1 branch from 69b787b to d141ec6 on November 21, 2024 17:23
@mtairum (Contributor) left a comment:

Please let me know if my review commit diff introduced anything wrong.

I've changed the tt-metal links to point to internal files instead of web links.

I'm not opposed to the basic-level explanation in these sections. They are short, so they don't detract much compared to other sections of the doc. Once the full doc is ready, we should give it a fresh read and update these sections if that still makes sense.

One other thing: you link out to other source code a lot. While I agree that this makes sense and keeps the section less busy, the other sections I've reviewed so far opt for more short snippets in the doc. I think both approaches work, so I'm leaving it as is.

- position ids: the positions of the tokens in the sequence
- KV cache: an inference optimization that caches intermediate key and value tensors so they are not recomputed for every new token

In the model, tokens are embedded from the vocabulary space to the embedding space. Position ids are necessary for updating the KV cache and for positional embeddings like RoPE [TODO: Refer to the RoPE section].
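As a rough sketch of why position ids matter for the KV cache update (illustrative shapes and names only, not code from the repo):

```python
import torch

# Illustrative sizes: one layer, 2 users in the batch, 8 KV heads, head_dim 64.
batch, n_kv_heads, max_seq_len, head_dim = 2, 8, 2048, 64
k_cache = torch.zeros(batch, n_kv_heads, max_seq_len, head_dim)
v_cache = torch.zeros(batch, n_kv_heads, max_seq_len, head_dim)


def update_kv_cache(new_k, new_v, position_ids):
    """Write the new token's keys/values at each user's current position.

    new_k, new_v: [batch, n_kv_heads, 1, head_dim] for the single new token.
    position_ids: [batch] position of each user's new token in its sequence.
    """
    for user, pos in enumerate(position_ids.tolist()):
        k_cache[user, :, pos, :] = new_k[user, :, 0, :]
        v_cache[user, :, pos, :] = new_v[user, :, 0, :]
    # Attention for the new token then reads positions [0, pos] from the cache,
    # so keys/values of earlier tokens are reused rather than recomputed.
```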
Reviewer (Contributor):
After the RoPE section is in, we need to update this link.

@uaydonat uaydonat self-requested a review December 19, 2024 02:23
@cglagovichTT cglagovichTT force-pushed the cglagovich/tech_report_3_1 branch from c84b9b9 to e43bd86 on December 25, 2024 03:55
@cglagovichTT cglagovichTT marked this pull request as ready for review December 25, 2024 03:57
@cglagovichTT cglagovichTT merged commit 8e113d0 into main Dec 25, 2024
9 checks passed
@cglagovichTT cglagovichTT deleted the cglagovich/tech_report_3_1 branch December 25, 2024 03:57