LLM tech report sections 3.1, 3.4, 3.5 #15110
Conversation
tech_reports/LLMs/llms.md
Outdated
Total throughput (tokens per second) tells us the total number of tokens that the model can generate per second. `total throughput = batch size / decode step latency`. Total throughput is important for cost-sensitive systems or offline processing, where interactivity is less important than throughput. Generally, increasing batch size will increase total throughput.
User throughput (tokens per second per user) is calculate as `user throughput = 1 / decode step latency`. User throughput tells us how interactive the model is, and tells us how fast the generation is for a single user. Generally, decreasing batch size will increase user throughput.
"is calculate" -> "is calculated". Another minor / optional comment, it might help with readability to have each metric be a separate bullet point.
tech_reports/LLMs/llms.md
Outdated
Tenstorrent maintains a [fork of vLLM](https://github.com/tenstorrent/vllm/tree/dev) for serving models on Tenstorrent hardware. The [README](https://github.com/tenstorrent/vllm/tree/dev/tt_metal) has instructions for setting up the environment.
#### Implementation Requirements
In order to add vLLM support to a new model, the model must conform to a certain interface. An example of the interface is the [Llama2-70b generation code](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_generation.py), which implements `prefill_forward`, `decode_forward`, and `initialize_vllm_model`. Beyond implementing the functionality needed for continuous batching, a model must also implement paged attention. For an example, see [Llama2-70b attention](https://github.com/tenstorrent/tt-metal/blob/main/models/demos/t3000/llama2_70b/tt/llama_attention_optimized.py).
We also require the cache_path property and different forward calls for trace (although the forward call APIs may change in the near future).
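To make the required interface concrete, here is a hedged skeleton of what such a model class might expose. The names `prefill_forward`, `decode_forward`, and `initialize_vllm_model` come from the linked Llama2-70b generation code, and `cache_path` from the review comment above; the signatures and bodies are assumptions, not the actual API.

```python
# Hedged sketch of the model interface expected by the vLLM fork.
# Method names come from the linked Llama2-70b generation code and the review
# comment above; signatures and bodies are illustrative assumptions.

class TtExampleModelForGeneration:
    @classmethod
    def initialize_vllm_model(cls, hf_config, mesh_device, max_batch_size):
        # Construct the TT model: load weights, set up devices, allocate the
        # paged KV cache, etc.
        ...

    @property
    def cache_path(self):
        # Location of cached / pre-processed weights (noted as required above).
        ...

    def prefill_forward(self, tokens, page_table, kv_cache, prompt_lens):
        # Process the full prompt for each new sequence and fill its pages of
        # the paged KV cache.
        ...

    def decode_forward(self, tokens, start_pos, page_table, kv_cache):
        # Generate one token per user per call (continuous batching, decode mode).
        ...
```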
tech_reports/LLMs/llms.md
Outdated
On the vLLM side there may be additional changes needed to support the new model.
Modify the [`tt_loader.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/model_executor/model_loader/tt_loader.py) if the model requires a different initialization. Modify [`tt_model_runner.py`](https://github.com/tenstorrent/vllm/blob/dev/vllm/worker/tt_model_runner.py) if it is missing functionality for the new model.
side note - we should be a little careful about external contributions since we don't yet have branch protections. We should perhaps later add a contributing note/guide to vLLM as well.
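As a purely illustrative sketch of the kind of change meant here, the loader might map a model architecture to the TT class that implements the interface above. The registry, loader function, and class below are invented for illustration; the actual mechanism in `tt_loader.py` may be organized differently.

```python
# Hypothetical sketch of a model-registration hook in tt_loader.py. The
# registry, loader function, and model class are invented for illustration;
# the actual loader in the vLLM fork may organize this differently.

class TtExampleModelForGeneration:
    """Stand-in for a TT model class implementing the interface sketched above."""
    @classmethod
    def initialize_vllm_model(cls, hf_config, mesh_device, max_batch_size):
        return cls()

# Map HF architecture names to the TT classes that implement the interface.
_TT_MODEL_REGISTRY = {
    "ExampleForCausalLM": TtExampleModelForGeneration,
}

def load_tt_model(hf_config, mesh_device, max_batch_size):
    """Look up and construct the TT model for the requested architecture."""
    model_cls = _TT_MODEL_REGISTRY[hf_config.architectures[0]]
    return model_cls.initialize_vllm_model(hf_config, mesh_device, max_batch_size)
```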
Force-pushed from 69b787b to d141ec6.
Please let me know if my review commit diff introduced anything wrong.
I've changed links to tt-metal to point to internal files instead of web links.
I'm not opposed to the basic-level explanations in these sections. They are short, so they don't detract much compared to other sections of the doc. When we have the full doc ready, we should give it a fresh read and update it if that still makes sense.
One other thing here is that you link a lot to other source code. While I agree that this makes sense and keeps the section less busy, the other sections I've reviewed so far opt for more short snippets in the doc. I think both work, so I'm leaving it like this.
tech_reports/LLMs/llms.md
Outdated
- position ids: the position of the tokens in the sequence
- KV cache: an inference optimization that caches intermediate values
In the model, tokens are embedded from the vocabulary space to the embedding space. Position ids are necessary for updating the KV cache and for positional embeddings like RoPE [TODO: Refer to the RoPE section].
After the RoPE section is in, we need to update this link.
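A minimal sketch of what these inputs look like in plain PyTorch, with names and sizes chosen for illustration rather than taken from the tt-metal implementation:

```python
# Minimal PyTorch sketch of the model inputs described above; names and sizes
# are illustrative, not the tt-metal implementation.
import torch

vocab_size, hidden_dim = 32_000, 128
embedding = torch.nn.Embedding(vocab_size, hidden_dim)

# Prefill: the whole prompt at once, position ids 0..seq_len-1.
prompt_tokens = torch.randint(0, vocab_size, (1, 7))            # (batch, seq_len)
prompt_pos = torch.arange(prompt_tokens.shape[1]).unsqueeze(0)  # (batch, seq_len)
prompt_embeds = embedding(prompt_tokens)                        # (batch, seq_len, hidden_dim)

# Decode: one new token per user. Its position id determines which slot of the
# KV cache is updated and which rotation RoPE applies to its query/key.
next_token = torch.randint(0, vocab_size, (1, 1))
next_pos = torch.tensor([[prompt_tokens.shape[1]]])  # position 7 for this user
next_embeds = embedding(next_token)
```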
Force-pushed from c84b9b9 to e43bd86.