Add rules for Llama3.1-405B #304

Merged 1 commit on Dec 18, 2024
17 changes: 15 additions & 2 deletions inference_rules.adoc
@@ -182,6 +182,7 @@ Each sample has the following definition:
|SDXL |A pair of positive and negative prompts
|Llama2 |one sequence
|Mixtral-8x7B |one sequence
|Llama3.1-405B |one sequence
|===

== Benchmarks
@@ -256,6 +257,7 @@ The Datacenter suite includes the following benchmarks:
|Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A
|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s
|Language |Question Answering |Llama2 |OpenOrca (max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Language |Text Generation |Llama3.1-405B |Subset of LongBench, LongDataCollections, Ruler, GovReport | 8313 | 99% of FP16 ((GovReport + LongDataCollections + 65 Samples from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335). Additionally, for both cases tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=684.68)| TTFT/TPOTfootnote:[For Llama3.1-405B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 6000 ms/175 ms
|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | 15000 | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=144.84)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
|Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms
|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s
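
The TTFT/TPOT footnotes in the rows above define the two latency metrics collected for the LLM benchmarks. A minimal, non-normative sketch of how they could be computed from per-token completion timestamps; the variable names and the per-query check are illustrative only (the rules apply the latency targets at the scenario level, not per query):

[source,python]
----
# Hedged sketch: compute TTFT and TPOT for one query from per-token timestamps.
# `issue_time` is when the query was issued; `token_times[i]` is when token i completed.
# These names are illustrative and not defined by the MLPerf rules.

def ttft_and_tpot(issue_time: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - issue_time  # time to first token
    if len(token_times) > 1:
        # average interval between consecutive generated tokens
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# Example: compare one query's latencies against the Llama3.1-405B targets of
# 6000 ms / 175 ms (purely illustrative; timestamps here are made up).
ttft, tpot = ttft_and_tpot(0.0, [4.20, 4.35, 4.52, 4.70])
assert ttft <= 6.0 and tpot <= 0.175
----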
@@ -350,6 +352,9 @@ For each of the following benchmarks it is necessary to use the following infere
|Summarization (GPT-J) |max_new_tokens |128 | Maximum number of new tokens to generate
|Summarization (GPT-J) |early_stopping |True | Use the EOS token to stop generating tokens
|Summarization (Llama2) |max_new_tokens |1024 | Maximum number of new tokens to generate
|Text Generation (Llama3.1-405B) |min_new_tokens |2 | Minimum number of new tokens to generate
|Text Generation (Llama3.1-405B) |max_new_tokens |20000 | Maximum number of new tokens to generate
|Summarization (Mixtral-8x7B) |min_new_tokens |2 | Minimum number of new tokens to generate
|Text Generation (Mixtral-8x7B) |max_new_tokens |1024 | Maximum number of new tokens to generate
|===
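
As a non-normative illustration, the min/max token constraints in the table above map directly onto the HuggingFace `transformers` generation API. A minimal sketch for the Llama3.1-405B settings; the checkpoint path is a placeholder, and greedy decoding is assumed per the decoding FAQ later in this document:

[source,python]
----
# Illustrative only: applying the Llama3.1-405B inference parameters from the table
# above with the HuggingFace transformers generate() API. The model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-3.1-405b-instruct"  # placeholder, not specified by the rules
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Summarize the following report: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    min_new_tokens=2,      # minimum number of new tokens to generate
    max_new_tokens=20000,  # maximum number of new tokens to generate
    do_sample=False,       # greedy search (see the decoding FAQ below)
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
----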

@@ -541,6 +546,10 @@ This rule applies both for the QSL pre-processing and for post-processing functi
No compression allowed.
|Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation).

No compression allowed.
|Language | Text Generation | Llama3.1-405B | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation).

No compression allowed.
|Language | Text Generation | Mixtral-8x7B | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation).

No compression allowed.
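
The rows above allow the loaded samples to be either the full set of tensors or just the token IDs. A minimal sketch, assuming a HuggingFace tokenizer, of producing the three tensors named here; the checkpoint name and sequence length are placeholders:

[source,python]
----
# Illustrative sketch: deriving Token IDs, Input Masks and Input Lengths from raw text
# with a HuggingFace tokenizer. Checkpoint name and max_length are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")  # placeholder
texts = ["What is the capital of France?", "Summarize: ..."]

enc = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
token_ids = enc["input_ids"]             # Token IDs
input_masks = enc["attention_mask"]      # Input Masks
input_lengths = input_masks.sum(dim=1)   # Input Lengths (non-padding tokens per sample)

# Per the table above, a SUT may load only `token_ids` and regenerate the masks and
# lengths itself, but that regeneration is then a timed operation.
----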
@@ -595,7 +604,7 @@ As input, before preprocessing:

* all imaging benchmarks take uncropped uncompressed bitmap

* BERT, GPT-J, Llama2 and Mixtral-8x7B take texts
* BERT, GPT-J, Llama2, Llama3.1-405B and Mixtral-8x7B take texts

* RNN-T takes a waveform

@@ -865,7 +874,7 @@ The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive

Q: What algorithm is used for the auto-regressive decoding loop?

A: The algorithms used by the benchmarks (greedy search and beam search) are described at a high level here: https://huggingface.co/blog/how-to-generate. Specifically, GPT-J uses a beam width of 4 and enable early termination, while Llama2 uses greedy search.
A: The algorithms used by the benchmarks (greedy search and beam search) are described at a high level here: https://huggingface.co/blog/how-to-generate. Specifically, GPT-J uses a beam width of 4 and enables early termination, while Llama2, Llama3.1-405B and Mixtral-8x7B use greedy search.
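
To make this answer concrete, a hedged sketch of the two decoding configurations expressed as HuggingFace `generate()` arguments; `model` and `inputs` are assumed to be an already-loaded causal LM and a tokenized batch, and this is not the reference implementation:

[source,python]
----
# Illustrative only: the two decoding strategies named in the answer above.

# GPT-J: beam search with a beam width of 4 and early termination.
gptj_out = model.generate(
    **inputs,
    num_beams=4,
    early_stopping=True,
    max_new_tokens=128,   # per the GPT-J inference parameters table
)

# Llama2, Llama3.1-405B and Mixtral-8x7B: greedy search (argmax token each step).
greedy_out = model.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    max_new_tokens=1024,  # differs per benchmark; see the inference parameters table
)
----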

Q: MLPerf disallows caching queries. Is using a KV-cache in decoding allowed?

@@ -1029,6 +1038,7 @@ Datacenter systems must provide at least the following bandwidths from the netwo
|Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__
|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Language |Llama2 |OpenOrca (max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
|Language |Llama3.1-405B | Subset of LongBench, LongDataCollections, Ruler, GovReport | __num_inputs*max_seq_len*dtype_size__ | __20000*dtype_size__ | __throughput*20000*dtype_size__
|Language |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__
|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__
@@ -1044,6 +1054,9 @@ Datacenter systems must provide at least the following bandwidths from the outpu
|Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__
|Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__
|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__
|Language |Llama2 |OpenOrca (max_seq_len=1024) | __max_output_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
|Language |Llama3.1-405B |Subset of LongBench, LongDataCollections, Ruler, GovReport | __max_output_len*dtype_size__ | __20000*dtype_size__ | __throughput*20000*dtype_size__
|Language |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | __max_output_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
|Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__
|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __3,145,728*dtype_size__ | __throughput*3,145,728*dtype_size__ | __> 0__
|===
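
To illustrate how the bandwidth formulas in the two tables above are applied, a small worked example for the Llama3.1-405B rows; the throughput and dtype_size values are made up for the arithmetic and are not requirements:

[source,python]
----
# Illustrative arithmetic only: plugging assumed numbers into the Llama3.1-405B
# bandwidth formulas. throughput and dtype_size are hypothetical values.

throughput = 10.0   # samples/s, assumed for this example
dtype_size = 4      # bytes per element, e.g. int32 token IDs (assumption)
seq_len = 20000     # sequence length used in the Llama3.1-405B rows above

min_network_bw = throughput * seq_len * dtype_size  # bytes/s, input-side formula
min_output_bw = throughput * seq_len * dtype_size   # bytes/s, output-side formula

print(f"minimum network-to-SUT bandwidth: {min_network_bw / 1e6:.1f} MB/s")
print(f"minimum output bandwidth:         {min_output_bw / 1e6:.1f} MB/s")
----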