From 6a3f1d374604ef67300262fd9373c6b1761b48f9 Mon Sep 17 00:00:00 2001 From: itayhubara Date: Tue, 24 Oct 2023 19:15:57 +0300 Subject: [PATCH 01/16] updating audit rules to ensure fair voting --- inference_rules.adoc | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/inference_rules.adoc b/inference_rules.adoc index eac86e9..beb6dac 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -106,6 +106,10 @@ In each round, up to two submissions will be audited: one at random from all sub The process of random selection is in two stages: first a submitter is randomly chosen from all submitters with auditable submissions, then one of those submissions is randomly chosen. A submission is not a candidate for the randomly chosen audit if the system is equivalent to a system audited in the previous round. For the purposes of this rule, equivalent systems have the same CPU, NIC, accelerator, and accelerator count, with the same configuration of those components as per the system configuration JSON. For LoadGen Over Network submission the Networking must be the same. The review committee may determine that additional systems are equivalent to those audited in a previous round and exempt them from random audit. As a guidance for this exemption, if an accelerator is audited in one of the previous rounds, then the systems using the same accelerator can be excluded from random audit, if the aggregate system performance and the performance per accelerator are not more than 10% from those submitted during last audit time. For systems with power metrics, in addition to the performance, power efficiency must also be within 10% from the last audit time to be eligible for an exclusion from random audit. If any new result like a new model, an additional non-inferred scenario measurement or a new power measurement is submitted from the last audit time, then the exclusion is not applicable unless the review committee decides otherwise. +To ensure equity, if a submitter undergoes consecutive audits spanning two or more rounds, the committee must compose a brief explanation outlining the discrepancies between the current submission and the prior submission outcomes. If the auditee perceives this as unjust, they retain the option to submit an appeal to MLCommons board. + +In addtion, if a submitter receive their code base from another submitter and run on a similar hardware, they can nominate systems for an audit but are ineligible to participate in ranked choice voting. + During the review process, a github issue shall be opened where submitters can nominate systems for audit. Each nomination shall contain a reason, such as new HW or SW, unusual or interesting features, performance outside of expectations, etc. Review committee chairs evaluate the nominations and compile a list of systems at the end of the review period. Any systems with new accelerators are added to the list by the chairs if not nominated. The review committee will select a submission for audit by ranked choice voting using a simple majority. An option "No Selected Audit This Round" may be added if requested by a majority of the review committee. An auditor shall be chosen by the review committee who has no conflict of interest with the submitter. The process of auditor selection will take no more than 28 days from selection of the submitter. 
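[Editorial example] The ranked choice vote described in the patch above can be tallied in several ways; the sketch below shows one common form, instant-runoff with a simple-majority stop condition, purely as an illustration. The ballot format, candidate names, and the choice of instant-runoff are assumptions, not something the rules text mandates.

```python
# Minimal instant-runoff sketch for the ranked choice vote described above.
# Assumes each ballot ranks nominated systems (and optionally
# "No Selected Audit This Round") from most to least preferred.
from collections import Counter

def ranked_choice_winner(ballots):
    remaining = {candidate for ballot in ballots for candidate in ballot}
    while remaining:
        # Count each ballot toward its highest-ranked remaining candidate.
        firsts = Counter()
        for ballot in ballots:
            for candidate in ballot:
                if candidate in remaining:
                    firsts[candidate] += 1
                    break
        leader, votes = firsts.most_common(1)[0]
        if 2 * votes > sum(firsts.values()):            # simple majority reached
            return leader
        remaining.discard(min(firsts, key=firsts.get))  # drop the weakest candidate
    return None

ballots = [["sys_a", "sys_b"], ["sys_b", "sys_a"], ["sys_b", "no_audit"],
           ["sys_a", "sys_b"], ["no_audit", "sys_a"]]
print(ranked_choice_winner(ballots))  # "sys_a", once "no_audit" is eliminated
```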
From 5a118e527bc263a5f96c28eeadfa44f0c93f956f Mon Sep 17 00:00:00 2001 From: itayhubara Date: Tue, 14 Nov 2023 19:18:32 +0200 Subject: [PATCH 02/16] update per WG decision --- inference_rules.adoc | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index beb6dac..c8a4a31 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -106,9 +106,7 @@ In each round, up to two submissions will be audited: one at random from all sub The process of random selection is in two stages: first a submitter is randomly chosen from all submitters with auditable submissions, then one of those submissions is randomly chosen. A submission is not a candidate for the randomly chosen audit if the system is equivalent to a system audited in the previous round. For the purposes of this rule, equivalent systems have the same CPU, NIC, accelerator, and accelerator count, with the same configuration of those components as per the system configuration JSON. For LoadGen Over Network submission the Networking must be the same. The review committee may determine that additional systems are equivalent to those audited in a previous round and exempt them from random audit. As a guidance for this exemption, if an accelerator is audited in one of the previous rounds, then the systems using the same accelerator can be excluded from random audit, if the aggregate system performance and the performance per accelerator are not more than 10% from those submitted during last audit time. For systems with power metrics, in addition to the performance, power efficiency must also be within 10% from the last audit time to be eligible for an exclusion from random audit. If any new result like a new model, an additional non-inferred scenario measurement or a new power measurement is submitted from the last audit time, then the exclusion is not applicable unless the review committee decides otherwise. -To ensure equity, if a submitter undergoes consecutive audits spanning two or more rounds, the committee must compose a brief explanation outlining the discrepancies between the current submission and the prior submission outcomes. If the auditee perceives this as unjust, they retain the option to submit an appeal to MLCommons board. - -In addtion, if a submitter receive their code base from another submitter and run on a similar hardware, they can nominate systems for an audit but are ineligible to participate in ranked choice voting. +If a submitter undergoes consecutive audits spanning two or more rounds and finds it unfair, they can appeal to the MLCommons board to ensure fairness During the review process, a github issue shall be opened where submitters can nominate systems for audit. Each nomination shall contain a reason, such as new HW or SW, unusual or interesting features, performance outside of expectations, etc. Review committee chairs evaluate the nominations and compile a list of systems at the end of the review period. Any systems with new accelerators are added to the list by the chairs if not nominated. The review committee will select a submission for audit by ranked choice voting using a simple majority. An option "No Selected Audit This Round" may be added if requested by a majority of the review committee. 
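[Editorial example] The two-stage random selection described above, first a submitter and then one of that submitter's auditable submissions, can be sketched as follows. The data layout and names are illustrative assumptions only; the rules do not prescribe a particular tool.

```python
# Sketch of the two-stage random audit selection described above. Assumes the
# list of auditable submissions has already had systems equivalent to last
# round's audited systems removed; names and data layout are illustrative.
import random

def select_random_audit(auditable_submissions, rng=None):
    """auditable_submissions: dict mapping submitter -> list of auditable submissions."""
    rng = rng or random.SystemRandom()
    candidates = {s: subs for s, subs in auditable_submissions.items() if subs}
    if not candidates:
        return None
    submitter = rng.choice(sorted(candidates))       # stage 1: pick a submitter
    submission = rng.choice(candidates[submitter])   # stage 2: pick one of its submissions
    return submitter, submission

print(select_random_audit({
    "submitter_a": ["system_1", "system_2"],
    "submitter_b": ["system_3"],
}))
```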
From 7ab16cd413a1c41c2f5fe82dc684563b338e594b Mon Sep 17 00:00:00 2001 From: itayhubara Date: Tue, 21 Nov 2023 19:56:30 +0200 Subject: [PATCH 03/16] fix wording per WG decision --- inference_rules.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index c8a4a31..33556fc 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -106,7 +106,7 @@ In each round, up to two submissions will be audited: one at random from all sub The process of random selection is in two stages: first a submitter is randomly chosen from all submitters with auditable submissions, then one of those submissions is randomly chosen. A submission is not a candidate for the randomly chosen audit if the system is equivalent to a system audited in the previous round. For the purposes of this rule, equivalent systems have the same CPU, NIC, accelerator, and accelerator count, with the same configuration of those components as per the system configuration JSON. For LoadGen Over Network submission the Networking must be the same. The review committee may determine that additional systems are equivalent to those audited in a previous round and exempt them from random audit. As a guidance for this exemption, if an accelerator is audited in one of the previous rounds, then the systems using the same accelerator can be excluded from random audit, if the aggregate system performance and the performance per accelerator are not more than 10% from those submitted during last audit time. For systems with power metrics, in addition to the performance, power efficiency must also be within 10% from the last audit time to be eligible for an exclusion from random audit. If any new result like a new model, an additional non-inferred scenario measurement or a new power measurement is submitted from the last audit time, then the exclusion is not applicable unless the review committee decides otherwise. -If a submitter undergoes consecutive audits spanning two or more rounds and finds it unfair, they can appeal to the MLCommons board to ensure fairness +If a submitter chosen for an audit finds it unfair, they can appeal to the MLCommons Executive Director to ensure fairness. During the review process, a github issue shall be opened where submitters can nominate systems for audit. Each nomination shall contain a reason, such as new HW or SW, unusual or interesting features, performance outside of expectations, etc. Review committee chairs evaluate the nominations and compile a list of systems at the end of the review period. Any systems with new accelerators are added to the list by the chairs if not nominated. The review committee will select a submission for audit by ranked choice voting using a simple majority. An option "No Selected Audit This Round" may be added if requested by a majority of the review committee. 
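[Editorial example] As a companion to the exemption guidance above, the sketch below encodes the 10% checks (aggregate system performance, performance per accelerator, and, for power submissions, power efficiency) together with the new-result disqualifier. The field names are assumptions, and the final call always rests with the review committee.

```python
# Illustrative check of the random-audit exemption guidance above. Field names
# are assumptions; the review committee can override the outcome either way.
def within_10_percent(current, at_last_audit):
    return abs(current - at_last_audit) <= 0.10 * at_last_audit

def eligible_for_exemption(current, at_last_audit, has_new_results, has_power_metrics):
    """current / at_last_audit: dicts with 'system_perf', 'perf_per_accelerator'
    and, for power submissions, 'power_efficiency' for the same accelerator."""
    if has_new_results:  # new model, new non-inferred scenario, or new power measurement
        return False     # ...unless the review committee decides otherwise
    keys = ["system_perf", "perf_per_accelerator"]
    if has_power_metrics:
        keys.append("power_efficiency")
    return all(within_10_percent(current[k], at_last_audit[k]) for k in keys)

print(eligible_for_exemption(
    {"system_perf": 10500.0, "perf_per_accelerator": 1312.5},
    {"system_perf": 10000.0, "perf_per_accelerator": 1250.0},
    has_new_results=False, has_power_metrics=False,
))  # True: both metrics are within 10% of the last audited values
```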
From 5b9b5a8c458ab73b7de05e9f055d407b36582268 Mon Sep 17 00:00:00 2001 From: Zhihan Date: Thu, 4 Jan 2024 13:29:41 -0800 Subject: [PATCH 04/16] Add and revise rules related to llama2 --- inference_rules.adoc | 32 +++++++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index a86dec7..6c73773 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -176,6 +176,7 @@ Each sample has the following definition: |BERT |one sequence |DLRMv2 |up to 700 user-item pairs (more details in FAQ) |GPT-J |one sequence +|Llama2 |one sequence |=== == Benchmarks @@ -251,6 +252,7 @@ The Datacenter suite includes the following benchmarks: |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms for conversationalfootnote:llamalatency[For Llama2, there are 2 latency contraints: conversational and near real-time. 
The user can choose either (or both) of the contraints, and report the achieved performance number.]; 500 ms/50 ms for near real-timefootnote:llamalatency[] |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |=== @@ -263,6 +265,8 @@ Each Datacenter benchmark *requires* the following scenarios: |Vision |Medical image segmentation |Offline |Speech |Speech-to-text |Server, Offline |Language |Language processing |Server, Offline +|Language |Summarization |Server, Offline +|Language |Question Answering |Server, Offline |Commerce |Recommendation |Server, Offline |=== @@ -287,6 +291,7 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an |Vision |Medical image segmentation |Single Stream, Offline |Speech |Speech-to-text |Single Stream, Offline |Language |Language processing |Single Stream, Offline +|Language |Summarization |Single Stream, Offline |=== @@ -340,6 +345,7 @@ For each of the following benchmarks it is necessary to use the following infere |Summarization (GPT-J) |min_new_tokens |30 | Minimun number of new tokens to generate |Summarization (GPT-J) |max_new_tokens |128 | Maximum number of new tokens to generate |Summarization (GPT-J) |early_stopping |True | Use the EOS token to stop generating tokens +|Summarization (Llama2) |max_new_tokens |1024 | Maximum number of new tokens to generate |=== == Load Generator @@ -526,7 +532,10 @@ This rule applies both for the QSL pre-processing and for post-processing functi |Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). 1) No compression 2) Lossless compression -|Language | Language processing | GPT-J | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). +|Language | Summarization | GPT-J | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). + +No compression allowed. +|Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). No compression allowed. |Commerce | Recommendation | DLRMv2 | QDL sends query (Batch of samples). @@ -578,7 +587,7 @@ As input, before preprocessing: * all imaging benchmarks take uncropped uncompressed bitmap -* BERT takes text +* BERT, GPT-J, Llama2 take texts * RNN-T takes a waveform @@ -600,6 +609,8 @@ untimed. However, it must be pre-approved and added to the following list: * May convert data among numerical formats +* May convert to token ids from texts using the reference tokenizer + Any other pre- and post-processing time is included in the wall-clock time for a run result. @@ -619,7 +630,7 @@ task. Retraining is allowed. === Weight Definition and Quantization -CLOSED: MLPerf will provide trained weights and biases in fp32 format for both +CLOSED: MLPerf will provide trained weights and biases in fp16/fp32 format for both the reference and alternative implementations. MLPerf will provide a calibration data set for all models. @@ -740,6 +751,8 @@ The following techniques are disallowed: * Techniques that only improve performance when there are identical samples in a query. For example, sorting samples in SSD. 
+* Speculative decoding for auto-generative language models (i.e. using a smaller model to predict the next token for the reference model). + == FAQ Q: Do I have to use the reference implementation framework? @@ -844,7 +857,7 @@ The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive Q: What algorithm is used for the auto-regressive decoding loop? -A: The benchmark uses the beam search algorithm described at a high level here: https://huggingface.co/blog/how-to-generate#beam-search. Specifically, we use a beam width of 4 and enable early termination. +A: The algorithms used by the benchmarks (greedy search and beam search) are described at a high level here: https://huggingface.co/blog/how-to-generate. Specifically, GPT-J uses a beam width of 4 and enable early termination, while Llama2 uses greedy search. Q: MLPerf disallows caching queries. Is using a KV-cache in decoding allowed? @@ -854,6 +867,10 @@ Q: Is it allowed to not use a KV-cache or use it partially? A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cache, but if you do, your implementation must adhere to the reference implementation. If you do not use a KV-cache, the corresponding values must be rematerialized during the decoding process. +Q: Is it allowed to store continuous keys and values in non-contiguous memory space for the KV-cache, i.e. PagedAttention? + +A: Yes, it is allowed as long as the KV-cache block is reused only within the batch of queries. PagedAttention is expliained at a high level here: https://blog.vllm.ai/2023/06/20/vllm.html. + Q: How does quantization and pruning apply to the KV-cache? A: The entries of the KV-cache should be handled in the same way as the activations of a forward pass. They can be quantized according to the quantization rules. However, according to the model equivalence rules, they cannot be pruned (or sparsified). It should be noted that pruning is different from not using a KV-cache (or caching only some entries while rematerializing others); pruning alters the computation and the model's predictions. @@ -862,6 +879,10 @@ Q: How does query batching affect the KV-cache usage? A: The size of the KV-cache is determined by the batch size. The KV-cache size can also be cached across queries, in accordance with the rule of allowing caching of sizes and shapes. Other than batching and quantization rules (that apply to activations), alternative attention mechanisms (such as paged, multi-query, sparse, group query attention, etc.) or wholesale replacement of the reference KV-cache execution are not permitted. +Q: Is it allowed to apply continuous batching (or dynamic batching) for auto-generative benchmarks? + +A: Yes. Continuous batching is explained at a high level here: https://www.anyscale.com/blog/continuous-batching-llm-inference. + === Audit Q: What characteristics of my submission will make it more likely to be audited? @@ -999,7 +1020,8 @@ Datacenter systems must provide at least the following bandwidths from the netwo |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[The average image size above is the average image size of the inference cases specified in https://github.com/mlcommons/inference/blob/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json[inference_cases.json].] 
| __32944795*dtype_size__ | __throughput*32944795*dtype_size__ |Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | __max_audio_duration*num_samples_per_sec*(bits_per_sample/8)__ | __15*16000*(16/8)__ | __throughput*480000__ |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__ -|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __3*2048*dtype_size__ | __throughput*6144*dtype_size__ +|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ +|Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ |Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__ |=== From 912233f7ec3a21e03f2219f04f6c2a53c9be26b0 Mon Sep 17 00:00:00 2001 From: Zhihan Date: Mon, 8 Jan 2024 11:19:16 -0800 Subject: [PATCH 05/16] Small fix --- inference_rules.adoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index 6c73773..ae3c952 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -252,7 +252,7 @@ The Datacenter suite includes the following benchmarks: |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s -|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms for conversationalfootnote:llamalatency[For Llama2, there are 2 latency contraints: conversational and near real-time. The user can choose either (or both) of the contraints, and report the achieved performance number.]; 500 ms/50 ms for near real-timefootnote:llamalatency[] +|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). 
Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=28124112)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms for conversationalfootnote:llamalatency[For Llama2, there are 2 latency contraints: conversational and near real-time. The user can choose either (or both) of the contraints, and report the achieved performance number.]; 500 ms/50 ms for near real-timefootnote:llamalatency[] |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |=== @@ -869,7 +869,7 @@ A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cac Q: Is it allowed to store continuous keys and values in non-contiguous memory space for the KV-cache, i.e. PagedAttention? -A: Yes, it is allowed as long as the KV-cache block is reused only within the batch of queries. PagedAttention is expliained at a high level here: https://blog.vllm.ai/2023/06/20/vllm.html. +A: Yes, it is allowed as long as the KV-cache block is reused only within the batch of queries. A high level explanation of PagedAttention can be found here: https://blog.vllm.ai/2023/06/20/vllm.html. Q: How does quantization and pruning apply to the KV-cache? From 8caac576d22f1f09084c6efaea0849c065ed1b92 Mon Sep 17 00:00:00 2001 From: Ahmad Kiswani Date: Tue, 9 Jan 2024 18:45:56 +0200 Subject: [PATCH 06/16] Added StableDiffusionXL (SDXL) benchmark to rules --- inference_rules.adoc | 96 ++++++++++++++++++++++++-------------------- 1 file changed, 52 insertions(+), 44 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index a86dec7..160a828 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -121,7 +121,7 @@ The submitter will provide the auditor an NDA within seven days of the auditor's The auditor will submit their report to the submitter no more than thirty days after executing all relevant NDAs. The submitter will make any necessary redactions due to NDAs and forward the finalized report to the review committee within seven days. The auditor will confirm the accuracy of the forwarded report. Submissions that fail the audit at a material level will be moved to open or removed, by review committee decision. -If a submission failed an audit that was delayed past publication, then any published material concerning the invalidated result is subject to the MLCommons [rules for Violation Determination, Remedies and Penalties](https://github.com/mlcommons/policies/blob/master/MLPerf_Results_Messaging_Guidelines.adoc#12-violation-determination-remedies-and-penalties) for remedial action. +If a submission failed an audit that was delayed past publication, then any published material concerning the invalidated result is subject to the MLCommons [rules for Violation Determination, Remedies and Penalties](https://github.com/mlcommons/policies/blob/master/MLPerf_Results_Messaging_Guidelines.adoc#12-violation-determination-remedies-and-penalties) for remedial action. MLCommons shall retain a library of past audit reports and send copies to MLCommons members, auditors, and potential auditors by request. Audit reports will not be further distributed without permission from the audited submitter. 
@@ -176,6 +176,7 @@ Each sample has the following definition: |BERT |one sequence |DLRMv2 |up to 700 user-item pairs (more details in FAQ) |GPT-J |one sequence +|SDXL |A pair of postive and negative prompts |=== == Benchmarks @@ -252,18 +253,20 @@ The Datacenter suite includes the following benchmarks: |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms +|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |=== Each Datacenter benchmark *requires* the following scenarios: |=== -|Area |Task |Required Scenarios +|Area |Task |Required Scenarios |Vision |Image classification |Server, Offline |Vision |Object detection |Server, Offline |Vision |Medical image segmentation |Offline |Speech |Speech-to-text |Server, Offline |Language |Language processing |Server, Offline |Commerce |Recommendation |Server, Offline +|Generative |Text to image |Server, Offline |=== The Edge suite includes the following benchmarks: @@ -276,6 +279,7 @@ The Edge suite includes the following benchmarks: |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds)| 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%) |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s +|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] |=== Each Edge benchmark *requires* the following scenarios, and sometimes permit an optional scenario: @@ -287,21 +291,22 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an |Vision |Medical image segmentation |Single Stream, Offline |Speech |Speech-to-text |Single Stream, Offline |Language |Language processing |Single Stream, Offline +|Generative |Text to image |Server, Offline |=== Edge submitters are allowed to infer a multistream result from single stream, and -an offline result from either a single stream result or a measured multistream result, +an offline result from either a single stream result or a measured multistream result, according to the following rules: - a multistream result inferred from a single stream result is 8 times the 99th percentile latency reported by loadgen. For example, if the single stream 99%th percentile latency is 25ms, the inferred multistream result is 200ms. - an offline result inferred from a multistream result is 8000 divided by the mean latency in milliseconds. For example, -if the multistream result is 200ms, the inferred offline result is 40 img/s. +if the multistream result is 200ms, the inferred offline result is 40 img/s. 
- an offline result inferred from a single stream result is 1000 divided by the mean latency in milliseconds. For example, -if the single stream result is 25ms, the inferred offline result is 40 img/s. +if the single stream result is 25ms, the inferred offline result is 40 img/s. The accuracy of an inferred result will be the same as the result from which it was inferred. When inferring a metric for the power table, the measured power used to calculate the metric is the same as for the base result @@ -317,7 +322,7 @@ replacement) from a test set. The minimum size of the performance test set for each benchmark is listed as 'QSL Size' in the table above. However, the accuracy test must be run with one copy of the MLPerf specified validation dataset. -For 3DUNet, the logical destination for the benchmark output is considered to be the network. +For 3DUNet, the logical destination for the benchmark output is considered to be the network. ==== Relaxed constraints for the Open division @@ -402,14 +407,14 @@ The execution of LoadGen is restricted as follows: network and the system is characterized as host - accelerator, then LoadGen should run on the host unless the accelerator incorporates a NIC. -* The trace generated by LoadGen must be stored in the DRAM that most faithfully simulates queries arriving - from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. - Similarly, the response provided to Loadgen must be stored in the DRAM that most faithfully simulates transfer - to the most logical destination, which is a CPU process unless otherwise specified for the benchmark. - From 4.0, submitters must provide with their submission sufficient details of the system architecture and software to +* The trace generated by LoadGen must be stored in the DRAM that most faithfully simulates queries arriving + from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. + Similarly, the response provided to Loadgen must be stored in the DRAM that most faithfully simulates transfer + to the most logical destination, which is a CPU process unless otherwise specified for the benchmark. + From 4.0, submitters must provide with their submission sufficient details of the system architecture and software to show how the I/O bandwidth utilized by each benchmark/scenario combination can be transferred between the memory where the trace is stored and the network or I/O device. Minimum bandwidths for each benchmark can be found in <>. All components mentioned in the system architecture must be present in the system during the run. A system architecture description must be provided along with the submission, which must include: - + ** Bandwidth of each NIC and total number of NIC(s) ** Description of the data path from the NIC(s) to the accelerator(s) ** Specifications or measurements indicating that the path from the NIC to the memory in which loadgen data resides can sustain the required bandwidth @@ -429,10 +434,10 @@ optionally be incrementally generated if it does not fit in memory. LoadGen validates accuracy via a separate test run that use each sample in the test library exactly once but is otherwise identical to the above normal metric run. -One LoadGen validation run is required for each submitted performance result +One LoadGen validation run is required for each submitted performance result even if two or more performance results share the same source code. 
-Note: The same code must be run for both the accuracy and performance LoadGen modes. This means the same output should be passed in QuerySampleComplete in both modes. +Note: The same code must be run for both the accuracy and performance LoadGen modes. This means the same output should be passed in QuerySampleComplete in both modes. == Divisions @@ -463,10 +468,10 @@ Non-conforming network submission should be submitted to Open category, under th * The QDL is not allowed to pad the data in queries. * The QDL is not allowed to cache queries or responses. * The QDL is implementing the network function of the LoadGen Node towards the SUT node and handles the required processing. E.G. padding of the payload as required by the network protocol. -* The QDL should reflect a single SUT to the LoadGen. LoadGen operates with a single SUT. +* The QDL should reflect a single SUT to the LoadGen. LoadGen operates with a single SUT. * The Name method's return value must contain the substring "Network SUT". * The Name method's implementation must include at least one round trip over the network. The Name method must not return until the round trip is complete. -* The QDL must query each SUT Node for its name and aggregate the responses in the Name Method. Each SUT Node must have a unique name. +* The QDL must query each SUT Node for its name and aggregate the responses in the Name Method. Each SUT Node must have a unique name. The submission must include source code for the QDL implementation above the level of the OSI session layer (RPC or equivalent), and sufficient documentation of the session layer API that a reader of that code can understand what data is being marshalled and sent over the network for each query. @@ -496,9 +501,9 @@ Fabric and protocol must be reported in the submission metadata. Submission meta * SUT parameters and configuration must be uniquely and specifically named in the submission results. * Everything outside the LoadGen node should be considered as part of the SUT, for instance for counting power and latency. As an example, components outside the nodes like a switch or load balancer should be considered part of the SUT. -* All queries must be transferred over the network, carrying the inference data, for inference execution at the SUT. All responses must be transferred back over the network, carrying the inference responses. +* All queries must be transferred over the network, carrying the inference data, for inference execution at the SUT. All responses must be transferred back over the network, carrying the inference responses. * Caching/Storing of the queries and inference data or responses for further use at the SUT is disallowed. It is allowed to cache/store other data like Neural Network weights or Neural Network executable. -* SUT can do the required pre-processing of the data, e.g. Batching, Padding, processing of the requests (precision, data layout), compression, decompression. SUT can do the required post processing functions e.g. gather, reduction or ArgMax. +* SUT can do the required pre-processing of the data, e.g. Batching, Padding, processing of the requests (precision, data layout), compression, decompression. SUT can do the required post processing functions e.g. gather, reduction or ArgMax. * The report must contain network interface characteristics for both the Loadgen and SUT systems, and every other component through which data passes between Loadgen and SUT. The information must be sufficient for reproducibility. 
* A system diagram must be included in the submission that shows how the components between the LoadGen node and the SUT nodes are connected, accompanied by any text necessary for another submitter to understand the diagram. * For "Available" submissions, for reproducibility, it is required to specify software version of all components, hardware configurations, software stacks, dockers, and settings of all components and stacks. @@ -514,7 +519,7 @@ Data formats for inputs and outputs are allowed to be compressed for network tra 1) No compression 2) Lossless compression 3) The original compression of the dataset (JPEG) |Vision | Object detection (large) | Retinanet | Allow one of the following compression options for pre-processing: -1) No compression 2) Lossless compression 3) The original compression of the dataset (For the Coco dataset JPEG, for Open Images JPEG) +1) No compression 2) Lossless compression 3) The original compression of the dataset (For the Coco dataset JPEG, for Open Images JPEG) |Vision | Medical image segmentation | 3D UNET | Allow one of the following compression options: 1) No compression 2) Lossless compression @@ -523,10 +528,10 @@ This rule applies both for the QSL pre-processing and for post-processing functi |Speech | Speech-to-text | RNNT | Allow one of the following compression options for pre-processing: 1) No compression 2) Lossless compression 3) The original compression of the dataset (FLAC) -|Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). +|Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). 1) No compression 2) Lossless compression -|Language | Language processing | GPT-J | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). +|Language | Language processing | GPT-J | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). No compression allowed. |Commerce | Recommendation | DLRMv2 | QDL sends query (Batch of samples). @@ -537,6 +542,9 @@ Allow one of the following compression options for pre-processing: Allow any lossless compression that will be suitable for production use. In Server mode allow per-Query compression. +|Generative | Text to image | SDXL | Allow one of the following compression options: + +1) No compression 2) Lossless compression |=== . Compression scheme needs pre-approval, at least two weeks before a submission deadline. @@ -550,7 +558,7 @@ including retraining. The qualified name “MLPerf Open” must be used when referring to an Open Division suite result, e.g. “a MLPerf Open result of 7.2.” https://github.com/mlperf/inference_policies/blob/master/inference_retraining_rules.adoc[Restricted retraining rules] -characterize a subset of Open division retraining possibilities that are expected to be straightforward for customers to use. +characterize a subset of Open division retraining possibilities that are expected to be straightforward for customers to use. The restrictions are optional; conformance will be indicated by a tag on the submission. 
== Data Sets @@ -697,7 +705,7 @@ Examples of allowed techniques include, but are not limited to: * Empirical performance and accuracy tuning based on the performance and accuracy set (eg. selecting batch sizes or numerics experimentally) - + * Sorting an embedding table based on frequency of access in the training set. (Submitters should include in their submission details of how the ordering was derived.) @@ -725,7 +733,7 @@ The following techniques are disallowed: * Using knowledge of the LoadGen implementation to predict upcoming lulls or spikes in the server scenario - + * Treating beams in a beam search differently. For example, employing different precision for different beams @@ -755,7 +763,7 @@ division must match what the reference is doing. Q: Can I submit a single benchmark (e.g., object detection) in a suite (e.g., data center), or do I have to submit all benchmarks? -A: You can submit any of the benchmarks that are interesting, from just one benchmark to the entire set of benchmarks. Keep in mind that submitting one benchmark typically requires running several scenarios as described in Section 4. For example, submitting object detection in the data center suite requires the server and offline scenario and submitting object detection in the edge suite requires the single stream and offline scenarios. +A: You can submit any of the benchmarks that are interesting, from just one benchmark to the entire set of benchmarks. Keep in mind that submitting one benchmark typically requires running several scenarios as described in Section 4. For example, submitting object detection in the data center suite requires the server and offline scenario and submitting object detection in the edge suite requires the single stream and offline scenarios. Q: Why does a run require so many individual inference queries? @@ -819,25 +827,25 @@ A: For all scenarios, the distribution of user-item pairs per sample is specifie Q: What is https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_trace_verification.txt[dist_trace_verification.txt]? -The benchmark provides a pre-defined quantile distribution in `./tools/dist_quantile.txt` from which the samples will be drawn using the inverse transform algorithm. This algorithm relies on randomly drawn numbers from the interval [0,1) and that depend on the `--numpy-rand-seed`, which specific value will be provided shortly before MLPerf inference submissions. +The benchmark provides a pre-defined quantile distribution in `./tools/dist_quantile.txt` from which the samples will be drawn using the inverse transform algorithm. This algorithm relies on randomly drawn numbers from the interval [0,1) and that depend on the `--numpy-rand-seed`, which specific value will be provided shortly before MLPerf inference submissions. Q: What is the rational for the distribution of user-item pairs? -In the case of DLRMv2 we have agreed that we should use multiple samples drawn from a distribution, similar to the one shown on Fig. 5: "Queries for personalized recommendation models" in the https://arxiv.org/abs/2001.02772[DeepRecSys] paper. +In the case of DLRMv2 we have agreed that we should use multiple samples drawn from a distribution, similar to the one shown on Fig. 5: "Queries for personalized recommendation models" in the https://arxiv.org/abs/2001.02772[DeepRecSys] paper. Q: Generating dlrm_trace_of_aggregated_samples.txt uses a pseudo-random number generator. 
How can submitters verify their system pseudo-random number generator is compatible? -Submitters can verify their compatibility by using the default `--numpy-rand-seed` and comparing the trace generated on their system with `./tools/dist_trace_verification.txt` using the following command -``` -./run_local.sh pytorch dlrm terabyte cpu --count-samples=100 --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=128 +Submitters can verify their compatibility by using the default `--numpy-rand-seed` and comparing the trace generated on their system with `./tools/dist_trace_verification.txt` using the following command +``` +./run_local.sh pytorch dlrm terabyte cpu --count-samples=100 --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=128 ``` Q: I understand that `--samples-to-aggregate-quantile-file=./tools/dist_quantile.txt` is the only compliant setting for MLPerf, but what are the alternative settings and what do they do? -The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive samples together into a single aggregated sample. The number of samples to be aggregated can be selected using either of the following options +The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive samples together into a single aggregated sample. The number of samples to be aggregated can be selected using either of the following options -1. fixed [`--samples-to-aggregate-fix`] -2. drawn uniformly from interval [`--samples-to-aggregate-min`, `--samples-to-aggregate-max`] +1. fixed [`--samples-to-aggregate-fix`] +2. drawn uniformly from interval [`--samples-to-aggregate-min`, `--samples-to-aggregate-max`] 3. drawn from a custom distribution, with its quantile (inverse of CDP) specified in `--samples-to-aggregate-quantile-file=./tools/dist_quantile.txt`. === LLM Benchmarks @@ -852,7 +860,7 @@ A: Using a KV-cache is allowed in the same way as it is included in the referenc Q: Is it allowed to not use a KV-cache or use it partially? -A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cache, but if you do, your implementation must adhere to the reference implementation. If you do not use a KV-cache, the corresponding values must be rematerialized during the decoding process. +A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cache, but if you do, your implementation must adhere to the reference implementation. If you do not use a KV-cache, the corresponding values must be rematerialized during the decoding process. Q: How does quantization and pruning apply to the KV-cache? @@ -883,7 +891,7 @@ A: You should expect to provide the following: The auditor may also request source code access to binary elements of the submission software. Where information or access is not provided, the auditor's report will list the issues that could not be resolved. Q: Is it expected that an audit will be concluded during the review period? -A: No. We should try to finish the audit before the publication date. +A: No. We should try to finish the audit before the publication date. [[appendix-early_stopping]] [appendix] @@ -894,9 +902,9 @@ The early stopping criterion allows for systems to process a smaller number of q === Motivating Example -Processing more queries allows us to better estimate the percentage of the time a system passes a given latency bound, p. 
However, if p is particularly high, then with fewer queries we will have a larger margin-of-error, but will still be statistically confident that it is above the required threshold. Because the benchmark threshold is what we really care about (and not closely estimating p), early stopping allows submitters to process fewer queries in such cases. +Processing more queries allows us to better estimate the percentage of the time a system passes a given latency bound, p. However, if p is particularly high, then with fewer queries we will have a larger margin-of-error, but will still be statistically confident that it is above the required threshold. Because the benchmark threshold is what we really care about (and not closely estimating p), early stopping allows submitters to process fewer queries in such cases. -Suppose we have a benchmark that requires that submissions achieve a given latency bound 90% of the time. We have system A which achieves this latency bound 99% of the time, and system B which achieves it 91% of the time. In order to have a 99% confidence interval with a margin-of-error of 0.50%, we must perform 23,886 inferences. +Suppose we have a benchmark that requires that submissions achieve a given latency bound 90% of the time. We have system A which achieves this latency bound 99% of the time, and system B which achieves it 91% of the time. In order to have a 99% confidence interval with a margin-of-error of 0.50%, we must perform 23,886 inferences. This makes sense for system B (whose underlying probability, 91%, is very close to the required benchmark percentile of 90%). However, assuming we see close to 99% of the queries passing the latency requirement for system A, we will be 99% sure that the underlying probability of success for a query on A will be within 99% 土 0.50%. This range is well above the requested latency percentile of 90%. Therefore, by performing fewer queries for such a system, we could widen the margin-of-error slightly, while still being statistically certain of being above the latency benchmark. @@ -904,30 +912,30 @@ This makes sense for system B (whose underlying probability, 91%, is very close Suppose we have a system that meets its latency requirement for each query with probability p. What are the odds that we see at least h underlatency queries and at most t overlatency queries? We can answer this by using the cumulative distribution function for the binomial distribution. -We can think of processing queries as performing n Bernoulli trials, with probability of success for any given trial (i.e., odds of being underlatency) equal to p. The probability of exactly k successes (underlatency queries) is equal to: +We can think of processing queries as performing n Bernoulli trials, with probability of success for any given trial (i.e., odds of being underlatency) equal to p. The probability of exactly k successes (underlatency queries) is equal to: f(k; n, p) = P(k successes) = (n choose k) * p^k * (1-p)^(n-k) -For fixed n and p, f(k; n, p) is called the binomial distribution with parameters n and p. +For fixed n and p, f(k; n, p) is called the binomial distribution with parameters n and p. In order to determine how unusual our distribution of latency successes and failures is given the underlying probability of passing the latency bound (p), we compute the probability that we had at most h successes, keeping the total number of queries, n, fixed. 
This, by definition, involves computing the cumulative density function for our binomial distribution, F(h; n, p): F(h; n, p) = ∑ f(k; n, p), - + with the summation going from k = h to n. -Note that, holding h and n fixed, this probability decreases as p increases. This is because, as p gets larger, the odds that our n queries produced results at least as poor as h successes and t failures decreases. In other words, it is harder to achieve a larger number of failures when the underlying probability of an individual success is higher. +Note that, holding h and n fixed, this probability decreases as p increases. This is because, as p gets larger, the odds that our n queries produced results at least as poor as h successes and t failures decreases. In other words, it is harder to achieve a larger number of failures when the underlying probability of an individual success is higher. This cumulative distribution function for the binomial distribution, F(k; n, p), can be written in terms of the regularized incomplete beta function. The (unregularized) incomplete beta function is defined as: B(x; a, b) = ∫t^(a - 1) * (1-t)^(b-1) dt, -where the integral goes from 0 to x. +where the integral goes from 0 to x. We can regularize this to attain: I(x; a, b) = B(x; a, b) / B(1; a, b). -Note that this is "regularized" in the sense that I(0; a, b) = 0, and I(1; a, b) = 1. +Note that this is "regularized" in the sense that I(0; a, b) = 0, and I(1; a, b) = 1. We have an alternate expression for F(k; n, p) in terms of this function: @@ -987,7 +995,7 @@ For our implementation, we use: [appendix] == Datacenter Bandwidth Requirements -Datacenter systems must satisfy both the ingress and egress bandwidth requirements for each benchmark. +Datacenter systems must satisfy both the ingress and egress bandwidth requirements for each benchmark. === Ingress Bandwidth Datacenter systems must provide at least the following bandwidths from the network or I/O device to the location where the trace is stored (e.g. DRAM). The minimum bandwidth is a function of the throughput achieved by the SUT and the input data types. The formulas below assume that the inputs are not pre-processed in any way (e.g. padded). If the inputs are pre-processed, and pre-processing affects the input size, submitters must adjust the formulas below accordingly. 
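[Editorial example] For reference, the binomial CDF F(h; n, p) discussed in the early stopping appendix above can be evaluated either by summing the binomial pmf directly or through the regularized incomplete beta function. The sketch below assumes SciPy and the standard identity F(h; n, p) = I_{1-p}(n - h, h + 1); the pass/fail threshold applied to this quantity is given by the appendix itself and is not restated here.

```python
# Sketch of the quantity used by the early stopping appendix above: the
# probability of seeing at most h underlatency queries out of n when each
# query independently meets the latency bound with probability p.
# Assumes SciPy; the decision threshold is defined by the appendix, not here.
from scipy import special, stats

def binomial_cdf_direct(h, n, p):
    return stats.binom.cdf(h, n, p)               # sum of f(k; n, p) for k <= h

def binomial_cdf_via_beta(h, n, p):
    # Standard identity: F(h; n, p) = I_{1-p}(n - h, h + 1)
    return special.betainc(n - h, h + 1, 1.0 - p)

n, h, p = 23886, 23500, 0.90   # example values only
print(binomial_cdf_direct(h, n, p))
print(binomial_cdf_via_beta(h, n, p))  # matches the direct sum to floating-point precision
```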
From 83391bed0156e9aa448b92d85d699ef8864b81c1 Mon Sep 17 00:00:00 2001 From: Ahmad Kiswani Date: Tue, 9 Jan 2024 19:50:10 +0200 Subject: [PATCH 07/16] Dropped SDXL-edge --- inference_rules.adoc | 1 - 1 file changed, 1 deletion(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index 160a828..141c64b 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -291,7 +291,6 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an |Vision |Medical image segmentation |Single Stream, Offline |Speech |Speech-to-text |Single Stream, Offline |Language |Language processing |Single Stream, Offline -|Generative |Text to image |Server, Offline |=== From 13494db4eee628f7983adee45517961088155eef Mon Sep 17 00:00:00 2001 From: Ahmad Kiswani Date: Tue, 9 Jan 2024 19:53:06 +0200 Subject: [PATCH 08/16] added "Single Stream" to SDXL-edge --- inference_rules.adoc | 1 + 1 file changed, 1 insertion(+) diff --git a/inference_rules.adoc b/inference_rules.adoc index 141c64b..aba1670 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -291,6 +291,7 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an |Vision |Medical image segmentation |Single Stream, Offline |Speech |Speech-to-text |Single Stream, Offline |Language |Language processing |Single Stream, Offline +|Generative |Text to image |Single Stream, Offline |=== From 0cb6038710890f022ff3a2cd522439e58ba9d1fa Mon Sep 17 00:00:00 2001 From: Ahmad Kiswani Date: Thu, 11 Jan 2024 18:23:26 +0200 Subject: [PATCH 09/16] [SDXL] changed compression rules to: No compression allowed --- inference_rules.adoc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index aba1670..ac12854 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -542,9 +542,8 @@ Allow one of the following compression options for pre-processing: Allow any lossless compression that will be suitable for production use. In Server mode allow per-Query compression. -|Generative | Text to image | SDXL | Allow one of the following compression options: +|Generative | Text to image | SDXL | No compression allowed. -1) No compression 2) Lossless compression |=== . Compression scheme needs pre-approval, at least two weeks before a submission deadline. From c3385db0de2af74a9d3b7c8a3be22acf15a4dc37 Mon Sep 17 00:00:00 2001 From: Zhihan Date: Wed, 10 Jan 2024 10:18:57 -0800 Subject: [PATCH 10/16] Remove low-latency constraints for Llama2; remove ban on paged attention; --- inference_rules.adoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index ae3c952..2052d92 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -251,8 +251,8 @@ The Datacenter suite includes the following benchmarks: |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms -|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). 
Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s -|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=28124112)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms for conversationalfootnote:llamalatency[For Llama2, there are 2 latency contraints: conversational and near real-time. The user can choose either (or both) of the contraints, and report the achieved performance number.]; 500 ms/50 ms for near real-timefootnote:llamalatency[] +|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=293.3)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |=== @@ -877,7 +877,7 @@ A: The entries of the KV-cache should be handled in the same way as the activati Q: How does query batching affect the KV-cache usage? -A: The size of the KV-cache is determined by the batch size. The KV-cache size can also be cached across queries, in accordance with the rule of allowing caching of sizes and shapes. Other than batching and quantization rules (that apply to activations), alternative attention mechanisms (such as paged, multi-query, sparse, group query attention, etc.) or wholesale replacement of the reference KV-cache execution are not permitted. +A: The size of the KV-cache is determined by the batch size. The KV-cache size can also be cached across queries, in accordance with the rule of allowing caching of sizes and shapes. Q: Is it allowed to apply continuous batching (or dynamic batching) for auto-generative benchmarks? 
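[Editorial example] The sketch below illustrates how the two Llama2 latency metrics referenced in the table above, TTFT and TPOT, can be computed from per-token completion timestamps. The function and variable names are illustrative assumptions and are not part of the LoadGen API, which records these timings itself.

```python
# Illustrative computation of the two Llama2 latency metrics referenced above.
def ttft_and_tpot(issue_time_s, token_completion_times_s):
    """issue_time_s: when the query was issued.
    token_completion_times_s: completion time of each generated token, in order."""
    if not token_completion_times_s:
        raise ValueError("no tokens were generated")
    ttft = token_completion_times_s[0] - issue_time_s       # time to first token
    if len(token_completion_times_s) == 1:
        return ttft, 0.0
    span = token_completion_times_s[-1] - token_completion_times_s[0]
    tpot = span / (len(token_completion_times_s) - 1)       # mean inter-token interval
    return ttft, tpot

# Example: first token arrives after 1.2 s, then one token every 80 ms.
times = [1.2 + 0.08 * i for i in range(100)]
ttft, tpot = ttft_and_tpot(0.0, times)
print(f"TTFT = {ttft * 1000:.0f} ms, TPOT = {tpot * 1000:.0f} ms")  # 1200 ms, 80 ms
```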
From 7cb014e33abf3d17a9ed62954704f08bc1786764 Mon Sep 17 00:00:00 2001 From: Zhihan Date: Thu, 25 Jan 2024 15:18:26 -0800 Subject: [PATCH 11/16] Update llama2 accuracy --- inference_rules.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index 76ed3de..a403543 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -253,7 +253,7 @@ The Datacenter suite includes the following benchmarks: |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s -|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=43.88, rouge2=21.7108, rougeL=28.2502). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=293.3)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms +|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |=== From 630ef8f9b6a3e9b6ef7ff21401e41d035b106ccd Mon Sep 17 00:00:00 2001 From: Yiheng Zhang Date: Tue, 6 Feb 2024 13:09:37 -0800 Subject: [PATCH 12/16] fix table format for SDXL and add SDXL bandwidth formula --- inference_rules.adoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index e8c0fdd..c041738 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -284,7 +284,7 @@ The Edge suite includes the following benchmarks: |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds)| 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%) -|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). 
Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878) |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] |=== @@ -540,10 +540,10 @@ This rule applies both for the QSL pre-processing and for post-processing functi 1) No compression 2) Lossless compression -|Language | Summarization | GPT-J | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). +|Language | Summarization | GPT-J | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). No compression allowed. -|Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). +|Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). No compression allowed. |Commerce | Recommendation | DLRMv2 | QDL sends query (Batch of samples). @@ -1033,8 +1033,8 @@ Datacenter systems must provide at least the following bandwidths from the netwo |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ |Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ |Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__ +|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__ |=== - === Egress Bandwidth Datacenter systems must provide at least the following bandwidths from the output location (e.g. DRAM) to the network or I/O device. The minimum bandwidth is a function of the throughput achieved by the SUT and the output data types. For all models except 3D Unet, the output sizes are negligible. Therefore, for those models, the egress bandwidth must simply be greater than 0. 
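The symbolic and numeric formulas in the ingress bandwidth table above reduce to a product of achieved throughput, elements per sample, and datatype size. A minimal sketch of that calculation follows; the throughput figures and the 4-byte dtype_size are placeholder assumptions for illustration only, not required values.

[source,python]
----
def min_ingress_bw_bytes_per_sec(throughput, elems_per_sample, dtype_size):
    """Minimum required ingress bandwidth: throughput * elements_per_sample * dtype_size."""
    return throughput * elems_per_sample * dtype_size

DTYPE_SIZE = 4  # assumed 4-byte input elements for this example

# GPT-J row: throughput*2048*dtype_size
gptj = min_ingress_bw_bytes_per_sec(throughput=100.0, elems_per_sample=2048, dtype_size=DTYPE_SIZE)
# SDXL row: throughput*77*dtype_size (tokenized prompts, max_prompt_len=77)
sdxl = min_ingress_bw_bytes_per_sec(throughput=2.0, elems_per_sample=77, dtype_size=DTYPE_SIZE)

print(f"GPT-J minimum ingress: {gptj:,.0f} B/s; SDXL minimum ingress: {sdxl:,.0f} B/s")
----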
@@ -1048,5 +1048,5 @@ Datacenter systems must provide at least the following bandwidths from the outpu |Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__ |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__ |Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__ +|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | negligible | negligible | __> 0__ |=== - From b04253d93958a50b5e776b82ad4dbae9665bb5fd Mon Sep 17 00:00:00 2001 From: Yiheng Zhang Date: Thu, 8 Feb 2024 10:04:00 -0800 Subject: [PATCH 13/16] Update SDXL Egress stat --- inference_rules.adoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index c041738..edbb1a6 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -1037,7 +1037,7 @@ Datacenter systems must provide at least the following bandwidths from the netwo |=== === Egress Bandwidth -Datacenter systems must provide at least the following bandwidths from the output location (e.g. DRAM) to the network or I/O device. The minimum bandwidth is a function of the throughput achieved by the SUT and the output data types. For all models except 3D Unet, the output sizes are negligible. Therefore, for those models, the egress bandwidth must simply be greater than 0. +Datacenter systems must provide at least the following bandwidths from the output location (e.g. DRAM) to the network or I/O device. The minimum bandwidth is a function of the throughput achieved by the SUT and the output data types. For all models except 3D Unet and SDXL, the output sizes are negligible. Therefore, for those models, the egress bandwidth must simply be greater than 0. |=== |Area |Model |Dataset | Symbolic input size formula | Numeric input size formula | Minimum network bandwidth (bytes/sec) @@ -1048,5 +1048,5 @@ Datacenter systems must provide at least the following bandwidths from the outpu |Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__ |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__ |Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__ -|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | negligible | negligible | __> 0__ +|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __3,145,728*dtype_size__ | __throughput*3,145,728*dtype_size__ | __> 0__ |=== From 562b5ae18e2aa9d5d5e13bd6864d3eb08179c7d3 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Mon, 24 Jun 2024 18:12:27 -0500 Subject: [PATCH 14/16] Add Mixtral rules --- inference_rules.adoc | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index edbb1a6..aa752d0 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -180,6 +180,7 @@ Each sample has the following definition: |GPT-J |one sequence |SDXL |A pair of postive and negative prompts |Llama2 |one sequence +|Mixtral-8x7B |one sequence |=== == Benchmarks @@ -256,6 +257,7 @@ The Datacenter suite includes the following benchmarks: |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). 
Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s |Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms +|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=1024), GSM8K (5k samples of the validation split, max_seq_len=1024), MBXP (5k samples of the validation split, max_seq_len=1024) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |=== @@ -353,6 +355,7 @@ For each of the following benchmarks it is necessary to use the following infere |Summarization (GPT-J) |max_new_tokens |128 | Maximum number of new tokens to generate |Summarization (GPT-J) |early_stopping |True | Use the EOS token to stop generating tokens |Summarization (Llama2) |max_new_tokens |1024 | Maximum number of new tokens to generate +|Text Generation (Mixtral-8x7B) |max_new_tokens |1024 | Maximum number of new tokens to generate |=== == Load Generator @@ -545,6 +548,8 @@ This rule applies both for the QSL pre-processing and for post-processing functi No compression allowed. |Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). +|Language | Text Generation | Mixtral-8x7B | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). + No compression allowed. |Commerce | Recommendation | DLRMv2 | QDL sends query (Batch of samples). 
@@ -597,7 +602,7 @@ As input, before preprocessing: * all imaging benchmarks take uncropped uncompressed bitmap -* BERT, GPT-J, Llama2 take texts +* BERT, GPT-J, Llama2 and Mixtral-8x7B take texts * RNN-T takes a waveform @@ -1032,6 +1037,7 @@ Datacenter systems must provide at least the following bandwidths from the netwo |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__ |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ |Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ +|Language |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=1024), GSM8K (5k samples of the validation split, max_seq_len=1024), MBXP (5k samples of the validation split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ |Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__ |Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__ |=== From 7e773c5e4bdb1a2e43cd48112c1b49da6798eb71 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Tue, 25 Jun 2024 11:41:48 -0500 Subject: [PATCH 15/16] Change Mixtral max_seq_len to 2048 --- inference_rules.adoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index aa752d0..1e18e47 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -257,7 +257,7 @@ The Datacenter suite includes the following benchmarks: |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s |Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). 
Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms -|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=1024), GSM8K (5k samples of the validation split, max_seq_len=1024), MBXP (5k samples of the validation split, max_seq_len=1024) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms +|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |=== @@ -355,7 +355,7 @@ For each of the following benchmarks it is necessary to use the following infere |Summarization (GPT-J) |max_new_tokens |128 | Maximum number of new tokens to generate |Summarization (GPT-J) |early_stopping |True | Use the EOS token to stop generating tokens |Summarization (Llama2) |max_new_tokens |1024 | Maximum number of new tokens to generate -|Text Generation (Mixtral-8x7B) |max_new_tokens |1024 | Maximum number of new tokens to generate +|Text Generation (Mixtral-8x7B) |max_new_tokens |2048 | Maximum number of new tokens to generate |=== == Load Generator @@ -1037,7 +1037,7 @@ Datacenter systems must provide at least the following bandwidths from the netwo |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__ |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ |Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ -|Language |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=1024), GSM8K (5k samples of the 
validation split, max_seq_len=1024), MBXP (5k samples of the validation split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ +|Language |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ |Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__ |Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__ |=== From c58dbd6d0447f92d45f4191b449bcc5c36480432 Mon Sep 17 00:00:00 2001 From: Pablo Gonzalez Date: Mon, 1 Jul 2024 12:12:42 -0500 Subject: [PATCH 16/16] Fix typo in Mixtral accuracy --- inference_rules.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/inference_rules.adoc b/inference_rules.adoc index 1e18e47..9318ef0 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -257,7 +257,7 @@ The Datacenter suite includes the following benchmarks: |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s |Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms -|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.16). 
Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms +|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |===
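The Mixtral-8x7B row above combines several gates: each accuracy metric must reach 99% (or 99.9%) of its FP32 reference, and the mean generated tokens per sample must land between 90% and 110% of the reference value. A minimal sketch of those checks follows; the reference numbers are copied from the table, while the helper functions and the example result are illustrative assumptions rather than the official accuracy scripts.

[source,python]
----
# Reference values copied from the Mixtral-8x7B row of the Datacenter table
MIXTRAL_FP32_REF = {
    "rouge1": 45.4911,
    "rouge2": 23.2829,
    "rougeL": 30.3615,
    "gsm8k_accuracy": 73.78,
    "mbxp_accuracy": 60.12,
}
REF_TOKENS_PER_SAMPLE = 294.45

def meets_accuracy_target(result, ratio=0.99):
    """True if every metric reaches `ratio` (99% or 99.9%) of its FP32 reference."""
    return all(result[name] >= ratio * ref for name, ref in MIXTRAL_FP32_REF.items())

def tokens_within_window(tokens_per_sample):
    """True if mean generated tokens per sample is within 90%-110% of the reference."""
    return 0.9 * REF_TOKENS_PER_SAMPLE <= tokens_per_sample <= 1.1 * REF_TOKENS_PER_SAMPLE

# Hypothetical submission result (not real data)
result = {"rouge1": 45.21, "rouge2": 23.11, "rougeL": 30.20,
          "gsm8k_accuracy": 73.50, "mbxp_accuracy": 59.90}
print(meets_accuracy_target(result), tokens_within_window(301.0))  # True True
----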