diff --git a/inference_rules.adoc b/inference_rules.adoc index 796f7d3..327cbf2 100644 --- a/inference_rules.adoc +++ b/inference_rules.adoc @@ -106,6 +106,8 @@ In each round, up to two submissions will be audited: one at random from all sub The process of random selection is in two stages: first a submitter is randomly chosen from all submitters with auditable submissions, then one of those submissions is randomly chosen. A submission is not a candidate for the randomly chosen audit if the system is equivalent to a system audited in the previous round. For the purposes of this rule, equivalent systems have the same CPU, NIC, accelerator, and accelerator count, with the same configuration of those components as per the system configuration JSON. For LoadGen Over Network submission the Networking must be the same. The review committee may determine that additional systems are equivalent to those audited in a previous round and exempt them from random audit. As a guidance for this exemption, if an accelerator is audited in one of the previous rounds, then the systems using the same accelerator can be excluded from random audit, if the aggregate system performance and the performance per accelerator are not more than 10% from those submitted during last audit time. For systems with power metrics, in addition to the performance, power efficiency must also be within 10% from the last audit time to be eligible for an exclusion from random audit. If any new result like a new model, an additional non-inferred scenario measurement or a new power measurement is submitted from the last audit time, then the exclusion is not applicable unless the review committee decides otherwise. +If a submitter chosen for an audit finds it unfair, they can appeal to the MLCommons Executive Director to ensure fairness. + During the review process, a github issue shall be opened where submitters can nominate systems for audit. Each nomination shall contain a reason, such as new HW or SW, unusual or interesting features, performance outside of expectations, etc. Review committee chairs evaluate the nominations and compile a list of systems at the end of the review period. Any systems with new accelerators are added to the list by the chairs if not nominated. The review committee will select a submission for audit by ranked choice voting using a simple majority. An option "No Selected Audit This Round" may be added if requested by a majority of the review committee. An auditor shall be chosen by the review committee who has no conflict of interest with the submitter. The process of auditor selection will take no more than 28 days from selection of the submitter. @@ -121,7 +123,7 @@ The submitter will provide the auditor an NDA within seven days of the auditor's The auditor will submit their report to the submitter no more than thirty days after executing all relevant NDAs. The submitter will make any necessary redactions due to NDAs and forward the finalized report to the review committee within seven days. The auditor will confirm the accuracy of the forwarded report. Submissions that fail the audit at a material level will be moved to open or removed, by review committee decision. 
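As a non-normative illustration of the 10% exclusion guidance above, the sketch below shows one way a review committee chair might check eligibility for exclusion from random audit. The record fields (`system_perf`, `accelerator_count`, `power_efficiency`, `has_new_results`) are hypothetical and are not defined by these rules.

```
# Hypothetical helper illustrating the 10% exclusion guidance; field names and
# record structure are assumptions, not part of the rules.
def eligible_for_random_audit_exclusion(current, last_audited):
    """Both arguments describe systems using the same (previously audited)
    accelerator, e.g. {"system_perf": 1000.0, "accelerator_count": 8,
    "power_efficiency": 5.0, "has_new_results": False}."""
    def within_10pct(new, old):
        return abs(new - old) <= 0.10 * old

    if current["has_new_results"]:     # new model, scenario, or power measurement
        return False
    if not within_10pct(current["system_perf"], last_audited["system_perf"]):
        return False
    per_acc_new = current["system_perf"] / current["accelerator_count"]
    per_acc_old = last_audited["system_perf"] / last_audited["accelerator_count"]
    if not within_10pct(per_acc_new, per_acc_old):
        return False
    # Submissions with power metrics must also be within 10% on efficiency.
    if (current.get("power_efficiency") is not None
            and last_audited.get("power_efficiency") is not None):
        if not within_10pct(current["power_efficiency"],
                            last_audited["power_efficiency"]):
            return False
    return True
```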
-If a submission failed an audit that was delayed past publication, then any published material concerning the invalidated result is subject to the MLCommons [rules for Violation Determination, Remedies and Penalties](https://github.com/mlcommons/policies/blob/master/MLPerf_Results_Messaging_Guidelines.adoc#12-violation-determination-remedies-and-penalties) for remedial action. +If a submission failed an audit that was delayed past publication, then any published material concerning the invalidated result is subject to the MLCommons [rules for Violation Determination, Remedies and Penalties](https://github.com/mlcommons/policies/blob/master/MLPerf_Results_Messaging_Guidelines.adoc#12-violation-determination-remedies-and-penalties) for remedial action. MLCommons shall retain a library of past audit reports and send copies to MLCommons members, auditors, and potential auditors by request. Audit reports will not be further distributed without permission from the audited submitter. @@ -178,6 +180,9 @@ Each sample has the following definition: |BERT |one sequence |DLRMv2 |up to 700 user-item pairs (more details in FAQ) |GPT-J |one sequence +|SDXL |a pair of positive and negative prompts +|Llama2 |one sequence +|Mixtral-8x7B |one sequence |=== == Benchmarks @@ -252,20 +257,26 @@ The Datacenter suite includes the following benchmarks: |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms -|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generated tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms +|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12).
Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms +|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s |=== Each Datacenter benchmark *requires* the following scenarios: |=== -|Area |Task |Required Scenarios +|Area |Task |Required Scenarios |Vision |Image classification |Server, Offline |Vision |Object detection |Server, Offline |Vision |Medical image segmentation |Offline |Speech |Speech-to-text |Server, Offline |Language |Language processing |Server, Offline +|Language |Summarization |Server, Offline +|Language |Question Answering |Server, Offline |Commerce |Recommendation |Server, Offline +|Generative |Text to image |Server, Offline |=== The Edge suite includes the following benchmarks: @@ -277,7 +288,8 @@ The Edge suite includes the following benchmarks: |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) |Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds)| 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) |Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%) -|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)| 20 s +|Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878) +|Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] |=== Each Edge benchmark *requires* the following scenarios, and sometimes permits an optional scenario: @@ -289,21 +301,23 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an |Vision |Medical image segmentation |Single Stream, Offline |Speech |Speech-to-text |Single Stream, Offline |Language |Language processing |Single Stream, Offline +|Generative |Text to image |Single Stream, Offline +|Language |Summarization |Single Stream, Offline |=== Edge submitters are allowed to infer a multistream result from single stream, and -an offline result from either a single stream result or a measured multistream result, +an offline result from either a single stream result or a measured multistream result, according to the following rules: - a multistream result inferred from a single stream result is 8 times the 99th percentile latency reported by loadgen. For example, if the single stream 99th percentile latency is 25ms, the inferred multistream result is 200ms. - an offline result inferred from a multistream result is 8000 divided by the mean latency in milliseconds.
For example, -if the multistream result is 200ms, the inferred offline result is 40 img/s. +if the multistream result is 200ms, the inferred offline result is 40 img/s. - an offline result inferred from a single stream result is 1000 divided by the mean latency in milliseconds. For example, -if the single stream result is 25ms, the inferred offline result is 40 img/s. +if the single stream result is 25ms, the inferred offline result is 40 img/s. The accuracy of an inferred result will be the same as the result from which it was inferred. When inferring a metric for the power table, the measured power used to calculate the metric is the same as for the base result @@ -319,7 +333,7 @@ replacement) from a test set. The minimum size of the performance test set for each benchmark is listed as 'QSL Size' in the table above. However, the accuracy test must be run with one copy of the MLPerf specified validation dataset. -For 3DUNet, the logical destination for the benchmark output is considered to be the network. +For 3DUNet, the logical destination for the benchmark output is considered to be the network. ==== Relaxed constraints for the Open division @@ -342,6 +356,8 @@ For each of the following benchmarks it is necessary to use the following infere |Summarization (GPT-J) |min_new_tokens |30 | Minimun number of new tokens to generate |Summarization (GPT-J) |max_new_tokens |128 | Maximum number of new tokens to generate |Summarization (GPT-J) |early_stopping |True | Use the EOS token to stop generating tokens +|Summarization (Llama2) |max_new_tokens |1024 | Maximum number of new tokens to generate +|Text Generation (Mixtral-8x7B) |max_new_tokens |2048 | Maximum number of new tokens to generate |=== == Load Generator @@ -404,14 +420,14 @@ The execution of LoadGen is restricted as follows: network and the system is characterized as host - accelerator, then LoadGen should run on the host unless the accelerator incorporates a NIC. -* The trace generated by LoadGen must be stored in the DRAM that most faithfully simulates queries arriving - from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. - Similarly, the response provided to Loadgen must be stored in the DRAM that most faithfully simulates transfer - to the most logical destination, which is a CPU process unless otherwise specified for the benchmark. - From 4.0, submitters must provide with their submission sufficient details of the system architecture and software to +* The trace generated by LoadGen must be stored in the DRAM that most faithfully simulates queries arriving + from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. + Similarly, the response provided to Loadgen must be stored in the DRAM that most faithfully simulates transfer + to the most logical destination, which is a CPU process unless otherwise specified for the benchmark. + From 4.0, submitters must provide with their submission sufficient details of the system architecture and software to show how the I/O bandwidth utilized by each benchmark/scenario combination can be transferred between the memory where the trace is stored and the network or I/O device. Minimum bandwidths for each benchmark can be found in <>. All components mentioned in the system architecture must be present in the system during the run. 
A system architecture description must be provided along with the submission, which must include: - + ** Bandwidth of each NIC and total number of NIC(s) ** Description of the data path from the NIC(s) to the accelerator(s) ** Specifications or measurements indicating that the path from the NIC to the memory in which loadgen data resides can sustain the required bandwidth @@ -431,10 +447,10 @@ optionally be incrementally generated if it does not fit in memory. LoadGen validates accuracy via a separate test run that use each sample in the test library exactly once but is otherwise identical to the above normal metric run. -One LoadGen validation run is required for each submitted performance result +One LoadGen validation run is required for each submitted performance result even if two or more performance results share the same source code. -Note: The same code must be run for both the accuracy and performance LoadGen modes. This means the same output should be passed in QuerySampleComplete in both modes. +Note: The same code must be run for both the accuracy and performance LoadGen modes. This means the same output should be passed in QuerySampleComplete in both modes. == Divisions @@ -465,10 +481,10 @@ Non-conforming network submission should be submitted to Open category, under th * The QDL is not allowed to pad the data in queries. * The QDL is not allowed to cache queries or responses. * The QDL is implementing the network function of the LoadGen Node towards the SUT node and handles the required processing. E.G. padding of the payload as required by the network protocol. -* The QDL should reflect a single SUT to the LoadGen. LoadGen operates with a single SUT. +* The QDL should reflect a single SUT to the LoadGen. LoadGen operates with a single SUT. * The Name method's return value must contain the substring "Network SUT". * The Name method's implementation must include at least one round trip over the network. The Name method must not return until the round trip is complete. -* The QDL must query each SUT Node for its name and aggregate the responses in the Name Method. Each SUT Node must have a unique name. +* The QDL must query each SUT Node for its name and aggregate the responses in the Name Method. Each SUT Node must have a unique name. The submission must include source code for the QDL implementation above the level of the OSI session layer (RPC or equivalent), and sufficient documentation of the session layer API that a reader of that code can understand what data is being marshalled and sent over the network for each query. @@ -498,9 +514,9 @@ Fabric and protocol must be reported in the submission metadata. Submission meta * SUT parameters and configuration must be uniquely and specifically named in the submission results. * Everything outside the LoadGen node should be considered as part of the SUT, for instance for counting power and latency. As an example, components outside the nodes like a switch or load balancer should be considered part of the SUT. -* All queries must be transferred over the network, carrying the inference data, for inference execution at the SUT. All responses must be transferred back over the network, carrying the inference responses. +* All queries must be transferred over the network, carrying the inference data, for inference execution at the SUT. All responses must be transferred back over the network, carrying the inference responses. * Caching/Storing of the queries and inference data or responses for further use at the SUT is disallowed. 
It is allowed to cache/store other data like Neural Network weights or Neural Network executable. -* SUT can do the required pre-processing of the data, e.g. Batching, Padding, processing of the requests (precision, data layout), compression, decompression. SUT can do the required post processing functions e.g. gather, reduction or ArgMax. +* SUT can do the required pre-processing of the data, e.g. Batching, Padding, processing of the requests (precision, data layout), compression, decompression. SUT can do the required post processing functions e.g. gather, reduction or ArgMax. * The report must contain network interface characteristics for both the Loadgen and SUT systems, and every other component through which data passes between Loadgen and SUT. The information must be sufficient for reproducibility. * A system diagram must be included in the submission that shows how the components between the LoadGen node and the SUT nodes are connected, accompanied by any text necessary for another submitter to understand the diagram. * For "Available" submissions, for reproducibility, it is required to specify software version of all components, hardware configurations, software stacks, dockers, and settings of all components and stacks. @@ -516,7 +532,7 @@ Data formats for inputs and outputs are allowed to be compressed for network tra 1) No compression 2) Lossless compression 3) The original compression of the dataset (JPEG) |Vision | Object detection (large) | Retinanet | Allow one of the following compression options for pre-processing: -1) No compression 2) Lossless compression 3) The original compression of the dataset (For the Coco dataset JPEG, for Open Images JPEG) +1) No compression 2) Lossless compression 3) The original compression of the dataset (For the Coco dataset JPEG, for Open Images JPEG) |Vision | Medical image segmentation | 3D UNET | Allow one of the following compression options: 1) No compression 2) Lossless compression @@ -525,10 +541,16 @@ This rule applies both for the QSL pre-processing and for post-processing functi |Speech | Speech-to-text | RNNT | Allow one of the following compression options for pre-processing: 1) No compression 2) Lossless compression 3) The original compression of the dataset (FLAC) -|Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). +|Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). 1) No compression 2) Lossless compression -|Language | Language processing | GPT-J | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation). + +|Language | Summarization | GPT-J | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). + +No compression allowed. +|Language | Question Answering | Llama2 | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). + +|Language | Text Generation | Mixtral-8x7B | Input is either Token IDs, Input Masks and Input Lengths or just the Token IDs (the other tensors are generated at the SUT in a timed operation). No compression allowed. 
|Commerce | Recommendation | DLRMv2 | QDL sends query (Batch of samples). @@ -539,6 +561,8 @@ Allow one of the following compression options for pre-processing: Allow any lossless compression that will be suitable for production use. In Server mode allow per-Query compression. +|Generative | Text to image | SDXL | No compression allowed. + |=== . Compression scheme needs pre-approval, at least two weeks before a submission deadline. @@ -552,7 +576,7 @@ including retraining. The qualified name “MLPerf Open” must be used when referring to an Open Division suite result, e.g. “a MLPerf Open result of 7.2.” https://github.com/mlperf/inference_policies/blob/master/inference_retraining_rules.adoc[Restricted retraining rules] -characterize a subset of Open division retraining possibilities that are expected to be straightforward for customers to use. +characterize a subset of Open division retraining possibilities that are expected to be straightforward for customers to use. The restrictions are optional; conformance will be indicated by a tag on the submission. == Data Sets @@ -580,7 +604,7 @@ As input, before preprocessing: * all imaging benchmarks take uncropped uncompressed bitmap -* BERT takes text +* BERT, GPT-J, Llama2 and Mixtral-8x7B take texts * RNN-T takes a waveform @@ -602,6 +626,8 @@ untimed. However, it must be pre-approved and added to the following list: * May convert data among numerical formats +* May convert to token ids from texts using the reference tokenizer + Any other pre- and post-processing time is included in the wall-clock time for a run result. @@ -621,7 +647,7 @@ task. Retraining is allowed. === Weight Definition and Quantization -CLOSED: MLPerf will provide trained weights and biases in fp32 format for both +CLOSED: MLPerf will provide trained weights and biases in fp16/fp32 format for both the reference and alternative implementations. MLPerf will provide a calibration data set for all models. @@ -699,7 +725,7 @@ Examples of allowed techniques include, but are not limited to: * Empirical performance and accuracy tuning based on the performance and accuracy set (eg. selecting batch sizes or numerics experimentally) - + * Sorting an embedding table based on frequency of access in the training set. (Submitters should include in their submission details of how the ordering was derived.) @@ -727,7 +753,7 @@ The following techniques are disallowed: * Using knowledge of the LoadGen implementation to predict upcoming lulls or spikes in the server scenario - + * Treating beams in a beam search differently. For example, employing different precision for different beams @@ -742,6 +768,8 @@ The following techniques are disallowed: * Techniques that only improve performance when there are identical samples in a query. For example, sorting samples in SSD. +* Speculative decoding for auto-generative language models (i.e. using a smaller model to predict the next token for the reference model). + == FAQ Q: Do I have to use the reference implementation framework? @@ -757,7 +785,7 @@ division must match what the reference is doing. Q: Can I submit a single benchmark (e.g., object detection) in a suite (e.g., data center), or do I have to submit all benchmarks? -A: You can submit any of the benchmarks that are interesting, from just one benchmark to the entire set of benchmarks. Keep in mind that submitting one benchmark typically requires running several scenarios as described in Section 4. 
For example, submitting object detection in the data center suite requires the server and offline scenario and submitting object detection in the edge suite requires the single stream and offline scenarios. +A: You can submit any of the benchmarks that are interesting, from just one benchmark to the entire set of benchmarks. Keep in mind that submitting one benchmark typically requires running several scenarios as described in Section 4. For example, submitting object detection in the data center suite requires the server and offline scenario and submitting object detection in the edge suite requires the single stream and offline scenarios. Q: Why does a run require so many individual inference queries? @@ -821,32 +849,32 @@ A: For all scenarios, the distribution of user-item pairs per sample is specifie Q: What is https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_trace_verification.txt[dist_trace_verification.txt]? -The benchmark provides a pre-defined quantile distribution in `./tools/dist_quantile.txt` from which the samples will be drawn using the inverse transform algorithm. This algorithm relies on randomly drawn numbers from the interval [0,1) and that depend on the `--numpy-rand-seed`, which specific value will be provided shortly before MLPerf inference submissions. +The benchmark provides a pre-defined quantile distribution in `./tools/dist_quantile.txt` from which the samples will be drawn using the inverse transform algorithm. This algorithm relies on randomly drawn numbers from the interval [0,1) and that depend on the `--numpy-rand-seed`, which specific value will be provided shortly before MLPerf inference submissions. Q: What is the rational for the distribution of user-item pairs? -In the case of DLRMv2 we have agreed that we should use multiple samples drawn from a distribution, similar to the one shown on Fig. 5: "Queries for personalized recommendation models" in the https://arxiv.org/abs/2001.02772[DeepRecSys] paper. +In the case of DLRMv2 we have agreed that we should use multiple samples drawn from a distribution, similar to the one shown on Fig. 5: "Queries for personalized recommendation models" in the https://arxiv.org/abs/2001.02772[DeepRecSys] paper. Q: Generating dlrm_trace_of_aggregated_samples.txt uses a pseudo-random number generator. How can submitters verify their system pseudo-random number generator is compatible? -Submitters can verify their compatibility by using the default `--numpy-rand-seed` and comparing the trace generated on their system with `./tools/dist_trace_verification.txt` using the following command -``` -./run_local.sh pytorch dlrm terabyte cpu --count-samples=100 --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=128 +Submitters can verify their compatibility by using the default `--numpy-rand-seed` and comparing the trace generated on their system with `./tools/dist_trace_verification.txt` using the following command +``` +./run_local.sh pytorch dlrm terabyte cpu --count-samples=100 --scenario Offline --max-ind-range=40000000 --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=128 ``` Q: I understand that `--samples-to-aggregate-quantile-file=./tools/dist_quantile.txt` is the only compliant setting for MLPerf, but what are the alternative settings and what do they do? -The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive samples together into a single aggregated sample. 
The number of samples to be aggregated can be selected using either of the following options +The DLRMv2 MLPerf inference code has an option to aggregate multiple consecutive samples together into a single aggregated sample. The number of samples to be aggregated can be selected using either of the following options -1. fixed [`--samples-to-aggregate-fix`] -2. drawn uniformly from interval [`--samples-to-aggregate-min`, `--samples-to-aggregate-max`] +1. fixed [`--samples-to-aggregate-fix`] +2. drawn uniformly from interval [`--samples-to-aggregate-min`, `--samples-to-aggregate-max`] 3. drawn from a custom distribution, with its quantile (inverse of CDF) specified in `--samples-to-aggregate-quantile-file=./tools/dist_quantile.txt`. === LLM Benchmarks Q: What algorithm is used for the auto-regressive decoding loop? -A: The benchmark uses the beam search algorithm described at a high level here: https://huggingface.co/blog/how-to-generate#beam-search. Specifically, we use a beam width of 4 and enable early termination. +A: The algorithms used by the benchmarks (greedy search and beam search) are described at a high level here: https://huggingface.co/blog/how-to-generate. Specifically, GPT-J uses a beam width of 4 and enables early termination, while Llama2 uses greedy search. Q: MLPerf disallows caching queries. Is using a KV-cache in decoding allowed? @@ -854,7 +882,11 @@ A: Using a KV-cache is allowed in the same way as it is included in the referenc Q: Is it allowed to not use a KV-cache or use it partially? -A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cache, but if you do, your implementation must adhere to the reference implementation. If you do not use a KV-cache, the corresponding values must be rematerialized during the decoding process. +A: Yes, KV-cache is an optional optimization. It is not required to use a KV-cache, but if you do, your implementation must adhere to the reference implementation. If you do not use a KV-cache, the corresponding values must be rematerialized during the decoding process. + +Q: Is it allowed to store continuous keys and values in non-contiguous memory space for the KV-cache, i.e. PagedAttention? + +A: Yes, it is allowed as long as the KV-cache block is reused only within the batch of queries. A high level explanation of PagedAttention can be found here: https://blog.vllm.ai/2023/06/20/vllm.html. Q: How does quantization and pruning apply to the KV-cache? @@ -862,7 +894,11 @@ A: The entries of the KV-cache should be handled in the same way as the activati Q: How does query batching affect the KV-cache usage? -A: The size of the KV-cache is determined by the batch size. The KV-cache size can also be cached across queries, in accordance with the rule of allowing caching of sizes and shapes. Other than batching and quantization rules (that apply to activations), alternative attention mechanisms (such as paged, multi-query, sparse, group query attention, etc.) or wholesale replacement of the reference KV-cache execution are not permitted. +A: The size of the KV-cache is determined by the batch size. The KV-cache size can also be cached across queries, in accordance with the rule of allowing caching of sizes and shapes. + +Q: Is it allowed to apply continuous batching (or dynamic batching) for auto-generative benchmarks? + +A: Yes. Continuous batching is explained at a high level here: https://www.anyscale.com/blog/continuous-batching-llm-inference.
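To make the KV-cache answers above concrete, here is a minimal, non-normative sketch of a greedy decoding loop that reuses cached keys and values within a single query. The single-head toy model with random weights is an assumption for illustration only; a real submission must follow the reference implementation's cache handling as stated above.

```
# Toy, self-contained sketch of the auto-regressive decode loop with a KV-cache.
# The "model" is a random single-head attention layer, not the reference model.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 32
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W_out = rng.standard_normal((d_model, vocab))
embed = rng.standard_normal((vocab, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(prompt_ids, max_new_tokens=8):
    k_cache, v_cache = [], []          # grows by one entry per processed token
    token = None
    for t in prompt_ids:               # prefill: build the cache from the prompt
        x = embed[t]
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)
        token = t
    out = []
    for _ in range(max_new_tokens):    # decode: only the newest token is projected
        q = embed[token] @ Wq
        K, V = np.stack(k_cache), np.stack(v_cache)
        attn = softmax(q @ K.T / np.sqrt(d_model)) @ V
        token = int(np.argmax(attn @ W_out))   # greedy search: take the top logit
        out.append(token)
        k_cache.append(embed[token] @ Wk)      # reuse: append, never recompute old K/V
        v_cache.append(embed[token] @ Wv)
    return out

print(greedy_decode([1, 2, 3]))
```

Removing the two `append` calls in the decode loop and recomputing K and V over the full sequence at every step corresponds to the "rematerialized" alternative mentioned above.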
=== Audit @@ -885,7 +921,7 @@ A: You should expect to provide the following: The auditor may also request source code access to binary elements of the submission software. Where information or access is not provided, the auditor's report will list the issues that could not be resolved. Q: Is it expected that an audit will be concluded during the review period? -A: No. We should try to finish the audit before the publication date. +A: No. We should try to finish the audit before the publication date. [[appendix-early_stopping]] [appendix] @@ -896,9 +932,9 @@ The early stopping criterion allows for systems to process a smaller number of q === Motivating Example -Processing more queries allows us to better estimate the percentage of the time a system passes a given latency bound, p. However, if p is particularly high, then with fewer queries we will have a larger margin-of-error, but will still be statistically confident that it is above the required threshold. Because the benchmark threshold is what we really care about (and not closely estimating p), early stopping allows submitters to process fewer queries in such cases. +Processing more queries allows us to better estimate the percentage of the time a system passes a given latency bound, p. However, if p is particularly high, then with fewer queries we will have a larger margin-of-error, but will still be statistically confident that it is above the required threshold. Because the benchmark threshold is what we really care about (and not closely estimating p), early stopping allows submitters to process fewer queries in such cases. -Suppose we have a benchmark that requires that submissions achieve a given latency bound 90% of the time. We have system A which achieves this latency bound 99% of the time, and system B which achieves it 91% of the time. In order to have a 99% confidence interval with a margin-of-error of 0.50%, we must perform 23,886 inferences. +Suppose we have a benchmark that requires that submissions achieve a given latency bound 90% of the time. We have system A which achieves this latency bound 99% of the time, and system B which achieves it 91% of the time. In order to have a 99% confidence interval with a margin-of-error of 0.50%, we must perform 23,886 inferences. This makes sense for system B (whose underlying probability, 91%, is very close to the required benchmark percentile of 90%). However, assuming we see close to 99% of the queries passing the latency requirement for system A, we will be 99% sure that the underlying probability of success for a query on A will be within 99% 土 0.50%. This range is well above the requested latency percentile of 90%. Therefore, by performing fewer queries for such a system, we could widen the margin-of-error slightly, while still being statistically certain of being above the latency benchmark. @@ -906,30 +942,30 @@ This makes sense for system B (whose underlying probability, 91%, is very close Suppose we have a system that meets its latency requirement for each query with probability p. What are the odds that we see at least h underlatency queries and at most t overlatency queries? We can answer this by using the cumulative distribution function for the binomial distribution. -We can think of processing queries as performing n Bernoulli trials, with probability of success for any given trial (i.e., odds of being underlatency) equal to p. 
The probability of exactly k successes (underlatency queries) is equal to: +We can think of processing queries as performing n Bernoulli trials, with probability of success for any given trial (i.e., odds of being underlatency) equal to p. The probability of exactly k successes (underlatency queries) is equal to: f(k; n, p) = P(k successes) = (n choose k) * p^k * (1-p)^(n-k) -For fixed n and p, f(k; n, p) is called the binomial distribution with parameters n and p. +For fixed n and p, f(k; n, p) is called the binomial distribution with parameters n and p. In order to determine how unusual our distribution of latency successes and failures is given the underlying probability of passing the latency bound (p), we compute the probability that we had at most h successes, keeping the total number of queries, n, fixed. This, by definition, involves computing the cumulative density function for our binomial distribution, F(h; n, p): F(h; n, p) = ∑ f(k; n, p), - + with the summation going from k = h to n. -Note that, holding h and n fixed, this probability decreases as p increases. This is because, as p gets larger, the odds that our n queries produced results at least as poor as h successes and t failures decreases. In other words, it is harder to achieve a larger number of failures when the underlying probability of an individual success is higher. +Note that, holding h and n fixed, this probability decreases as p increases. This is because, as p gets larger, the odds that our n queries produced results at least as poor as h successes and t failures decreases. In other words, it is harder to achieve a larger number of failures when the underlying probability of an individual success is higher. This cumulative distribution function for the binomial distribution, F(k; n, p), can be written in terms of the regularized incomplete beta function. The (unregularized) incomplete beta function is defined as: B(x; a, b) = ∫t^(a - 1) * (1-t)^(b-1) dt, -where the integral goes from 0 to x. +where the integral goes from 0 to x. We can regularize this to attain: I(x; a, b) = B(x; a, b) / B(1; a, b). -Note that this is "regularized" in the sense that I(0; a, b) = 0, and I(1; a, b) = 1. +Note that this is "regularized" in the sense that I(0; a, b) = 0, and I(1; a, b) = 1. We have an alternate expression for F(k; n, p) in terms of this function: @@ -989,7 +1025,7 @@ For our implementation, we use: [appendix] == Datacenter Bandwidth Requirements -Datacenter systems must satisfy both the ingress and egress bandwidth requirements for each benchmark. +Datacenter systems must satisfy both the ingress and egress bandwidth requirements for each benchmark. === Ingress Bandwidth Datacenter systems must provide at least the following bandwidths from the network or I/O device to the location where the trace is stored (e.g. DRAM). The minimum bandwidth is a function of the throughput achieved by the SUT and the input data types. The formulas below assume that the inputs are not pre-processed in any way (e.g. padded). If the inputs are pre-processed, and pre-processing affects the input size, submitters must adjust the formulas below accordingly. 
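As a non-normative illustration of this requirement, the sketch below multiplies the measured throughput by the per-sample input size, using the BERT formula from the table that follows; the throughput value and dtype size are made-up examples.

```
# Minimal sketch of the ingress check described above: required bandwidth is
# (samples/sec achieved by the SUT) x (bytes per input sample). Values are
# illustrative, not measured results.
def min_ingress_bandwidth(throughput, bytes_per_sample):
    """Bytes/sec that must be sustainable from the NIC/IO device to the
    memory where the LoadGen trace is stored."""
    return throughput * bytes_per_sample

# Example: BERT (SQuAD v1.1, max_seq_len=384) with 3 input tensors, matching
# the num_inputs*max_seq_len*dtype_size formula in the table below.
dtype_size = 4                                # assume int32 ids/masks/segments
bert_sample_bytes = 3 * 384 * dtype_size
print(min_ingress_bandwidth(10_000, bert_sample_bytes))   # ~46 MB/s at 10k inf/s

# If pre-processing changes the input size (e.g. padding is removed), the
# bytes_per_sample term must be adjusted accordingly, as stated above.
```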
@@ -1001,13 +1037,15 @@ Datacenter systems must provide at least the following bandwidths from the netwo |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[The average image size above is the average image size of the inference cases specified in https://github.com/mlcommons/inference/blob/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json[inference_cases.json].] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__ |Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | __max_audio_duration*num_samples_per_sec*(bits_per_sample/8)__ | __15*16000*(16/8)__ | __throughput*480000__ |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__ -|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __3*2048*dtype_size__ | __throughput*6144*dtype_size__ +|Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ +|Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__ +|Language |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__ |Commerce |DLRMv2 | 1TB Click Logs |__avg(num_pairs_per_sample)*(num_numerical_inputs*dtype_size~1~ +num_categorical_inputs*dtype_size~2~))__footnote:[Each DLRMv2 sample consists of up to 700 user-item pairs draw from the distribution specified in https://github.com/mlcommons/inference/blob/master/recommendation/dlrm/pytorch/tools/dist_quantile.txt[dist_quantile.txt].] |__270*(13*dtype_size~1~+26*dtype_size~2~)__ | __throughput*270*(13*dtype_size~1~+26*dtype_size~2~)__ +|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __num_inputs*max_prompt_len*dtype_size__ | __77*dtype_size__ | __throughput*77*dtype_size__ |=== - === Egress Bandwidth -Datacenter systems must provide at least the following bandwidths from the output location (e.g. DRAM) to the network or I/O device. The minimum bandwidth is a function of the throughput achieved by the SUT and the output data types. For all models except 3D Unet, the output sizes are negligible. Therefore, for those models, the egress bandwidth must simply be greater than 0. +Datacenter systems must provide at least the following bandwidths from the output location (e.g. DRAM) to the network or I/O device. The minimum bandwidth is a function of the throughput achieved by the SUT and the output data types. For all models except 3D Unet and SDXL, the output sizes are negligible. Therefore, for those models, the egress bandwidth must simply be greater than 0. 
|=== |Area |Model |Dataset | Symbolic output size formula | Numeric output size formula | Minimum network bandwidth (bytes/sec) @@ -1018,5 +1056,5 @@ Datacenter systems must provide at least the following bandwidths from the outpu |Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__ |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__ |Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__ +|Generative |SDXL |Subset of coco-2014 val captions (max_prompt_len=77) | __3,145,728*dtype_size__ | __3,145,728*dtype_size__ | __throughput*3,145,728*dtype_size__ |=== -
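As a non-normative worked example of the SDXL row above (3,145,728 corresponds to a 1024x1024x3 output image), the sketch below computes the required egress bandwidth for an illustrative throughput and dtype size.

```
# Sketch of the SDXL egress requirement from the table above; the throughput
# and dtype size below are assumptions for illustration only.
def min_egress_bandwidth(throughput, output_bytes_per_sample):
    """Bytes/sec from the location of the benchmark output (e.g. DRAM) to the
    network or I/O device."""
    return throughput * output_bytes_per_sample

dtype_size = 1                                  # assume uint8 image output
sdxl_output_bytes = 3_145_728 * dtype_size      # 1024 * 1024 * 3 values
print(min_egress_bandwidth(2.0, sdxl_output_bytes))   # ~6.3 MB/s at 2 samples/s
```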