- 1. Overview
- 2. General rules
- 2.1. Strive to be fair
- 2.2. System and framework must be consistent
- 2.3. System and framework must be available
- 2.4. Benchmark implementations must be shared
- 2.5. Non-determinism is restricted
- 2.6. Benchmark detection is not allowed
- 2.7. Input-based optimization is not allowed
- 2.8. Replicability is mandatory
- 3. Scenarios
- 4. Benchmarks
- 5. Load Generator
- 6. Divisions
- 7. Data Sets
- 8. Model
- 9. FAQ
- 10. Appendix Number of Queries
Version 0.5 April 16th, 2019
Points of contact: David Kanter ([email protected]), Vijay Janapa Reddi ([email protected])
This document describes how to implement one or more benchmarks in the MLPerf Inference Suite and how to use those implementations to measure the performance of an ML system performing inference.
The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.
The following definitions are used throughout this document:
A sample is the unit on which inference is run. E.g., an image, or a sentence.
A query is a set of N samples that are issued to an inference system together. N is a positive integer. For example, a single query contains 8 images.
Quality always refers to a model’s ability to produce “correct” outputs.
A system under test consists of a defined set of hardware and software resources that will be measured for performance. The hardware resources may include processors, accelerators, memories, disks, and interconnect. The software resources may include an operating system, compilers, libraries, and drivers that significantly influence the running time of a benchmark.
A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization. The reference implementation is the canonical implementation of a benchmark. All valid submissions of a benchmark must be equivalent to the reference implementation.
A run is a complete execution of a benchmark implementation on a system, under the control of the load generator, that consists of completing a set of inference queries (including data pre- and post-processing) while meeting the latency and quality requirements of a scenario.
A run result consists of the scenario-specific metric.
The following rules apply to all benchmark implementations.
Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.
The same system and framework must be used for a suite result or set of benchmark results reported in a single context.
If you are measuring the performance of a publicly available and widely-used system or framework, you must use publicly available and widely-used versions of the system or framework.
If you are measuring the performance of an experimental framework or system, you must make the system and framework you use available upon demand for replication.
Source code used for the benchmark implementations must be open-sourced under a license that permits a commercial entity to freely use the implementation for benchmarking. The code must be available as long as the results are actively used.
The only forms of acceptable non-determinism are:
- Floating point operation order
- Random traversal of the inputs
- Rounding
All random numbers must be based on a fixed random seed and deterministic random number generator. The fixed random number generator is the Mersenne Twister 19937 generator (std::mt19937). The fixed seed will be announced a week before the benchmark submission deadline.
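The rules name the C++ engine std::mt19937; as a rough illustration, a minimal Python sketch of fixed-seed, deterministic random number generation is shown below, using NumPy's MT19937 bit generator. The seed value is a placeholder, and bit-for-bit equivalence with the C++ engine is not guaranteed.

```python
# Illustrative only: a fixed-seed Mersenne Twister 19937 generator in Python.
# NumPy's MT19937 implements the same algorithm family as std::mt19937, but
# bit-for-bit stream equivalence with the C++ engine is not guaranteed.
import numpy as np

FIXED_SEED = 12345  # placeholder; the real seed is announced before the deadline

def make_rng(seed: int = FIXED_SEED) -> np.random.Generator:
    """Return a deterministic generator so every run draws the same sequence."""
    return np.random.Generator(np.random.MT19937(seed))

# Example: a reproducible random traversal order over a sample library.
rng = make_rng()
traversal_order = rng.permutation(1024)  # identical on every run with the same seed
```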
The framework and system should not detect and behave differently for benchmarks.
The implementation should not encode any information about the content of the input dataset in any form.
In order to enable representative testing of a wide variety of inference platforms and use cases, MLPerf has defined four different scenarios as described in the table below.
| Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
|---|---|---|---|---|---|---|
| Single stream | LoadGen sends the next query as soon as the SUT completes the previous query | 1,024 queries and 60 seconds | 1 | None | 90% | 90th-percentile measured latency |
| Multiple stream | LoadGen sends a new query every latency-constraint interval if the SUT has completed the prior query; otherwise the new query is dropped and counted as one overtime query | 24,576 queries and 60 seconds | Variable, see metric | Benchmark specific | 90% | Maximum number of inferences per query supported |
| Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries for image models, 90,112 otherwise, and 60 seconds | 1 | Benchmark specific | 99% for image models, 97% otherwise | Maximum Poisson throughput parameter supported |
| Offline | LoadGen sends all queries to the SUT at the start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
The number of queries is selected to ensure sufficient statistical confidence in the reported metric; specifically, the query counts correspond to the top line of the following table. Lower lines are being evaluated for future versions of MLPerf Inference (e.g., 95% tail latency for v0.6 and 99% tail latency for v0.7).
| Tail Latency Percentile | Confidence Interval | Margin-of-Error | Inferences | Rounded Inferences |
|---|---|---|---|---|
| 90% | 99% | 0.50% | 23,886 | 3*2^13 = 24,576 |
| 95% | 99% | 0.25% | 50,425 | 7*2^13 = 57,344 |
| 97% | 99% | 0.15% | 85,811 | 11*2^13 = 90,112 |
| 99% | 99% | 0.05% | 262,742 | 33*2^13 = 270,336 |
A submission may comprise any combination of benchmark and scenario results.
The MLPerf organization provides a reference implementation of each benchmark, which includes the following elements:

Code that implements the model in a framework.

A plain text “README.md” file that describes:

- Problem
  - Dataset/Environment
  - Publication/Attribution
  - Data pre- and post-processing
  - Performance, accuracy, and calibration data sets
  - Test data traversal order
- Model
  - Publication/Attribution
  - List of layers
  - Weights and biases
- Quality and latency
  - Quality target
  - Latency target(s)
- Directions
  - Steps to configure machine
  - Steps to download and verify data
  - Steps to run and time

A “download_dataset” script that downloads the accuracy, speed, and calibration datasets.

A “verify_dataset” script that verifies the dataset against the checksum (see the sketch below).

A “run_and_time” script that executes the benchmark and reports the wall-clock time.
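As a rough illustration of what a “verify_dataset”-style script does, here is a minimal Python sketch that checks dataset files against known checksums. The file names and hash values are placeholders, not the actual dataset manifest.

```python
# Hypothetical sketch of a "verify_dataset"-style script: compare each dataset
# file against a known checksum. File names and hashes here are placeholders.
import hashlib
import sys

EXPECTED_SHA256 = {
    "val_map.txt": "0000000000000000000000000000000000000000000000000000000000000000",
    # ... one entry per dataset file
}

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main() -> int:
    bad = [p for p, want in EXPECTED_SHA256.items() if sha256_of(p) != want]
    for p in bad:
        print(f"checksum mismatch: {p}")
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main())
```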
The benchmark suite consists of the benchmarks shown in the following table. Quality and latency targets are still being finalized.
| Area | Task | Model | Dataset | Quality | Server latency constraint | Multi-Stream latency constraint |
|---|---|---|---|---|---|---|
| Vision | Image classification | ResNet50-v1.5 | ImageNet (224x224) | 99% of FP32 (76.46%) | 15 ms | 50 ms |
| Vision | Image classification | MobileNets-v1 224 | ImageNet (224x224) | 99% of FP32 (71.68%) | 10 ms | 66 ms |
| Vision | Object detection | SSD-ResNet34 | COCO (1200x1200) | 99% of FP32 (0.20 mAP) | 100 ms | 50 ms |
| Vision | Object detection | SSD-MobileNets-v1 | COCO (300x300) | 99% of FP32 (0.23 mAP) | 10 ms | 50 ms |
| Language | Machine translation | GNMT | WMT16 | 99% of FP32 (23.9 BLEU) | 250 ms | 100 ms |
The LoadGen is provided in C++ with Python bindings and must be used by all submissions. The LoadGen is responsible for:
- Generating the queries according to one of the scenarios.
- Tracking the latency of queries.
- Validating the accuracy of the results.
- Computing final metrics.
Latency is defined as the time from LoadGen passing a query to the SUT, to the time it receives a reply.
- Single-stream: LoadGen measures average latency using a single test run. For the test run, LoadGen sends an initial query, then continually sends the next query as soon as the previous query is processed.
- Multi-stream: LoadGen determines the maximum supported number of streams using multiple test runs. Each test run evaluates a specific integer number of streams. For a given number of streams, queries are generated with a number of samples per query equal to the number of streams tested. All samples in a query are allocated contiguously in memory. LoadGen uses a binary search to find a candidate value, then verifies stability by testing that value 5 times. If one run fails, it reduces the number of streams by one and tries again.
- Server: LoadGen determines the system throughput using multiple test runs. Each test run evaluates a specific throughput value in queries-per-second (QPS). For a given throughput value, queries are generated at that QPS using a Poisson distribution. LoadGen uses a binary search to find a candidate value, then verifies stability by testing that value 5 times. If one run fails, it reduces the value by a small delta and tries again (a sketch of this search loop follows this list).
- Offline: LoadGen measures throughput using a single test run. For the test run, LoadGen sends all queries at once.
The run procedure is as follows:
- LoadGen signals the system under test (SUT).
- SUT starts up and signals readiness.
- LoadGen starts the clock and begins generating queries.
- LoadGen stops generating queries as soon as the benchmark-specific minimum number of queries has been generated and the benchmark-specific minimum time has elapsed.
- LoadGen waits for all queries to complete, and errors if all queries fail to complete.
- LoadGen computes metrics for the run.
The execution of LoadGen is restricted as follows:
- LoadGen must run on the processor that most faithfully simulates queries arriving from the most logical source, which is usually the network or an I/O device such as a camera. For example, if the most logical source is the network and the system is characterized as host - accelerator, then LoadGen should run on the host unless the accelerator incorporates a NIC.
- The trace generated by LoadGen must be stored in the non-HBM DRAM that most faithfully simulates queries arriving from the most logical source, which is usually the network or an I/O device such as a camera. It may be pinned. Submitters need prior approval for anything that is not DRAM.
- Caching of any queries, any query parameters, or any intermediate results is prohibited.
LoadGen generates queries based on a trace. The trace is constructed by uniformly sampling (with replacement) from a sample library, based on a fixed random seed and a deterministic generator. The trace is usually pre-generated, but may optionally be generated incrementally if it does not fit in memory. LoadGen validates accuracy via a separate test run that uses each sample in the test library exactly once but is otherwise identical to the normal metric run described above (both modes are sketched below).
Note: For v0.5, the same code must be run for both the accuracy and performance LoadGen modes.
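As a small illustration of the two trace-construction modes just described, the sketch below contrasts a performance trace (uniform sampling with replacement) with an accuracy run that touches each library sample exactly once. Seed and sizes are placeholders.

```python
# Illustrative sketch of the two trace-construction modes described above.
import numpy as np

def performance_trace(library_size: int, trace_len: int, seed: int) -> np.ndarray:
    rng = np.random.Generator(np.random.MT19937(seed))
    # Uniform sampling WITH replacement from the loaded sample library.
    return rng.integers(0, library_size, size=trace_len)

def accuracy_trace(library_size: int, seed: int) -> np.ndarray:
    rng = np.random.Generator(np.random.MT19937(seed))
    # Every sample exactly once; the run is otherwise identical to a normal run.
    return rng.permutation(library_size)
```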
There are two divisions of the benchmark suite, the Closed division and the Open division.
The Closed division requires using pre-processing, post-processing, and a model that are equivalent to the reference or alternative implementation. The Closed division allows calibration for quantization and does not allow any retraining.
The unqualified name “MLPerf” must be used when referring to a Closed Division suite result, e.g. “an MLPerf result of 4.5.”
For each benchmark, MLPerf will provide pointers to:
- An accuracy data set, to be used to determine whether a submission meets the quality target, and used as a validation set
- A speed/performance data set that is a subset of the accuracy data set, to be used to measure performance

For each benchmark except GNMT, MLPerf will provide pointers to:
- A calibration data set, to be used for quantization (see the quantization section), that is a small subset of the training data set used to generate the weights
Each reference implementation shall include a script to verify the datasets using a checksum. The dataset must be unchanged at the start of each run.
All imaging benchmarks take uncropped, uncompressed bitmaps as input; NMT takes text.
Sample-independent pre-processing that matches the reference model is untimed (an illustrative sketch follows this list). However, it must be pre-approved and added to the following list:
- May resize to processed size (e.g., SSD-large)
- May reorder channels / do arbitrary transpositions
- May pad to arbitrary size (don’t be creative)
- May do a single, consistent crop
- Mean subtraction and normalization, provided the reference model expects them
- May convert data among any of the whitelisted numerical formats
Any other pre- and post-processing time is included in the wall-clock time for a run result.
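The sketch below illustrates the kind of sample-independent, untimed pre-processing the list above allows (resize, channel transposition, mean subtraction). The target size and per-channel mean values are illustrative placeholders; the reference implementation's own pre-processing is authoritative.

```python
# Illustrative only: sample-independent image pre-processing of the allowed kinds.
import numpy as np
from PIL import Image

MEAN = np.array([123.68, 116.78, 103.94], dtype=np.float32)  # illustrative RGB means

def preprocess(path: str, out_size: int = 224) -> np.ndarray:
    img = Image.open(path).convert("RGB")
    img = img.resize((out_size, out_size), Image.BILINEAR)  # resize to processed size
    x = np.asarray(img, dtype=np.float32)                   # HWC, RGB
    x -= MEAN                                                # mean subtraction
    return np.transpose(x, (2, 0, 1))                        # HWC -> CHW transposition
```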
CLOSED: For v0.5, MLPerf provides a reference implementation in a first framework and an alternative implementation in a second framework in accordance with the table below. The benchmark implementation must use a model that is equivalent to the reference implementation or the alternative implementation, as defined by the remainder of this section.
| Area | Task | Model | Reference implementation | Alternative implementation |
|---|---|---|---|---|
| Vision | Image classification | ResNet50-v1.5 | TensorFlow | PyTorch/ONNX |
| Vision | Image classification | MobileNets-v1 224 | TensorFlow/TensorFlow Lite | PyTorch/ONNX |
| Vision | Object detection | SSD-ResNet34 | PyTorch/ONNX | TensorFlow/TensorFlow Lite |
| Vision | Object detection | SSD-MobileNets-v1 | TensorFlow | PyTorch/ONNX |
| Language | Machine translation | GNMT | TensorFlow | PyTorch/ONNX |
OPEN: The benchmark implementation may use a different model to perform the same task. Retraining is allowed.
CLOSED: MLPerf will provide trained weights and biases in fp32 format for both the reference and alternative implementations.
MLPerf will provide a calibration data set for all models except GNMT. Submitters may perform arbitrary, purely mathematical, reproducible quantization, using only the calibration data and the weight and bias tensors from the benchmark-owner-provided model, to any combination of numerical formats on the permissive whitelist that achieves the desired quality. The quantization method must be publicly described at a level where it could be reproduced. The whitelist currently includes:
- INT8
- INT16
- UINT8
- UINT16
- FP11 (1-bit sign, 5-bit exponent, 5-bit mantissa)
- FP16
- bfloat16
- FP32
To be considered principled, the description of the quantization method must be much smaller than the non-zero weights it produces.
Calibration is allowed and must only use the calibration data set provided by the benchmark owner.
Additionally, for image classification using MobileNets-v1 224 and object detection using SSD-MobileNets-v1, MLPerf will provide a retrained INT8 model (asymmetric for TFLite and symmetric for PyTorch/ONNX). Model weights and input activations are scaled per tensor, and must preserve the same shape modulo padding. Convolution layers may be in either NCHW or NHWC format. No other retraining is allowed.
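As one hedged example of a principled, reproducible scheme of the kind the rules permit, the sketch below shows per-tensor symmetric INT8 quantization where the scale is a pure function of the tensor (its maximum absolute value), so its description is tiny compared to the weights it quantizes. Activation scales would be derived analogously from the provided calibration data set. This is one possible method, not the required one.

```python
# Hedged sketch of a simple per-tensor symmetric INT8 quantization scheme.
import numpy as np

def symmetric_int8_quantize(w: np.ndarray):
    """Quantize a weight tensor to INT8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```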
OPEN: Weights and biases must be initialized to the same values for each run; any quantization scheme that achieves the desired quality is allowed.
All implementations are allowed as long as the latency and accuracy bounds are met and the reference weights are used. Reference weights may be modified according to the quantization rules.
Examples of legal variance in implementations include, but are not limited to:
- Arbitrary frameworks and runtimes: TensorFlow, TensorFlow Lite, ONNX, PyTorch, etc., provided they conform to the rest of the rules
- Running any given control flow or operations on or off an accelerator
- Arbitrary data arrangement
- Different input and in-memory representations of weights
- Variation in matrix-multiplication or convolution algorithm, provided the algorithm produces asymptotically accurate results when evaluated with asymptotic precision
- Mathematically equivalent transformations (e.g., Tanh versus Logistic, ReluX versus ReluY, any linear transformation of an activation function)
- Approximations (e.g., replacing a transcendental function with a polynomial)
- Processing queries out-of-order within the discretion provided by the scenario
- Replacing dense operations with mathematically equivalent sparse operations
- Hand picking different numerical precisions for different operations
- Fusing or unfusing operations
- Dynamically switching between one or more batch sizes
- Different implementations based on dynamically determined batch size
- Mixture of experts combining differently quantized weights
- Stochastic quantization algorithms with seeds for reproducibility
- Reducing ImageNet classifiers with 1001 classes to 1000 classes
- Dead code elimination
For anything else you want on this list, contact the submitters five weeks prior to the submission deadline.
Examples of illegal variance in implementations include, but are not limited to:
- Wholesale weight replacement or supplements
- Discarding non-zero weight elements, including pruning
- Caching queries or responses
- Coalescing identical queries
- Modifying weights during the timed portion of an inference run (no online learning or related techniques)
- Weight quantization algorithms that are similar in size to the non-zero weights they produce
- Hard coding the total number of queries
- Techniques that boost performance for fixed-length experiments but are inapplicable to long-running services, except in the Offline scenario
- Using knowledge of the LoadGen implementation to predict upcoming lulls or spikes in the Server scenario
Q: Do I have to use the reference implementation framework?
A: No, you can use another framework provided that it matches the reference in the required areas.
Q: Do I have to use the reference implementation scripts?
A: No, you don’t have to use the reference scripts. The reference is there to settle conformance questions - with a few exceptions, a submission to the closed division must match what the reference is doing.
Q: Why does a run require so many individual inference queries?
A: The numbers were selected to be sufficiently large to statistically verify that the system meets the latency requirements.
Q: For my submission, I am going to use a different model format (e.g., ONNX vs TensorFlow Lite). Should the conversion routine/script be included in the submission? Or is it sufficient to submit the converted model?
A: The goal is reproducibility, so you should include the conversion routine/scripts.
Q: Is it permissible to exceed both the minimum number of queries and minimum time duration in a valid test run?
A: Yes.
Q: Is non-maximal suppression (NMS) timed?
A: Yes. NMS is a per image operation. NMS is used to make sure that in object detection, a particular object is identified only once. Production systems need NMS to ensure high-quality inference.
Q: Is COCO eval timed?
A: No. COCO eval compares the proposed boxes and classes in all the images against ground truth in COCO dataset. COCO eval is not possible in production.
In order to be statistically valid, a certain number of queries are necessary to verify a given latency-bound performance result. How many queries are necessary? Every query either meets the latency bound or exceeds the latency bound. The math for determining the appropriate sample size for a latency bound throughput experiment is exactly the same as determining the appropriate sample size for an electoral poll given an infinite electorate. Three variables determine the sample size: the tail latency percentage, confidence, and margin. Confidence is the probability that a latency bound is within a particular margin of the reported result.
A 99% confidence bound was somewhat arbitrarily selected. For systems with noisy latencies, it is possible to obtain better MLPerf results by cherry picking the best runs. Approximately 1 in 100 runs will be marginally better. Please don’t do this. It is very naughty and will make the MLPerf community feel sad.
The margin should be set to a value much less than the difference between the tail latency percentage and one. Conceptually, the margin ought to be small compared to the distance between the tail latency percentile and 100%. A margin of 0.5% was selected. This margin is one twentieth of the difference between the tail latency percentage and one. In the future, when the tail latency percentage rises, the margin should fall by a proportional amount. The full equation is:
Margin = (1 - TailLatency) / 20
NumQueries = NormsInv((1 - Confidence) / 2)^2 * TailLatency * (1 - TailLatency) / Margin^2
Concretely:
NumQueries = NormsInv((1 - 0.99) / 2)^2 * (0.9)(1 - 0.9) / 0.005^2 = NormsInv(0.005)^2 * 3,600 = (-2.5758)^2 * 3,600 ≈ 23,886
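The arithmetic above can be reproduced with a few lines of Python, using the standard library's inverse normal CDF in place of NormsInv; this is just a worked check of the formula, not part of any benchmark tooling.

```python
# Reproduces the query-count arithmetic above; NormalDist.inv_cdf plays the
# role of NormsInv, and the margin follows Margin = (1 - TailLatency) / 20.
from statistics import NormalDist

def num_queries(tail_latency: float, confidence: float = 0.99) -> int:
    margin = (1.0 - tail_latency) / 20.0
    z = NormalDist().inv_cdf((1.0 - confidence) / 2.0)
    return round(z**2 * tail_latency * (1.0 - tail_latency) / margin**2)

print(num_queries(0.90))  # 23886
print(num_queries(0.95))  # 50425
print(num_queries(0.97))  # 85811
print(num_queries(0.99))  # 262742
```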
To keep the numbers nice, the sample sizes are rounded up. Here is a table showing proposed sample sizes for subsequent rounds of MLPerf:
| Tail Latency Percentile | Confidence Interval | Margin-of-Error | Inferences | Rounded Inferences |
|---|---|---|---|---|
| 90% | 99% | 0.50% | 23,886 | 3*2^13 = 24,576 |
| 95% | 99% | 0.25% | 50,425 | 7*2^13 = 57,344 |
| 99% | 99% | 0.05% | 262,742 | 33*2^13 = 270,336 |
These are mostly for the Server scenario, which has tight bounds on tail latency. The other scenarios may continue to use smaller sample sizes.