- Set up AWS credentials by following the instructions in "기타 - DVC & AWS S3 설정" (Misc - DVC & AWS S3 setup).
- Install Conda: https://docs.anaconda.com/free/anaconda/install/index.html
git clone --branch v4.1-internal https://github.com/furiosa-ai/inference.git
cd inference
# (optional, for mlperf loadgen) if the GCC compiler is not installed on Ubuntu,
apt-get update && apt-get install build-essential -y
# (optional, for Stable Diffusion only) Stable Diffusion requires the Debian packages used by cv2
DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install libgl1 libglib2.0-0 -y
- Evaluation Result (v3.13.2)

benchmark | Our Result | Accuracy Target (99%) |
---|---|---|
F1 | 91.0563 (100.20%*) | 89.9653 |

* The percentage is calculated as (Our Result [Accuracy] / Reference Accuracy) * 100; for F1 here, (91.0563 / 90.874) * 100 ≈ 100.20%. This formula applies to all the cases listed below.
To reproduce the results, run the following commands:
. scripts/build_qbert_env.sh # Skip this step if the environment is already set up.
make qbert
For evaluations with different settings, modify the following environment variables (see the example below):
- SCENARIO: The MLPerf benchmark scenario. Possible values are Offline, SingleStream, MultiStream, or Server. (default: Offline)
- N_COUNT: The number of data samples to evaluate. (range: [1, 10833])
- CALIBRATE: Specifies whether to perform calibration during evaluation. (default: false)
- N_CALIB: The number of data samples to use for calibrating the quantized model. (range: [1, 100])
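For instance, a SingleStream run over a smaller subset of the data might look like the following (a minimal sketch, assuming the Makefile picks these variables up from the environment, as the defaults above suggest):

```bash
# override the scenario and sample count for a quicker qbert evaluation
SCENARIO=SingleStream N_COUNT=1000 make qbert
```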
- Evaluation Result (v3.13.2)

benchmark | Our Result | Accuracy Target (99.9%) |
---|---|---|
ROUGE1 | 43.0305 (100.10%) | 42.9435 |
ROUGE2 | 20.1437 (100.10%) | 20.1034 |
ROUGEL | 30.0211 (100.11%) | 29.9581 |
GEN_LEN | 3,978,315 (99.04%) | 3,615,191* |

* GEN_LEN is required to be at least 90% of the Reference.
To reproduce the results, run the following commands:
. scripts/build_qgpt-j_env.sh # Skip this step if the environment is already set up.
make qgpt-j
For evaluations with different settings, modify the following environment variables (see the example below):
- SCENARIO: The MLPerf benchmark scenario. Possible values are Offline, SingleStream, or Server. (default: Offline)
- N_COUNT: The number of data samples to evaluate. (range: [1, 13368])
- CALIBRATE: Specifies whether to perform calibration during evaluation. (default: false)
- N_CALIB: The number of data samples to use for calibrating the quantized model. (range: [1, 1000])
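For instance, a run that recalibrates the quantized model before evaluating might look like the following (a sketch; true is assumed to be the value that enables CALIBRATE, mirroring its false default):

```bash
# enable calibration using the full 1000-sample calibration set
CALIBRATE=true N_CALIB=1000 make qgpt-j
```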
- Evaluation Result (v3.13)

benchmark | Our Result | Accuracy Target (99.9%) |
---|---|---|
ROUGE1 | TBA | 44.3868 |
ROUGE2 | TBA | 22.0132 |
ROUGEL | TBA | 28.5876 |
TOKENS_PER_SAMPLE | TBA | 265.005 |
To reproduce the results, run the following commands:
. scripts/build_qllama2-70b_env.sh # Skip this step if the environment is already set up.
make qllama2
For evaluations with different settings, modify the following environment variables (see the example below):
- SCENARIO: The MLPerf benchmark scenario. Possible values are Offline or Server. (default: Offline)
- N_COUNT: The number of data samples to evaluate. (range: [1, 24576])
- CALIBRATE: Specifies whether to perform calibration during evaluation. (default: false)
- N_CALIB: The number of data samples to use for calibrating the quantized model. (range: [1, 1000])
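For instance, a Server-scenario run over the full dataset might look like the following (a sketch assuming the Makefile reads these variables from the environment):

```bash
# evaluate the quantized llama2-70b model under the Server scenario
SCENARIO=Server N_COUNT=24576 make qllama2
```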
End-to-end (E2E) evaluation is the process of downloading the models and datasets, building a Python environment, and performing model accuracy evaluation. The E2E scripts are based on commit f9a643c.
To run E2E evaluation:
make [model_name]
or equivalently,
bash scripts/build_[model_name]_env.sh
bash scripts/eval_[model_name].sh
- model_name includes [resnet, retinanet, 3d-unet, bert, gpt-j, rnnt, llama2, stablediffusion, all]
- For example, to run E2E ResNet evaluation:
make resnet
or
# build conda environment and download dataset
bash scripts/build_resnet_env.sh
# run evaluation on the pre-built conda environment
bash scripts/eval_resnet.sh
Some parameters are configurable, for example,
- llama2-70b

The command make llama2 is equivalent to:

export SCENARIO=Offline # SCENARIO is one of [Offline, Server]
export N_COUNT=24576 # N_COUNT is a number between [1, 24576]
export DATA_TYPE=float32 # DATA_TYPE is one of [float32, float16, bfloat16]
export DEVICE=cuda:0 # DEVICE is one of [cpu, cuda:0]
make llama2

Each environment variable above is shown with its default value, and any of them can be overridden.
Likewise,
- stable-diffusion-xl-base

export SCENARIO=Offline # SCENARIO is one of [Offline, SingleStream, MultiStream, Server]
export N_COUNT=5000 # N_COUNT is a number between [1, 5000]
export DATA_TYPE=fp32 # DATA_TYPE is one of [fp32, fp16, bf16]
export DEVICE=cuda # DEVICE is one of [cpu, cuda]
make stablediffusion
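For instance, a half-precision Stable Diffusion run only needs DATA_TYPE changed (a minimal sketch, assuming variables left unset fall back to the defaults shown above):

```bash
# half-precision run; the remaining variables keep their defaults
export DATA_TYPE=fp16
make stablediffusion
```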
In this section, we present our evaluation results (our result) alongside the MLPerf reference accuracy (reference accuracy) so that the targets we need to reach and our current status can be compared directly.
Note that all models use the PyTorch framework with the float32 data type, and all experiments were conducted in the Offline scenario.
- llama2-70b

benchmark | our result | reference accuracy |
---|---|---|
ROUGE1 | 44.4312 (100.00%*) | 44.4312 |
ROUGE2 | 22.0352 (100.00%) | 22.0352 |
ROUGEL | 28.6162 (100.00%) | 28.6162 |
TOKENS_PER_SAMPLE | 294.4** (99.85%) | 294.45 |

* round(our result / reference accuracy, 2)
** Our result 294.4 is rounded at the second decimal place; the actual value is 294.4462890625. The reference accuracy appears to be rounded at the third decimal place.
- gpt-j

rngd_gelu (RNGD optimized GELU)

benchmark | our result | reference accuracy |
---|---|---|
ROUGE1 | 42.9865 (100.00%) | 42.9865 |
ROUGE2 | 20.1036 (99.90%) | 20.1235 |
ROUGEL | 29.9737 (99.95%) | 29.9881 |
GEN_LEN | 4017766 (100.02%) | 4016878 |

gelu_new (MLPerf reference GELU)

benchmark | our result | reference accuracy |
---|---|---|
ROUGE1 | 42.9865 (100.00%) | 42.9865 |
ROUGE2 | 20.1235 (100.00%) | 20.1235 |
ROUGEL | 29.9881 (100.00%) | 29.9881 |
GEN_LEN | 4016878 (100.00%) | 4016878 |
- bert

benchmark | our result | reference accuracy |
---|---|---|
F1 | 90.87487229720105 (100.00%) | 90.874 |
- resnet

benchmark | our result | reference accuracy | accuracy (pytorch backend) |
---|---|---|---|
acc(%) | 76.144 (99.59% / 100.17%)* | 76.46 | 76.014 |

* The model frameworks differ: our result was measured with the PyTorch backend, while the reference accuracy was measured with TensorFlow. The reference accuracy of the PyTorch-backed model is given in the accuracy (pytorch backend) column.
- retinanet

benchmark | our result | reference accuracy |
---|---|---|
mAP(%) | 37.552 (100.01%) | 37.55 |
- 3d-unet

benchmark | our result | reference accuracy |
---|---|---|
DICE | 0.86173 (100.00%) | 0.86170 |
- rnnt

benchmark | our result | reference accuracy |
---|---|---|
WER(%) | 7.459018241673995 | 7.452 |
100-WER(%) | 92.5409817583 (99.99%*) | 92.548 |

* The cause of the small gap is unknown.
- dlrm-v2

benchmark | our result | reference accuracy |
---|---|---|
AUC | TBA* | 80.31 |

* 8 GPUs are needed for evaluation.
- stable-diffusion-xl-base

benchmark | our result | reference accuracy |
---|---|---|
CLIP_SCORE | 31.74663716465235 (100.19%) | 31.68631873 |
FID_SCORE | 23.431448173651063 (101.83%) | 23.01085758 |

* The gap between our result and the reference accuracy is fairly large. The published weights may differ from the ones used in this experiment, so the evaluation result needs to be updated later.
- Default settings:
- scenario: Offline
- model framework: pytorch
- data type: f32
- Device info:
- GPU: 1 NVIDIA A100-SXM4-80GB
- CPU: Intel(R) Xeon(R) Platinum 8358 CPU
model name | our result | mlperf result | input shape* | dataset |
---|---|---|---|---|
resnet | 76.144%(top1 Acc.) | 76.014%(top1 Acc.) | 1x3x224x224(NxCxHxW) | Imagenet2012 validation (num_data: 50,000) |
retinanet | 0.3755(mAP) | 0.3755(mAP) | 1x3x800x800(NxCxHxW) | MLPerf Openimages (num_data: 24,781) |
3d-unet | 0.86173(Dice) | 0.86170(Dice) | 1x1x128x128x128(NxCxDxHxW) | eval set of KiTS 2019 (num_data: 2,761) |
bert | 90.874%(F1) | 90.874%(F1) | 1x384(NxS) | SQuAD v1.1 validation set (num_data: 10,833) |
gpt-j | 42.9865(Rouge1) | 42.9865(Rouge1) | 1x1919(NxS) | CNN-Daily Mail (num_data: 13,368) |
rnnt | 7.45901%(WER) | 7.452%(WER) | 500x1x240(SxNxF) | OpenSLR LibriSpeech Corpus (num_data: 2,513) |
dlrm-v2 | TBA | 80.31%(AUC) | TBA | Criteo Terabyte (day 23) (num_data: TBA) |
* Shape of the preprocessed (transformed/tokenized) input. Notations:
- N: Batch size
- C: input Channel dimension
- H: Height dimension
- W: Width dimension
- D: Depth dimension
- S: max Sequence length
- F: input Feature dimension
To get verified evaluation log:
# (optional) if not installed,
pip install dvc[s3]
make log_[model_name]
- model_name includes [resnet, retinanet, 3d-unet, bert, gpt-j, rnnt, all]
- For example, with
make log_resnet
the evaluation log of ResNet will be pulled to logs/internal/resnet.
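Since all is one of the accepted model_name values, every verified log can be pulled in one step (a usage sketch following the same pattern):

```bash
# pull the verified evaluation logs for every model at once
make log_all
```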
- Default settings:
- scenario: Offline
- model framework: pytorch
- LLaMA2-70b

data type | our result | mlperf result | elapsed time | device |
---|---|---|---|---|
float32 | 44.4312(Rouge1) | 44.4312(Rouge1) | 6.6 days | 6 H100 PCIe |
float16 | 44.4362(Rouge1) | - | 3.5 days | 3 A100-SXM4-80GB |
bfloat16 | 44.4625(Rouge1) | - | 3.6 days | 3 A100-SXM4-80GB |
- Stable Diffusion XL base

data type | our result | mlperf result | elapsed time | device |
---|---|---|---|---|
float32 | 31.7466(clip_score) | 31.6863(clip_score) | 24 hours | 1 GeForce RTX 3090 |
float16 | 31.7558(clip_score) | - | 7.5 hours | 1 GeForce RTX 3090 |
bfloat16 | 31.7380(clip_score) | - | 8.3 hours | 1 GeForce RTX 3090 |
To get verified evaluation log:
# (optional) if not installed,
pip install dvc[s3]
make log_[model_name]
- model_name includes [llama2, stablediffusion, all]
- For example, with
make log_llama2
the evaluation logs of LLaMA2-70b will be pulled to logs/internal/llama2-70b.
MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios.
Please see the MLPerf Inference benchmark paper for a detailed description of the benchmarks along with the motivation and guiding principles behind the benchmark suite. If you use any part of this benchmark (e.g., reference implementations, submissions, etc.), please cite the following:
@misc{reddi2019mlperf,
title={MLPerf Inference Benchmark},
author={Vijay Janapa Reddi and Christine Cheng and David Kanter and Peter Mattson and Guenther Schmuelling and Carole-Jean Wu and Brian Anderson and Maximilien Breughe and Mark Charlebois and William Chou and Ramesh Chukka and Cody Coleman and Sam Davis and Pan Deng and Greg Diamos and Jared Duke and Dave Fick and J. Scott Gardner and Itay Hubara and Sachin Idgunji and Thomas B. Jablin and Jeff Jiao and Tom St. John and Pankaj Kanwar and David Lee and Jeffery Liao and Anton Lokhmotov and Francisco Massa and Peng Meng and Paulius Micikevicius and Colin Osborne and Gennady Pekhimenko and Arun Tejusve Raghunath Rajan and Dilip Sequeira and Ashish Sirasao and Fei Sun and Hanlin Tang and Michael Thomson and Frank Wei and Ephrem Wu and Lingjie Xu and Koichi Yamada and Bing Yu and George Yuan and Aaron Zhong and Peizhao Zhang and Yuchen Zhou},
year={2019},
eprint={1911.02549},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Please see here for the MLPerf inference documentation website, which includes automated commands to run MLPerf inference benchmarks using different implementations.
For submissions, please use the master branch and any commit since the 4.1 seed release, although it is best to use the latest commit. The v4.1 tag will be created from the master branch after the results are published.
For power submissions, please use SPEC PTD 1.10 (needs special access) and any commit of the power-dev repository after the code freeze.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge,datacenter |
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm-v2 | recommendation/dlrm_v2 | pytorch | Multihot Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
gpt-j | language/gpt-j | pytorch | CNN-Daily Mail | edge,datacenter |
stable-diffusion-xl | text_to_image | pytorch | COCO 2014 | edge,datacenter |
llama2-70b | language/llama2-70b | pytorch | OpenOrca | datacenter |
mixtral-8x7b | language/mixtral-8x7b | pytorch | OpenOrca, MBXP, GSM8K | datacenter |
- Framework here is given for the reference implementation. Submitters are free to use their own frameworks to run the benchmark.
There is an extra one-week extension allowed only for the llama2-70b submissions. For submissions, please use the master branch and any commit since the 4.0 seed release, although it is best to use the latest commit. The v4.0 tag will be created from the master branch after the results are published.
For power submissions, please use SPEC PTD 1.10 (needs special access) and any commit of the power-dev repository after the code freeze.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge,datacenter |
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm-v2 | recommendation/dlrm_v2 | pytorch | Multihot Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
gpt-j | language/gpt-j | pytorch | CNN-Daily Mail | edge,datacenter |
stable-diffusion-xl | text_to_image | pytorch | COCO 2014 | edge,datacenter |
llama2-70b | language/llama2-70b | pytorch | OpenOrca | datacenter |
- Framework here is given for the reference implementation. Submitters are free to use their own frameworks to run the benchmark.
Please use the v3.1 tag (git checkout v3.1) if you would like to reproduce the v3.1 results.
For reproducing power submissions, please use the master branch of the MLCommons power-dev repository and check out commit e9e16b1299ef61a2a5d8b9abf5d759309293c440.
You can see the individual README files in the benchmark task folders for more details regarding the benchmarks. For reproducing the submitted results please see the README files under the respective submitter folders in the inference v3.1 results repository.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge,datacenter |
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm-v2 | recommendation/dlrm_v2 | pytorch | Multihot Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
gpt-j | language/gpt-j | pytorch | CNN-Daily Mail | edge,datacenter |
Please use the v3.0 tag (git checkout v3.0) if you would like to reproduce v3.0 results.
You can see the individual Readme files in the reference app for more details.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx, tvm | imagenet2012 | edge,datacenter |
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm | recommendation/dlrm | pytorch, tensorflow | Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
Use the r2.1 branch (git checkout r2.1) if you want to submit or reproduce v2.1 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx | imagenet2012 | edge,datacenter |
retinanet 800x800 | vision/classification_and_detection | pytorch, onnx | openimages resized to 800x800 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm | recommendation/dlrm | pytorch, tensorflow | Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
Use the r2.0 branch (git checkout r2.0) if you want to submit or reproduce v2.0 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx | imagenet2012 | edge,datacenter |
ssd-mobilenet 300x300 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 | edge |
ssd-resnet34 1200x1200 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm | recommendation/dlrm | pytorch, tensorflow | Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet-kits19 | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
Use the r1.1 branch (git checkout r1.1) if you want to submit or reproduce v1.1 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx | imagenet2012 | edge,datacenter |
ssd-mobilenet 300x300 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 | edge |
ssd-resnet34 1200x1200 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm | recommendation/dlrm | pytorch, tensorflow | Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet | pytorch, tensorflow(?), onnx(?) | BraTS 2019 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
Use the r1.0 branch (git checkout r1.0) if you want to submit or reproduce v1.0 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset | category |
---|---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, onnx | imagenet2012 | edge,datacenter |
ssd-mobilenet 300x300 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 | edge |
ssd-resnet34 1200x1200 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 | edge,datacenter |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
dlrm | recommendation/dlrm | pytorch, tensorflow(?) | Criteo Terabyte | datacenter |
3d-unet | vision/medical_imaging/3d-unet | pytorch, tensorflow(?), onnx(?) | BraTS 2019 | edge,datacenter |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
Use the r0.7 branch (git checkout r0.7) if you want to submit or reproduce v0.7 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset |
---|---|---|---|
resnet50-v1.5 | vision/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012 |
ssd-mobilenet 300x300 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 |
ssd-resnet34 1200x1200 | vision/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 |
bert | language/bert | tensorflow, pytorch, onnx | squad-1.1 |
dlrm | recommendation/dlrm | pytorch, tensorflow(?), onnx(?) | Criteo Terabyte |
3d-unet | vision/medical_imaging/3d-unet | pytorch, tensorflow(?), onnx(?) | BraTS 2019 |
rnnt | speech_recognition/rnnt | pytorch | OpenSLR LibriSpeech Corpus |
Use the r0.5 branch (git checkout r0.5) if you want to reproduce v0.5 results.
See the individual Readme files in the reference app for details.
model | reference app | framework | dataset |
---|---|---|---|
resnet50-v1.5 | v0.5/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012 |
mobilenet-v1 | v0.5/classification_and_detection | tensorflow, pytorch, onnx | imagenet2012 |
ssd-mobilenet 300x300 | v0.5/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 300x300 |
ssd-resnet34 1200x1200 | v0.5/classification_and_detection | tensorflow, pytorch, onnx | coco resized to 1200x1200 |
gnmt | v0.5/translation/gnmt/ | tensorflow, pytorch | See Readme |