# MLPerf Inference Benchmarks for Natural Language Processing

This is the reference implementation for MLPerf Inference benchmarks for Natural Language Processing.

The chosen model is BERT-Large performing the SQuAD v1.1 question answering task.

## Prerequisites

- nvidia-docker
- Any NVIDIA GPU supported by TensorFlow or PyTorch

## Supported Models

| model | framework | accuracy | dataset | model link | model source | precision | notes |
| ----- | --------- | -------- | ------- | ---------- | ------------ | --------- | ----- |
| BERT-Large | TensorFlow | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | from zenodo; BERT-Large, trained with NVIDIA DeepLearningExamples | fp32 | |
| BERT-Large | PyTorch | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | BERT-Large, trained with NVIDIA DeepLearningExamples, converted with `bert_tf_to_pytorch.py` | fp32 | |
| BERT-Large | ONNX | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | BERT-Large, trained with NVIDIA DeepLearningExamples, converted with `bert_tf_to_pytorch.py` | fp32 | |
| BERT-Large | ONNX | f1_score=90.067% | SQuAD v1.1 validation set | from zenodo | Fine-tuned based on the PyTorch model and converted with `bert_tf_to_pytorch.py` | int8, symmetrically per-tensor quantized without bias | See [MLPerf INT8 BERT Finetuning.pdf](MLPerf INT8 BERT Finetuning.pdf) for details about the fine-tuning process |
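
The int8 entry above is described as symmetric per-tensor quantization without bias (i.e. a single scale for the whole tensor and no zero-point). The snippet below is a minimal NumPy sketch of that mapping for illustration only; it is not the code that produced the quantized model in the table.

```python
import numpy as np

def quantize_symmetric_per_tensor(w: np.ndarray, num_bits: int = 8):
    """Symmetric per-tensor quantization: one scale for the whole tensor,
    no zero-point (no bias/offset in the mapping)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.max(np.abs(w)) / qmax        # single scale for the tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = quantize_symmetric_per_tensor(w)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(f"scale={scale:.6f}, mean abs error={err:.6f}")
```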

## Disclaimer

This benchmark app is a reference implementation that is not meant to be the fastest implementation possible.

## Commands

Please run the following commands:

- `make setup`: initialize the submodules, download the datasets, and download the models.
- `make build_docker`: build the docker image.
- `make launch_docker`: launch the docker container with an interactive session.
- `python3 run.py --backend=[tf|pytorch|onnxruntime|tf_estimator] --scenario=[Offline|SingleStream|MultiStream|Server] [--accuracy] [--quantized]`: run the harness inside the docker container. Performance or accuracy results are printed to the console. A sketch of how these flags fit together follows below.
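
For illustration, the following standalone sketch shows one way the flags listed above could be parsed and reported. It is not the actual contents of `run.py`, and the defaults shown are assumptions.

```python
import argparse

# Illustrative sketch of the command-line surface described above;
# the real run.py in this repository may differ in structure and defaults.
BACKENDS = ["tf", "pytorch", "onnxruntime", "tf_estimator"]
SCENARIOS = ["Offline", "SingleStream", "MultiStream", "Server"]

def get_args():
    parser = argparse.ArgumentParser(
        description="BERT SQuAD v1.1 benchmark harness (sketch)")
    parser.add_argument("--backend", choices=BACKENDS, default="tf")
    parser.add_argument("--scenario", choices=SCENARIOS, default="Offline")
    parser.add_argument("--accuracy", action="store_true",
                        help="run in accuracy mode instead of performance mode")
    parser.add_argument("--quantized", action="store_true",
                        help="use the int8-quantized model where supported")
    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    mode = "AccuracyOnly" if args.accuracy else "PerformanceOnly"
    print(f"backend={args.backend} scenario={args.scenario} "
          f"mode={mode} quantized={args.quantized}")
```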

## Details

- SUT implementations are in `tf_SUT.py`, `tf_estimator_SUT.py` and `pytorch_SUT.py`. The QSL implementation is in `squad_QSL.py`.
- The script `accuracy-squad.py` parses the LoadGen accuracy log, post-processes it, and computes the accuracy.
- Tokenization and detokenization (post-processing) are not included in the timed path.
- The inputs to the SUT are `input_ids`, `input_mask`, and `segment_ids`. The output from the SUT is `start_logits` and `end_logits` concatenated together (see the sketch after this list).
- `max_seq_length` is 384.
- The script `tf_freeze_bert.py` freezes the TensorFlow model into a `.pb` file.
- The script `bert_tf_to_pytorch.py` converts the TensorFlow model into the PyTorch `BertForQuestionAnswering` module from HuggingFace Transformers and also exports the model to ONNX format.
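
The sketch below illustrates the per-sample input/output contract described in the list above. The `model` callable and the exact output layout (here a stacked `(384, 2)` array) are assumptions for illustration only; the real SUTs in `tf_SUT.py`, `tf_estimator_SUT.py` and `pytorch_SUT.py` wrap actual framework sessions and return the buffer to LoadGen.

```python
import numpy as np

MAX_SEQ_LENGTH = 384  # as stated above

def run_one_sample(model, input_ids, input_mask, segment_ids):
    """Illustrative shape of one SUT inference call.

    `model` is a hypothetical callable returning per-token start/end logits;
    the real backends wrap TensorFlow, PyTorch or ONNX Runtime sessions.
    """
    assert input_ids.shape == (MAX_SEQ_LENGTH,)
    assert input_mask.shape == (MAX_SEQ_LENGTH,)
    assert segment_ids.shape == (MAX_SEQ_LENGTH,)

    start_logits, end_logits = model(input_ids, input_mask, segment_ids)

    # Combine start and end logits into one buffer per sample, which
    # accuracy-squad.py can later split and post-process.
    return np.stack([start_logits, end_logits], axis=-1).astype(np.float32)

if __name__ == "__main__":
    # Dummy model standing in for a real backend, just to show the data flow.
    dummy = lambda ids, mask, seg: (np.zeros(MAX_SEQ_LENGTH, np.float32),
                                    np.zeros(MAX_SEQ_LENGTH, np.float32))
    ids = np.zeros(MAX_SEQ_LENGTH, dtype=np.int64)
    out = run_one_sample(dummy, ids, ids, ids)
    print(out.shape)  # (384, 2)
```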

## License

Apache License 2.0