merge with SCC'23 branch (#1003)
gfursin authored Nov 8, 2023
2 parents 1df8175 + 69dfeb4 commit fbb577e
Showing 2 changed files with 210 additions and 31 deletions.
6 changes: 6 additions & 0 deletions cm-mlops/automation/utils/module.py
@@ -626,6 +626,12 @@ def create_toc_from_md(self, i):

x = title.lower().replace(' ','-')

# Strip up to two leading and trailing '*' characters (Markdown bold/italic
# markers) so that emphasized headings map to the same TOC anchors as plain ones.
for k in range(0, 2):
    if x.startswith('*'):
        x = x[1:]
    if x.endswith('*'):
        x = x[:-1]

for z in [':', '+', '.', '(', ')', ',']:
    x = x.replace(z, '')

235 changes: 204 additions & 31 deletions docs/tutorials/scc23-mlperf-inference-bert.md
@@ -29,36 +29,34 @@
* [Run reference MLPerf inference benchmark with ONNX run-time](#run-reference-mlperf-inference-benchmark-with-onnx-run-time)
* [Run short reference MLPerf inference benchmark to measure accuracy (offline scenario)](#run-short-reference-mlperf-inference-benchmark-to-measure-accuracy-offline-scenario)
* [Run short MLPerf inference benchmark to measure performance (offline scenario)](#run-short-mlperf-inference-benchmark-to-measure-performance-offline-scenario)
* [Prepare minimal MLPerf submission to the SCC committee](#prepare-minimal-mlperf-submission-to-the-scc-committee)
* [**Prepare minimal MLPerf submission to the SCC committee**](#prepare-minimal-mlperf-submission-to-the-scc-committee)
* [Optional: publish results at the live SCC'23 dashboard](#optional-publish-results-at-the-live-scc'23-dashboard)
* [Optional: debug reference implementation](#optional-debug-reference-implementation)
* [Optional: extend reference implementation](#optional-extend-reference-implementation)
* [Optional: use another compatible BERT model (for example from the Hugging Face Hub)](#optional-use-another-compatible-bert-model-for-example-from-the-hugging-face-hub)
* [Optional (research): pruning and benchmarking BERT models](#optional-research-pruning-and-benchmarking-bert-models)
* [Optional: use another ML framework](#optional-use-another-ml-framework)
* [PyTorch](#pytorch)
* [TensorFlow](#tensorflow)
* [Optional: use CUDA with reference implementation](#optional-use-cuda-with-reference-implementation)
* [Run optimized implementation of the MLPerf inference BERT benchmark](#run-optimized-implementation-of-the-mlperf-inference-bert-benchmark)
* [Showcase CPU performance (x64 or Arm64)](#showcase-cpu-performance-x64-or-arm64)
* [Run quantized and pruned BERT model (int8) on CPU](#run-quantized-and-pruned-bert-model-int8-on-cpu)
* [Prepare optimized MLPerf submission to the SCC committee](#prepare-optimized-mlperf-submission-to-the-scc-committee)
* [**Prepare optimized MLPerf submission to the SCC committee**](#prepare-optimized-mlperf-submission-to-the-scc-committee)
* [Optional: debug DeepSparse implementation](#optional-debug-deepsparse-implementation)
* [Optional: extend this implementation](#optional-extend-this-implementation)
* [Optional: use another compatible BERT model with DeepSparse backend](#optional-use-another-compatible-bert-model-with-deepsparse-backend)
* [Optional: use another compatible BERT model from the NeuralMagic Zoo directly (fp32)](#optional-use-another-compatible-bert-model-from-the-neuralmagic-zoo-directly-fp32)
* [Showcase Nvidia GPU performance](#showcase-nvidia-gpu-performance)
* [Showcase Nvidia AMD performance](#showcase-nvidia-amd-performance)
* [Optimize benchmark yourself](#optimize-benchmark-yourself)
* [Changing batch size](#changing-batch-size)
* [Adding support for multi-node execution](#adding-support-for-multi-node-execution)
* [Adding new implementation for new hardware](#adding-new-implementation-for-new-hardware)
* [Showcase AMD performance](#showcase-amd-performance)
* [The next steps](#the-next-steps)
* [Acknowledgments](#acknowledgments)
* [Nvidia MLPerf inference backend](#nvidia-mlperf-inference-backend)
* [DeepSparse MLPerf inference backend](#deepsparse-mlperf-inference-backend)

</details>


*This document is still being updated and will be finalized soon!*


@@ -89,8 +87,8 @@ that you will submit to the SCC organizers to get points.



*An interactive version of the short versionof this tutorial is available
at this [Google colab page](https://colab.research.google.com/drive/1kgw1pdKi8QcCTqPZu1Vh_ur1NOeTRdWJ?usp=sharing)*.
*An interactive version of the short version of this tutorial is available
at this [Google Colab page](https://colab.research.google.com/drive/1kgw1pdKi8QcCTqPZu1Vh_ur1NOeTRdWJ?usp=sharing)*.



@@ -146,14 +144,14 @@ to obtain the first (min) set of points:
* `mlperf_submission_short.md` - description of your platform and some highlights of the MLPerf benchmark execution.

You will need to submit the following files with the optimized MLPerf BERT inference results
to obtain main points (including major bonus points for improving exisitng benchmark
to obtain main points (including major bonus points for improving existing benchmark
implementations and adding new hardware backends):

* `mlperf_submission_{N}.tar.gz` - automatically generated file with validated MLPerf results.
* `mlperf_submission_{N}_summary.json` - automatically generated summary of MLPerf results.
* `mlperf_submission_{N}.run` - CM commands to run MLPerf BERT inference benchmark saved to this file.
* `mlperf_submission_{N}.tstamps` - execution timestamps before and after CM command saved to this file.
* `mlperf_submission_{N}.md` - your highglights, optimizations, improvements and extensions of the MLPerf BERT inference benchmark
* `mlperf_submission_{N}.md` - your highlights, optimizations, improvements and extensions of the MLPerf BERT inference benchmark
(new hardware backends, support for multi-node execution, batch size, quantization, etc).
Note that you will need to provide a PR with open-source Apache 2.0 improvements
to the [MLCommons inference repo](https://github.com/mlcommons/inference)
@@ -750,7 +748,7 @@ Note that the MLPerf BERT inference throughput (samples per second) is very low



### Prepare minimal MLPerf submission to the SCC committee
### **Prepare minimal MLPerf submission to the SCC committee**

You are now ready to generate a submission similar to the ones appearing
on the [official MLPerf inference dashboard](https://mlcommons.org/en/inference-edge-31).
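
The exact command is in the collapsed part of this diff; as a rough sketch (mirroring the reference-implementation submission commands shown later in this tutorial; the specific flag values below are assumptions, not the verbatim command), it follows this pattern:

```bash
cmr "run mlperf inference generate-run-cmds _submission _short" \
     --submitter="SCC23" \
     --hw_name=default \
     --implementation=reference \
     --model=bert-99 \
     --backend=onnxruntime \
     --device=cpu \
     --scenario=Offline \
     --execution-mode=test \
     --test_query_count=1000 \
     --adr.mlperf-implementation.tags=_repo.https://github.com/ctuning/inference,_branch.scc23 \
     --adr.mlperf-implementation.version=custom \
     --quiet \
     --output_tar=mlperf_submission_short.tar.gz \
     --output_summary=mlperf_submission_short_summary \
     --clean
```

The resulting tar and summary files correspond to the `mlperf_submission_short.*` artifacts listed in the minimal submission checklist earlier in this tutorial.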
@@ -1075,6 +1073,20 @@ cmr "app mlperf inference generic _python _bert-99 _onnxruntime _cpu" \
--rerun
```

### Optional (research): pruning and benchmarking BERT models

You can check our [reproducibility initiative for ACM/IEEE/NeurIPS conferences](https://cTuning.org/ae),
prune BERT model based on [NeurIPS 2022 paper "A Fast Post-Training Pruning Framework for Transformers"](https://arxiv.org/abs/2204.09656)
using [this CM script](https://github.com/ctuning/cm-reproduce-research-projects/blob/main/script/reproduce-neurips-paper-2022-arxiv-2204.09656/README-extra.md)
and feed a newly pruned BERT model to the MLPerf inference benchmarking using `--env.CM_MLPERF_CUSTOM_MODEL_PATH` flag in the above command.
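
For illustration, a pruned model could be plugged into the short test run from earlier in this tutorial roughly as follows (a sketch only: the model path is hypothetical and depends on where your pruning script stores its output, and the remaining flags simply mirror the short runs shown above):

```bash
cmr "app mlperf inference generic _python _bert-99 _onnxruntime _cpu" \
     --scenario=Offline \
     --mode=accuracy \
     --execution-mode=test \
     --test_query_count=10 \
     --adr.mlperf-implementation.tags=_repo.https://github.com/ctuning/inference,_branch.scc23 \
     --adr.mlperf-implementation.version=custom \
     --adr.compiler.tags=gcc \
     --env.CM_MLPERF_CUSTOM_MODEL_PATH=$HOME/pruned-bert/model.onnx \
     --quiet \
     --rerun
```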

We also suggest you to check another [related project from Hugging Face](https://github.com/huggingface/nn_pruning),
add CM interface and test it with the MLPerf inference benchmark.

Please contact us via the public [Discord server](https://discord.gg/JjWNWXKxwT) to participate in this collaborative R&D.




### Optional: use another ML framework

Expand Down Expand Up @@ -1122,6 +1134,32 @@ from [Zenodo](https://zenodo.org/record/3939747) (~1.3GB) and plug it into the C

### Optional: use CUDA with reference implementation

You can install or detect CUDA drivers, toolkit and cuDNN via CM as follows:
```bash
cmr "install prebuilt-cuda _driver"
cmr "get cuda _cudnn"
```

You can print info about CUDA devices via CM as follows:
```bash
cmr "get cuda-devices"
```

You can now run MLPerf inference benchmark with PyTorch and CUDA as follows:

```bash
cmr "app mlperf inference generic _python _bert-99 _pytorch _cuda" \
--scenario=Offline \
--mode=performance \
--execution-mode=test \
--test_query_count=10 \
--adr.mlperf-implementation.tags=_repo.https://github.com/ctuning/inference,_branch.scc23 \
--adr.mlperf-implementation.version=custom \
--adr.compiler.tags=gcc \
--quiet \
--rerun
```




@@ -1131,7 +1169,7 @@ from [Zenodo](https://zenodo.org/record/3939747) (~1.3GB) and plug it into the C
Now you are ready to run optimized implementations of the MLPerf inference benchmark
for the hardware that you want to showcase at SCC'23.

You will get extra points propotional
You will get extra points proportional
to the MLPerf BERT inference throughput obtained on your system. You will also get the major
bonus points for any improvements to the MLPerf inference implementation including support
for new hardware such as AMD GPUs.
@@ -1183,17 +1221,17 @@ cmr "run mlperf inference generate-run-cmds _submission _short" \
--clean
```

#### Prepare optimized MLPerf submission to the SCC committee
#### **Prepare optimized MLPerf submission to the SCC committee**

You will need to submit the following files with the optimized MLPerf BERT inference results
to obtain main points (including major bonus points for improving exisitng benchmark
to obtain main points (including major bonus points for improving existing benchmark
implementations and adding new hardware backends):

* `mlperf_submission_{N}.tar.gz` - automatically generated file with validated MLPerf results.
* `mlperf_submission_{N}_summary.json` - automatically generated summary of MLPerf results.
* `mlperf_submission_{N}.run` - CM commands to run MLPerf BERT inference benchmark saved to this file.
* `mlperf_submission_{N}.tstamps` - execution timestamps before and after CM command saved to this file.
* `mlperf_submission_{N}.md` - your highglights, optimizations, improvements and extensions of the MLPerf BERT inference benchmark
* `mlperf_submission_{N}.md` - your highlights, optimizations, improvements and extensions of the MLPerf BERT inference benchmark
(new hardware backends, support for multi-node execution, batch size, quantization, etc).
Note that you will need to provide a PR with open-source Apache 2.0 improvements
to the [MLCommons inference repo](https://github.com/mlcommons/inference)
@@ -1274,7 +1312,7 @@ used to prepare MLPerf inference v3.1 submissions with multiple BERT model varia

#### Optional: use another compatible BERT model from the NeuralMagic Zoo directly (fp32)

You can find and use any compatible model from the NueralMagic Zoo with the MLPerf inference benchmark via CM as follows:
You can find and use any compatible model from the NeuralMagic Zoo with the MLPerf inference benchmark via CM as follows:

```bash
cmr "run mlperf inference generate-run-cmds _submission _short" \
@@ -1289,8 +1327,8 @@ cmr "run mlperf inference generate-run-cmds _submission _short" \
--adr.mlperf-inference-implementation.max_batchsize=128 \
--env.CM_MLPERF_NEURALMAGIC_MODEL_ZOO_STUB=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni \
--quiet \
--output_tar=mlperf_submission_1.tar.gz \
--output_summary=mlperf_submission_1_summary \
--output_tar=mlperf_submission_3.tar.gz \
--output_summary=mlperf_submission_3_summary \
--clean
```

@@ -1300,33 +1338,169 @@ used to prepare MLPerf inference v3.1 submissions with multiple BERT model varia
* https://github.com/mlcommons/ck/blob/master/cm-mlops/script/run-all-mlperf-models/run-pruned-bert.sh








### Showcase Nvidia GPU performance

Please follow [this README](https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/bert/README_nvidia.md)
to run the MLPerf BERT inference benchmark on Nvidia GPU(s).

Here is a summary of the CM commands you may need to run Nvidia's implementation of the MLPerf inference benchmark
while adapting it to your environment (unless you prefer to use a container).
Note that the Nvidia setup is interactive and you will need to answer a few questions about your system:

```bash
cmr "install prebuilt-cuda _driver"

cmr "get cudnn" --tar_file={full path to the cuDNN tar file downloaded from https://developer.nvidia.com/cudnn}

cmr "get tensorrt _dev" --tar_file={full path to the TensorRT tar file downloaded from https://developer.nvidia.com/tensorrt-download}

nohup time cmr "generate-run-cmds inference _find-performance" \
--submitter="SCC23" \
--hw_name=default \
--implementation=nvidia-original \
--model=bert-99 \
--backend=tensorrt \
--device=cuda \
--scenario=Offline \
--category=edge \
--division=open \
--quiet \
--clean

cmr "generate-run-cmds inference _submission" \
--submitter="SCC23" \
--hw_name=default \
--implementation=nvidia-original \
--model=bert-99 \
--backend=tensorrt \
--device=cuda \
--scenario=Offline \
--category=edge \
--division=open \
--execution-mode=valid \
--quiet \
--output_tar=mlperf_submission_1.tar.gz \
--output_summary=mlperf_submission_1_summary \
--clean


```

Note that since CM attempts to adapt MLPerf to your environment, your combination of dependencies
may not have been tested by the community and may sometimes fail. In such cases, please
report issues [here](https://github.com/mlcommons/ck/issues) to help the community
continuously and collaboratively improve CM workflows and make them more portable
(that's why we called our automation language "Collective Mind").

For example, you may often need to upgrade protobuf to the latest version until the community adds better
handling of the protobuf version to the CM-MLPerf pipeline:
```bash
pip install --upgrade protobuf
```

#### **Prepare optimized MLPerf submission to the SCC committee**

You will need to submit the following files with the optimized MLPerf BERT inference results
to obtain more points proportional to your performance (in comparison with other teams
using Nvidia GPUs):

* `mlperf_submission_{N}.tar.gz` - automatically generated file with validated MLPerf results.
* `mlperf_submission_{N}_summary.json` - automatically generated summary of MLPerf results.
* `mlperf_submission_{N}.run` - CM commands to run MLPerf BERT inference benchmark saved to this file.
* `mlperf_submission_{N}.tstamps` - execution timestamps before and after CM command saved to this file.
* `mlperf_submission_{N}.md` - description of your submission

where N is your attempt number out of 5.

You will need to get in touch with Nvidia if you want to optimize this submission further.

For example, our Nvidia colleagues shared the following suggestions that may improve the performance of their implementation wrapped and unified by CM:
* change the version of TRT (see the sketch after this list)
* tune the config files
* check this [performance guide](https://github.com/mlcommons/inference_results_v3.1/blob/main/closed/NVIDIA/documentation/performance_tuning_guide.md)
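
For instance, to experiment with a different TensorRT version, one option (a sketch based on the same CM command used in the setup above) is to register another TensorRT archive with CM and then re-run the benchmark commands:

```bash
cmr "get tensorrt _dev" --tar_file={full path to the new TensorRT tar file downloaded from https://developer.nvidia.com/tensorrt-download}
```

Depending on how CM resolves multiple registered versions, you may also need to remove the previously detected TensorRT from the CM cache first.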




### Showcase Nvidia AMD performance

### Showcase AMD performance


There is pilot CM support for running MLPerf BERT inference on AMD GPUs
with [ROCm](https://www.amd.com/en/graphics/servers-solutions-rocm).

There is a pilot project to run MLPerf BERT inference on AMD GPU.
We are testing it and plan to add to CM workflows soon.
You can test it as follows:

```bash
cmr "app mlperf inference generic _python _bert-99 _onnxruntime" \
--scenario=Offline \
--mode=performance \
--device=rocm \
--execution-mode=test \
--test_query_count=10 \
--rerun \
--adr.mlperf-implementation.tags=_repo.https://github.com/ctuning/inference,_branch.scc23 \
--adr.mlperf-implementation.version=custom \
--adr.compiler.tags=gcc \
--quiet
```

You can then prepare a submission as follows:

```bash
cmr "run mlperf inference generate-run-cmds _submission _short" \
--submitter="SCC23" \
--hw_name=default \
--implementation=reference \
--model=bert-99 \
--backend=onnxruntime \
--device=rocm \
--scenario=Offline \
--execution-mode=test \
--test_query_count=1000 \
--adr.mlperf-implementation.tags=_repo.https://github.com/ctuning/inference,_branch.scc23 \
--adr.mlperf-implementation.version=custom \
--quiet \
--output_tar=mlperf_submission_2.tar.gz \
--output_summary=mlperf_submission_2_summary \
--clean
```


You will see a long output that should contain the following line with accuracy
(to make sure that MLPerf works properly):
```bash
{"exact_match": 70.0, "f1": 70.0}
```
Please get in touch with [the community via the Discord server](https://discord.gg/JjWNWXKxwT)
if you encounter issues or would like to help test or extend it!


#### **Prepare optimized MLPerf submission to the SCC committee**

## Optimize benchmark yourself
You will need to submit the following files with the optimized MLPerf BERT inference results
to obtain main points (including major bonus points for improving existing benchmark
implementations and adding new hardware backends):

### Changing batch size
* `mlperf_submission_{N}.tar.gz` - automatically generated file with validated MLPerf results.
* `mlperf_submission_{N}_summary.json` - automatically generated summary of MLPerf results.
* `mlperf_submission_{N}.run` - CM commands to run MLPerf BERT inference benchmark saved to this file.
* `mlperf_submission_{N}.tstamps` - execution timestamps before and after CM command saved to this file.
* `mlperf_submission_{N}.md` - your highlights, optimizations, improvements and extensions of the MLPerf BERT inference benchmark
(new hardware backends, support for multi-node execution, batch size, quantization, etc).
Note that you will need to provide a PR with open-source Apache 2.0 improvements
to the [MLCommons inference repo](https://github.com/mlcommons/inference)
or our [stable fork](https://github.com/ctuning/inference).

### Adding support for multi-node execution
where N is your attempt number out of 5.

### Adding new implementation for new hardware



@@ -1340,11 +1514,10 @@ and the [cTuning foundation](https://cTuning.org) continue working with the comm
to enable universal benchmarking of AI/ML systems across any model, data set, software and hardware
using CM and loadgen. We are also developing a universal Python and C++ harness
to make it easier to plug in different models, data sets, frameworks and hardware backends
together with a [user-friendly GUI/platform](https://cknowledge.org/mlperf-inference-gui) to run this benchmark.
Please join our [Discord server](https://discord.gg/JjWNWXKxwT) to provide your feedback and participate in these community developments!


together with a [user-friendly GUI/platform](https://cknowledge.org/mlperf-inference-gui) to run, compare and reproduce ML(Perf) benchmarks.
We welcome other MLPerf and CM extensions including support for multi-node execution, better implementations, optimizations and new hardware backends.

Please join our [Discord server](https://discord.gg/JjWNWXKxwT) to provide your feedback and participate in these community developments!

## Acknowledgments

