From a7a2b43ec308190d91285de8d0838c0d1b491c08 Mon Sep 17 00:00:00 2001
From: Pablo Gonzalez
Date: Tue, 22 Oct 2024 09:50:15 -0500
Subject: [PATCH 1/2] Update for v5.0: only edge BERT and remove RNNT

---
 inference_rules.adoc | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/inference_rules.adoc b/inference_rules.adoc
index 9318ef0..cb571e7 100644
--- a/inference_rules.adoc
+++ b/inference_rules.adoc
@@ -174,7 +174,6 @@ Each sample has the following definition:
 |Resnet50-v1.5 |one image
 |Retinanet |one image
 |3D UNET |one image
-|RNNT |one raw speech sample up to 15 seconds
 |BERT |one sequence
 |DLRMv2 |up to 700 user-item pairs (more details in FAQ)
 |GPT-J |one sequence
@@ -253,8 +252,6 @@ The Datacenter suite includes the following benchmarks:
 |Vision |Image classification |Resnet50-v1.5 |ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms
 |Vision |Object detection |Retinanet |OpenImages (800x800) | 64 | 99% of FP32 (0.3755 mAP) | 100 ms
 |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A
-|Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
-|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
 |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s
 |Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
 |Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between than 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
@@ -269,8 +266,6 @@ Each Datacenter benchmark *requires* the following scenarios:
 |Vision |Image classification |Server, Offline
 |Vision |Object detection |Server, Offline
 |Vision |Medical image segmentation |Offline
-|Speech |Speech-to-text |Server, Offline
-|Language |Language processing |Server, Offline
 |Language |Summarization |Server, Offline
 |Language |Question Answering |Server, Offline
 |Commerce |Recommendation |Server, Offline
@@ -284,8 +279,7 @@ The Edge suite includes the following benchmarks:
 |Vision |Image classification |Resnet50-v1.5 |ImageNet (224x224) | 1024 | 99% of FP32 (76.46%)
 |Vision |Object detection |Retinanet |OpenImages (800x800) | 64 | 99% of FP32 (0.3755 mAP)
 |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score)
-|Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds)| 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%)
-|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%)
+|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%)
 |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)
 |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]
 |===
@@ -297,7 +291,6 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an
 |Vision |Image classification |Single Stream, Multistream, Offline
 |Vision |Object detection |Single Stream, Multistream, Offline
 |Vision |Medical image segmentation |Single Stream, Offline
-|Speech |Speech-to-text |Single Stream, Offline
 |Language |Language processing |Single Stream, Offline
 |Generative |Text to image |Single Stream, Offline
 |Language |Summarization |Single Stream, Offline
@@ -536,9 +529,7 @@ Data formats for inputs and outputs are allowed to be compressed for network tra
 1) No compression
 2) Lossless compression
 This rule applies both for the QSL pre-processing and for post-processing function allowed in QDL for this benchmark results.
-|Speech | Speech-to-text | RNNT | Allow one of the following compression options for pre-processing:
-1) No compression 2) Lossless compression 3) The original compression of the dataset (FLAC)
 |Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation).
 1) No compression
 2) Lossless compression
@@ -1033,7 +1024,6 @@ Datacenter systems must provide at least the following bandwidths from the netwo
 |Vision |Resnet50-v1.5 |ImageNet (224x224) | __C*H*W*dtype_size__ | __3*224*224*dtype_size__ | __throughput*150528*dtype_size__
 |Vision |Retinanet |OpenImages (800x800) | __C*H*W*dtype_size__ | __3*800*800*dtype_size__ | __throughput*1920000*dtype_size__
 |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[The average image size above is the average image size of the inference cases specified in https://github.com/mlcommons/inference/blob/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json[inference_cases.json].] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__
-|Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | __max_audio_duration*num_samples_per_sec*(bits_per_sample/8)__ | __15*16000*(16/8)__ | __throughput*480000__
 |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__
 |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
 |Language |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
@@ -1050,7 +1040,6 @@ Datacenter systems must provide at least the following bandwidths from the outpu
 |Vision |Resnet50-v1.5 |ImageNet (224x224) | negligible | negligible | __> 0__
 |Vision |Retinanet |OpenImages (800x800) | negligible | negligible | __> 0__
 |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__
-|Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | negligible | negligible | __> 0__
 |Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__
 |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__
 |Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__
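The minimum-bandwidth rows in the hunks above are plain arithmetic: the per-sample input size in bytes multiplied by the measured throughput. A minimal sketch of that calculation follows; the helper name and the example throughput are illustrative assumptions, not part of the rules.

[source,python]
----
# Minimal sketch of the ingress-bandwidth floor from the table above.
# Assumptions (illustrative only): throughput is the measured samples/second,
# dtype_size is the per-element input size in bytes.

def min_ingress_bw_bytes_per_sec(throughput, elements_per_sample, dtype_size):
    """Bandwidth floor, as in the table: throughput * elements * dtype_size."""
    return throughput * elements_per_sample * dtype_size

# Resnet50-v1.5 input: C*H*W = 3*224*224 = 150528 elements per sample.
assert 3 * 224 * 224 == 150528

# Example: 20000 samples/s with int8 inputs (dtype_size = 1 byte).
print(min_ingress_bw_bytes_per_sec(20000, 150528, 1))  # 3010560000 B/s, ~3.01 GB/s
----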
From 7838420760dea6b8ce38a113c18d32b8a63f8016 Mon Sep 17 00:00:00 2001
From: Zhihan
Date: Thu, 24 Oct 2024 13:39:28 -0700
Subject: [PATCH 2/2] Update accuracy numbers to address mixtral 0-token issue

---
 inference_rules.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/inference_rules.adoc b/inference_rules.adoc
index cb571e7..b1e6991 100644
--- a/inference_rules.adoc
+++ b/inference_rules.adoc
@@ -254,7 +254,7 @@ The Datacenter suite includes the following benchmarks:
 |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A
 |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s
 |Language |Question Answering |Llama2 |OpenOrca (GPT-4 split, max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
-|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between than 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
+|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples of the GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) | 15000 | 99% of FP32 and 99.9% of FP32 (rouge1=45.5989, rouge2=23.3526, rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=144.84)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
 |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms
 |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s
 |===
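For clarity, the two constraints this series touches reduce to a few lines of arithmetic: the TTFT/TPOT latency metrics defined in the table footnotes, and the 90%-110% tokens-per-sample window checked against the updated Mixtral-8x7B reference of 144.84. The sketch below is illustrative only, under stated assumptions; the helper names are hypothetical and this is not the reference accuracy checker.

[source,python]
----
# Illustrative sketches only; names are hypothetical, not the reference checker.

def ttft_and_tpot(issue_time, token_times):
    """TTFT: latency of the first token relative to query issue.
    TPOT: average interval between the generated tokens (per the footnotes)."""
    ttft = token_times[0] - issue_time
    tpot = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
            if len(token_times) > 1 else 0.0)
    return ttft, tpot

def tokens_per_sample_ok(total_generated_tokens, num_samples, reference=144.84):
    """Mean generated tokens per sample must land within [90%, 110%] of the
    reference value (144.84 after this series)."""
    mean_tokens = total_generated_tokens / num_samples
    return 0.9 * reference <= mean_tokens <= 1.1 * reference
----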