diff --git a/inference_rules.adoc b/inference_rules.adoc
index 88c8e8f..35b1df6 100644
--- a/inference_rules.adoc
+++ b/inference_rules.adoc
@@ -174,7 +174,6 @@ Each sample has the following definition:
 |Resnet50-v1.5 |one image
 |Retinanet |one image
 |3D UNET |one image
-|RNNT |one raw speech sample up to 15 seconds
 |BERT |one sequence
 |DLRMv2 |up to 700 user-item pairs (more details in FAQ)
 |GPT-J |one sequence
@@ -253,11 +252,9 @@ The Datacenter suite includes the following benchmarks:
 |Vision |Image classification |Resnet50-v1.5 |ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | 15 ms
 |Vision |Object detection |Retinanet |OpenImages (800x800) | 64 | 99% of FP32 (0.3755 mAP) | 100 ms
 |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A
-|Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | 1000 ms
-|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | 130 ms
 |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the total generation length of the texts should be more than 90% of the reference (gen_len=4016878)| 20 s
 |Language |Question Answering |Llama2 |OpenOrca (max_seq_len=1024) | 24576 | 99% of FP32 and 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162). Additionally, for both cases the generation length of the tokens per sample should be more than 90% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Llama2, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
-|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | 15000 | 99% of FP16 ((OpenOrca)rouge1=45.4911, (OpenOrca)rouge2=23.2829, (OpenOrca)rougeL=30.3615, (gsm8k)Accuracy=73.78, (mbxp)Accuracy=60.12). Additionally, for both cases the tokens per sample should be between than 90% and 110% of the reference (tokens_per_sample=294.45)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
+|Language |Text Generation (Question Answering, Math and Code Generation) |Mixtral-8x7B |OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048) | 15000 | 99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16). Additionally, for both cases the tokens per sample should be between 90% and 110% of the reference (tokens_per_sample=144.84)| TTFT/TPOTfootnote:[For Mixtral-8x7B, 2 latency metrics are collected - time to first token (TTFT) which measures the latency of the first token, and time per output token (TPOT) which measures the average interval between all the tokens generated.]: 2000 ms/200 ms
 |Commerce |Recommendation |DLRMv2 |Synthetic Multihot Criteo Dataset | 204800 |99% of FP32 and 99.9% of FP32 (AUC=80.31%) | 60 ms
 |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s
 |===
@@ -269,8 +266,6 @@ Each Datacenter benchmark *requires* the following scenarios:
 |Vision |Image classification |Server, Offline
 |Vision |Object detection |Server, Offline
 |Vision |Medical image segmentation |Offline
-|Speech |Speech-to-text |Server, Offline
-|Language |Language processing |Server, Offline
 |Language |Summarization |Server, Offline
 |Language |Question Answering |Server, Offline
 |Commerce |Recommendation |Server, Offline
@@ -284,8 +279,7 @@ The Edge suite includes the following benchmarks:
 |Vision |Image classification |Resnet50-v1.5 |ImageNet (224x224) | 1024 | 99% of FP32 (76.46%)
 |Vision |Object detection |Retinanet |OpenImages (800x800) | 64 | 99% of FP32 (0.3755 mAP)
 |Vision |Medical image segmentation |3D UNET |KiTS 2019 | 42 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score)
-|Speech |Speech-to-text |RNNT |Librispeech dev-clean (samples < 15 seconds)| 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%)
-|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%)
+|Language |Language processing |BERT |SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%)
 |Language |Summarization |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | 13368 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881). Additionally, for both cases the generation length should be more than 90% of the reference (gen_len=4016878)
 |Generative |Text to image |SDXL |Subset of coco-2014 val | 5000 |FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]
 |===
@@ -297,7 +291,6 @@ Each Edge benchmark *requires* the following scenarios, and sometimes permit an
 |Vision |Image classification |Single Stream, Multistream, Offline
 |Vision |Object detection |Single Stream, Multistream, Offline
 |Vision |Medical image segmentation |Single Stream, Offline
-|Speech |Speech-to-text |Single Stream, Offline
 |Language |Language processing |Single Stream, Offline
 |Generative |Text to image |Single Stream, Offline
 |Language |Summarization |Single Stream, Offline
@@ -536,9 +529,7 @@ Data formats for inputs and outputs are allowed to be compressed for network tra
 1) No compression 2) Lossless compression
 This rule applies both for the QSL pre-processing and for post-processing function allowed in QDL for this benchmark results.
-|Speech | Speech-to-text | RNNT | Allow one of the following compression options for pre-processing:
-1) No compression 2) Lossless compression 3) The original compression of the dataset (FLAC)
 |Language | Language processing | BERT-large | Input is either Token IDs, Input Masks and Segment IDs or just the Token IDs (generating the other tensors at the SUT in a timed operation).
 1) No compression 2) Lossless compression
@@ -1033,7 +1024,6 @@ Datacenter systems must provide at least the following bandwidths from the netwo
 |Vision |Resnet50-v1.5 |ImageNet (224x224) | __C*H*W*dtype_size__ | __3*224*224*dtype_size__ | __throughput*150528*dtype_size__
 |Vision |Retinanet |OpenImages (800x800) | __C*H*W*dtype_size__ | __3*800*800*dtype_size__ | __throughput*1920000*dtype_size__
 |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[The average image size above is the average image size of the inference cases specified in https://github.com/mlcommons/inference/blob/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json[inference_cases.json].] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__
-|Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | __max_audio_duration*num_samples_per_sec*(bits_per_sample/8)__ | __15*16000*(16/8)__ | __throughput*480000__
 |Language |BERT |SQuAD v1.1 (max_seq_len=384) | __num_inputs*max_seq_len*dtype_size__ | __3*384*dtype_size__ | __throughput*1152*dtype_size__
 |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | __num_inputs*max_seq_len*dtype_size__ | __2048*dtype_size__ | __throughput*2048*dtype_size__
 |Language |Llama2 |OpenOrca (max_seq_len=1024) | __num_inputs*max_seq_len*dtype_size__ | __1024*dtype_size__ | __throughput*1024*dtype_size__
@@ -1050,7 +1040,6 @@ Datacenter systems must provide at least the following bandwidths from the outpu
 |Vision |Resnet50-v1.5 |ImageNet (224x224) | negligible | negligible | __> 0__
 |Vision |Retinanet |OpenImages (800x800) | negligible | negligible | __> 0__
 |Vision |3D UNET | KiTS 2019 | __avg(C*D*H*W)*dtype_size__footnote:3d_unet_bw[] | __32944795*dtype_size__ | __throughput*32944795*dtype_size__
-|Speech |RNNT |Librispeech dev-clean (samples < 15 seconds) | negligible | negligible | __> 0__
 |Language |BERT |SQuAD v1.1 (max_seq_len=384) | negligible | negligible | __> 0__
 |Language |GPT-J |CNN Dailymail (v3.0.0, max_seq_len=2048) | negligible | negligible | __> 0__
 |Commerce |DLRMv2 |Synthetic Multihot Criteo Dataset | negligible | negligible | __> 0__
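
Reviewer note: as a reading aid for the ingress-bandwidth rows retained in the last two hunks, here is a minimal sketch of how those formulas evaluate for a hypothetical system. It is not part of the rules text; the throughput figure, the 1-byte dtype_size, and the helper name are illustrative assumptions, while the per-sample element counts come directly from the table above.

[source,python]
----
# Element counts per input sample, taken from the ingress-bandwidth table.
ELEMENTS_PER_SAMPLE = {
    "resnet50-v1.5": 3 * 224 * 224,  # C*H*W = 150528 elements per image
    "retinanet": 3 * 800 * 800,      # C*H*W = 1920000 elements per image
    "bert": 3 * 384,                 # Token IDs, Input Masks, Segment IDs = 1152 elements
}

def min_ingress_bandwidth_bytes_per_sec(benchmark: str,
                                        throughput_samples_per_sec: float,
                                        dtype_size_bytes: int = 1) -> float:
    """Required ingress bandwidth = throughput * elements_per_sample * dtype_size."""
    return throughput_samples_per_sec * ELEMENTS_PER_SAMPLE[benchmark] * dtype_size_bytes

# Hypothetical example: 40,000 samples/s on Resnet50-v1.5 with 1-byte (int8) inputs
# needs at least 40000 * 150528 * 1 bytes/s, i.e. roughly 6.02 GB/s into ingestion.
print(min_ingress_bandwidth_bytes_per_sec("resnet50-v1.5", 40_000, 1))
----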