From 808fa062085b56b562fd9a8d11dc85cbf7b6734b Mon Sep 17 00:00:00 2001
From: Arjun Suresh
Date: Tue, 12 Dec 2023 16:28:58 +0000
Subject: [PATCH 1/2] Update inference_rules.adoc

---
 inference_rules.adoc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/inference_rules.adoc b/inference_rules.adoc
index a86dec7..5719a72 100644
--- a/inference_rules.adoc
+++ b/inference_rules.adoc
@@ -137,10 +137,12 @@ described in the table below.
 |Scenario |Query Generation |Duration |Samples/query |Latency Constraint |Tail Latency | Performance Metric
 |Single stream |LoadGen sends next query as soon as SUT completes the previous query | 600 seconds |1 |None |90%* | 90%-ile early-stopping latency estimate
 |Server |LoadGen sends new queries to the SUT according to a Poisson distribution | 600 seconds |1 |Benchmark specific |99%* | Maximum Poisson throughput parameter supported
-|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24,576 |None |N/A | Measured throughput
+|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24576** |None |N/A | Measured throughput
 |Multistream | Loadgen sends next query, as soon as SUT completes the previous query | 600 seconds | 8 | None | 99%* | 99%-ile early-stopping latency estimate|
 |===
 
+ ** - If the dataset used for the accuracy run of the benchmark task is of size less than 24576 say `N`, then the Offline scenario query only needs to have at least `N` samples.
+
 An early stopping criterion (described in more detail in <>) allows for runs with a relatively small number of processed queries to be valid, with the penalty that the effective computed percentile will be slightly higher. This penalty counteracts the increased variance inherent to runs with few queries, where there is a higher probability that a particular run will, by chance, report a lower latency than the system should reliably support.
 
 In the above table, tail latency percentiles with an asterisk represent the theoretical lower limit of measured percentile for runs processing a very large number of queries. Submitters may opt to run for longer than the time listed in the "Duration" column, in order to decrease the effect of the early stopping penalty. See the following table for a suggested starting point for how to set the minimum number of queries.

From 7e12bdfb4587da8c5f4b76001785c0edba0bc697 Mon Sep 17 00:00:00 2001
From: Arjun Suresh
Date: Tue, 12 Dec 2023 16:30:31 +0000
Subject: [PATCH 2/2] Update inference_rules.adoc

---
 inference_rules.adoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/inference_rules.adoc b/inference_rules.adoc
index 5719a72..796f7d3 100644
--- a/inference_rules.adoc
+++ b/inference_rules.adoc
@@ -137,11 +137,11 @@ described in the table below.
 |Scenario |Query Generation |Duration |Samples/query |Latency Constraint |Tail Latency | Performance Metric
 |Single stream |LoadGen sends next query as soon as SUT completes the previous query | 600 seconds |1 |None |90%* | 90%-ile early-stopping latency estimate
 |Server |LoadGen sends new queries to the SUT according to a Poisson distribution | 600 seconds |1 |Benchmark specific |99%* | Maximum Poisson throughput parameter supported
-|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24576** |None |N/A | Measured throughput
+|Offline |LoadGen sends all samples to the SUT at start in a single query | 1 query and 600 seconds | At least 24,576** |None |N/A | Measured throughput
 |Multistream | Loadgen sends next query, as soon as SUT completes the previous query | 600 seconds | 8 | None | 99%* | 99%-ile early-stopping latency estimate|
 |===
 
- ** - If the dataset used for the accuracy run of the benchmark task is of size less than 24576 say `N`, then the Offline scenario query only needs to have at least `N` samples.
+ ** - If the dataset used for the accuracy run of the benchmark task is of size less than 24,576 say `N`, then the Offline scenario query only needs to have at least `N` samples.
 
 An early stopping criterion (described in more detail in <>) allows for runs with a relatively small number of processed queries to be valid, with the penalty that the effective computed percentile will be slightly higher. This penalty counteracts the increased variance inherent to runs with few queries, where there is a higher probability that a particular run will, by chance, report a lower latency than the system should reliably support.
 
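
Note on the `**` footnote introduced by this series: read literally, the Offline minimum is the smaller of 24,576 and the accuracy dataset size `N`. The following is a minimal Python sketch of that reading only. The function and constant names are hypothetical, chosen for illustration; they are not part of LoadGen or of the rules document.

```python
# Hypothetical helper, not a LoadGen API: it only illustrates the footnote's rule.
OFFLINE_MIN_SAMPLES = 24_576  # default Offline minimum from the scenario table


def offline_min_samples(accuracy_dataset_size: int) -> int:
    """Return the minimum number of samples the Offline query must contain."""
    if accuracy_dataset_size <= 0:
        raise ValueError("accuracy_dataset_size must be positive")
    # If the accuracy dataset has fewer than 24,576 samples (size N),
    # the Offline query only needs at least N samples.
    return min(OFFLINE_MIN_SAMPLES, accuracy_dataset_size)
```

Under this reading, a 5,000-sample accuracy dataset gives a minimum of 5,000 samples, while a 50,000-sample dataset keeps the default minimum of 24,576.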