From f8c6233ec8835ca0aca46e7413846f092497811f Mon Sep 17 00:00:00 2001 From: Vijay Janapa Reddi Date: Tue, 18 Jun 2024 08:49:58 -0400 Subject: [PATCH] Updated the a bit based on Colby's updates --- contents/benchmarking/benchmarking.bib | 9 ++ contents/benchmarking/benchmarking.qmd | 143 +++++++++++-------------- 2 files changed, 72 insertions(+), 80 deletions(-) diff --git a/contents/benchmarking/benchmarking.bib b/contents/benchmarking/benchmarking.bib index c0f9dce1c..273602fce 100644 --- a/contents/benchmarking/benchmarking.bib +++ b/contents/benchmarking/benchmarking.bib @@ -49,6 +49,15 @@ @article{beyer2020we year = {2020}, } +@inproceedings{deng2009imagenet, + author = {Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li}, + title = {Imagenet: {A} large-scale hierarchical image database}, + booktitle = {2009 IEEE conference on computer vision and pattern recognition}, + pages = {248--255}, + year = {2009}, + organization = {Ieee}, +} + @inproceedings{brown2020language, author = {Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and Winter, Clemens and Hesse, Christopher and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario}, editor = {Larochelle, Hugo and Ranzato, Marc'Aurelio and Hadsell, Raia and Balcan, Maria-Florina and Lin, Hsuan-Tien}, diff --git a/contents/benchmarking/benchmarking.qmd b/contents/benchmarking/benchmarking.qmd index 061620cb3..f8f4046a1 100644 --- a/contents/benchmarking/benchmarking.qmd +++ b/contents/benchmarking/benchmarking.qmd @@ -19,9 +19,17 @@ This chapter will provide an overview of popular ML benchmarks, best practices f ## Learning Objectives -* Understand the purpose and goals of benchmarking AI systems. +* Understand the purpose and goals of benchmarking AI systems across model, data, and system dimensions. -* Learn how to design benchmarks and interpret their results. +* Learn about key model benchmarks, metrics, and trends, including accuracy, fairness, complexity, and efficiency. + +* Discover the importance of data-centric AI and data benchmarking to assess dataset quality, diversity, and efficiency. + +* Explore system benchmarking at different levels of granularity, from micro to end-to-end benchmarks, and key metrics for training and inference. + +* Recognize the value of an integrated approach that benchmarks the interplay between models, data, and systems. + +* Understand the challenges and emerging trends in AI benchmarking, including benchmarks for new technologies. ::: @@ -139,11 +147,11 @@ Macro benchmarks provide a holistic view, assessing the end-to-end performance o Examples: These benchmarks evaluate the AI model: -* [MLPerf Inference](https://github.com/mlcommons/inference)(@reddi2020mlperf): An industry-standard set of benchmarks for measuring the performance of machine learning software and hardware. 
MLPerf has a suite of dedicated benchmarks for specific scales, such as [MLPerf Mobile](https://github.com/mlcommons/mobile_app_open) for mobile class devices and [MLPerf Tiny](https://github.com/mlcommons/tiny), which focuses on microcontrollers and other resource-constrained devices.
+* [MLPerf Inference](https://github.com/mlcommons/inference) [@reddi2020mlperf]: An industry-standard set of benchmarks for measuring the performance of machine learning software and hardware. MLPerf has a suite of dedicated benchmarks for specific scales, such as [MLPerf Mobile](https://github.com/mlcommons/mobile_app_open) for mobile class devices and [MLPerf Tiny](https://github.com/mlcommons/tiny), which focuses on microcontrollers and other resource-constrained devices.

* [EEMBC's MLMark](https://github.com/eembc/mlmark): A benchmarking suite for evaluating the performance and power efficiency of embedded devices running machine learning workloads. This benchmark provides insights into how different hardware platforms handle tasks like image recognition or audio processing.

-* [AI-Benchmark](https://ai-benchmark.com/)(@ignatov2018ai): A benchmarking tool designed for Android devices, it evaluates the performance of AI tasks on mobile devices, encompassing various real-world scenarios like image recognition, face parsing, and optical character recognition.
+* [AI-Benchmark](https://ai-benchmark.com/) [@ignatov2018ai]: A benchmarking tool designed for Android devices, it evaluates the performance of AI tasks on mobile devices, encompassing various real-world scenarios like image recognition, face parsing, and optical character recognition.

#### End-to-end Benchmarks

@@ -171,23 +179,21 @@ Finally, organizations can make informed decisions on where to allocate resource

At its core, an AI benchmark is more than just a test or a score; it's a comprehensive evaluation framework. To understand this in-depth, let's break down the typical components that go into an AI benchmark.

-1. **Task & Datasets:**
-Datasets serve as the foundation for most AI benchmarks and specify the task that the model aims to achieve. They provide a consistent data set on which models are trained and evaluated, ensuring a level playing field for comparisons. When selecting the tasks in a benchmark, you must account for the task diversity (e.g., various data types, difficulty, & scale), in addition to the availability of a suitable dataset and the relevance of the task to meaningful real-world applications.
-Example: ImageNet, a large-scale dataset containing millions of labeled images spanning thousands of categories, is a popular benchmarking standard for image classification tasks.
+#### Task & Datasets
+
+Datasets serve as the foundation for most AI benchmarks and specify the task that the model aims to achieve. They provide a consistent data set on which models are trained and evaluated, ensuring a level playing field for comparisons. When selecting the tasks in a benchmark, you must account for the task diversity (e.g., various data types, difficulty, & scale), in addition to the availability of a suitable dataset and the relevance of the task to meaningful real-world applications. For example, [ImageNet](https://www.image-net.org), a large-scale dataset containing millions of labeled images spanning thousands of categories, is a popular benchmarking standard for image classification tasks [@deng2009imagenet].

-2. **Evaluation Metrics:***
-Once a task is defined, benchmarks require metrics to quantify performance.
These metrics offer objective measures to compare different models or systems. -In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks. +#### Evaluation Metrics +Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems. In classification tasks, metrics like accuracy, precision, recall, and [F1 score](https://en.wikipedia.org/wiki/F-score) are commonly used. Mean squared or absolute errors might be employed for regression tasks. -3. **Baselines:*** -Benchmarks often include baseline models or reference implementations. These serve as starting points or minimum performance standards against which new models or techniques can be compared. -Example: In many benchmark suites, simple models like linear regression or basic neural networks serve as baselines to provide context for more complex model evaluations. +#### Baselines +Benchmarks often include baseline models or reference implementations. These serve as starting points or minimum performance standards against which new models or techniques can be compared. In many benchmark suites, simple models like linear regression or basic neural networks serve as baselines to provide context for more complex model evaluations. -4. **Hardware and Software Specifications:** -Given the variability introduced by different hardware and software configurations, benchmarks often specify or document the hardware and software environments in which tests are conducted. -Example: An AI benchmark might note that evaluations were conducted on an NVIDIA Tesla V100 GPU using TensorFlow v2.4. +#### Hardware and Software Specifications -These components form the basis of a benchmark, but successful benchmarks go beyond the core components. To have interpretable and reproducible results, you often need to control for environmental conditions (e.g., temperature) and specify how the results should be interpreted and compared (e.g. latency/joule). +Given the variability introduced by different hardware and software configurations, benchmarks often specify or document the hardware and software environments in which tests are conducted. An AI benchmark might note that evaluations were conducted on an NVIDIA Tesla V100 GPU using TensorFlow v2.4. + +These components form the basis of a benchmark, but successful benchmarks go beyond the core components. To have interpretable and reproducible results, you often need to control for environmental conditions (e.g., temperature) and specify how the results should be interpreted and compared (e.g., latency per joule). These specifications are commonly referred to as "run rules." For example, in mobile AI benchmarks, the run rules might specify that the tests should be conducted at room temperature with devices plugged into a power source to eliminate battery-level variances. ### Training vs. Inference @@ -199,7 +205,7 @@ On the other hand, benchmarking inference evaluates model performance in real-wo ### Training Benchmarks -Training represents the phase where the system processes and ingests raw data to adjust and refine its parameters. Therefore, it is an algorithmic activity and involves system-level considerations, including data pipelines, storage, computing resources, and orchestration mechanisms. 
The goal is to ensure that the ML system can efficiently learn from data, optimizing both the model's performance and the system's resource utilization.
+[Training](../training/training.qmd) represents the phase where the system processes and ingests raw data to adjust and refine its parameters. Therefore, it is an algorithmic activity and involves system-level considerations, including data pipelines, storage, computing resources, and orchestration mechanisms. The goal is to ensure that the ML system can efficiently learn from data, optimizing both the model's performance and the system's resource utilization.

#### Purpose

@@ -235,37 +241,25 @@ By benchmarking for these types of metrics, we can obtain a comprehensive view o

Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for training machine learning systems.

-*[MLPerf Training Benchmark](https://github.com/mlcommons/training)*
-
-MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning.
-
-Metrics:
-
-* Training time to target quality
-* Throughput (examples per second)
-* Resource utilization (CPU, GPU, memory, disk I/O)
-
-*[DAWNBench](https://dawn.cs.stanford.edu/benchmark/)*
-
-DAWNBench [@coleman2017dawnbench] is a benchmark suite focusing on end-to-end deep learning training time and inference performance. It includes common tasks such as image classification and question answering.
-
-Metrics:
+**[DAWNBench](https://dawn.cs.stanford.edu/benchmark/):** DAWNBench [@coleman2017dawnbench] was the first benchmark suite to focus on end-to-end deep learning training time and, subsequently, inference performance. It includes common tasks such as image classification and question answering. Its metrics include:

* Time to train to target accuracy
* Inference latency
* Cost (in terms of cloud computing and storage resources)

-*[Fathom](https://github.com/rdadolf/fathom)*
-
-Fathom [@adolf2016fathom] is a benchmark from Harvard University that evaluates the performance of deep learning models using a diverse set of workloads. These include common tasks such as image classification, speech recognition, and language modeling.
-
-Metrics:
+**[Fathom](https://github.com/rdadolf/fathom):** Fathom [@adolf2016fathom] was one of the first benchmarks (from Harvard University) to evaluate the performance of deep learning models using a diverse set of workloads. These include common tasks such as image classification, speech recognition, and language modeling. Its metrics include:

* Operations per second (to measure computational efficiency)
* Time to completion for each workload
* Memory bandwidth

-*Example Use Case*
+**[MLPerf Training Benchmark](https://github.com/mlcommons/training):** MLPerf is a suite of benchmarks that grew out of DAWNBench, Fathom, and other community efforts such as [DeepBench](https://github.com/baidu-research/DeepBench), and is designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning.
Its metrics include:
+
+* Training time to target quality
+* Throughput (examples per second)
+* Resource utilization (CPU, GPU, memory, disk I/O)
+
+##### Example Use Case

Consider a scenario where we want to benchmark the training of an image classification model on a specific hardware platform.

@@ -298,10 +292,12 @@ Finally, it is vital to ensure that the model's predictions are not only accurat

1. **Accuracy:** Accuracy is one of the most vital metrics when benchmarking machine learning models. It quantifies the proportion of correct predictions made by the model compared to the true values or labels. For example, if a spam detection model can correctly classify 95 out of 100 email messages as spam or not, its accuracy would be calculated as 95%.

2. **Latency or Throughput:** The appropriate performance metric depends on the task. Latency is a performance metric that calculates the time lag or delay between the input receipt and the production of the corresponding output by the machine learning system. An example that clearly depicts latency is a real-time translation application; if a half-second delay exists from the moment a user inputs a sentence to the time the app displays the translated text, then the system's latency is 0.5 seconds.
-In many cases, the throughput is more important.Throughput assesses the system's capacity by measuring the number of inferences or predictions a machine learning model can handle within a specific unit of time. Consider a speech recognition system that employs a Recurrent Neural Network (RNN) as its underlying model; if this system can process and understand 50 different audio clips in a minute, then its throughput rate stands at 50 clips per minute.
-In some cases, you care about both metrics and measure latency-bounded throughput, which measures the maximum throughput of a system while still meeting a specified latency constraint.
-3. **Energy Efficiency:** Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference.
+    In many cases, the throughput is more important. Throughput assesses the system's capacity by measuring the number of inferences or predictions a machine learning model can handle within a specific unit of time. Consider a speech recognition system that employs a Recurrent Neural Network (RNN) as its underlying model; if this system can process and understand 50 different audio clips in a minute, then its throughput rate stands at 50 clips per minute.
+
+    In some cases, you care about both metrics and measure latency-bounded throughput, which measures the maximum throughput of a system while still meeting a specified latency constraint.
+
+3. **Energy Efficiency:** Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference.

Other inference considerations, such as memory consumption, are typically constraints rather than directly benchmarked metrics.
For example, if a system does not have enough memory to fit the model onto the device, then it can't run the model and produce a result. In some cases, one can use compression techniques, such as quantization, to make the model fit, but any negative impact of those techniques is captured in the accuracy metric.

#### Prominent Benchmarks

@@ -309,23 +305,14 @@

Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for inference machine learning systems.

-*[MLPerf Inference Benchmark](https://github.com/mlcommons/inference)*
-
-MLPerf Inference is a comprehensive benchmark suite that assesses machine learning models' performance during the inference phase. It encompasses a variety of workloads, including image classification, object detection, and natural language processing, aiming to provide standardized and insightful metrics for evaluating different inference systems.
-
-Metrics:
+**[MLPerf Inference Benchmark](https://github.com/mlcommons/inference):** MLPerf Inference is a comprehensive benchmark suite that assesses machine learning models' performance during the inference phase. It encompasses a variety of workloads, including image classification, object detection, and natural language processing, aiming to provide standardized and insightful metrics for evaluating different inference systems. Its metrics include:

* Inference time
-* Latency
-* Throughput
+* Latency and/or throughput
* Accuracy
* Energy consumption

-*[AI Benchmark](https://ai-benchmark.com/)*
-
-AI Benchmark is a benchmarking tool that evaluates the performance of AI and machine learning models on mobile devices and edge computing platforms. It includes tests for image classification, object detection, and natural language processing tasks, providing a detailed analysis of the inference performance on different hardware platforms.
-
-Metrics:
+**[AI Benchmark](https://ai-benchmark.com/):** AI Benchmark is a benchmarking tool that evaluates the performance of AI and machine learning models on mobile devices and edge computing platforms. It includes tests for image classification, object detection, and natural language processing tasks, providing a detailed analysis of the inference performance on different hardware platforms. Its metrics include:

* Inference time
* Latency
* Energy consumption
* Memory usage
* Throughput

-*[OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)*
-
-OpenVINO toolkit provides a benchmark tool to measure the performance of deep learning models for various tasks, such as image classification, object detection, and facial recognition, on Intel hardware. It offers detailed insights into the models' inference performance on different hardware configurations.
-
-Metrics:
+**[OpenVINO toolkit](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html):** The OpenVINO toolkit provides a benchmark tool to measure the performance of deep learning models for various tasks, such as image classification, object detection, and facial recognition, on Intel hardware. It offers detailed insights into the models' inference performance on different hardware configurations. Its metrics include:

* Inference time
* Throughput
* Latency
* CPU and GPU utilization

-*Example Use Case*
-
-Consider a scenario where we want to evaluate the inference performance of an object detection model on a specific edge device.
- -Task: The task is to perform real-time object detection on video streams, detecting and identifying objects such as vehicles, pedestrians, and traffic signs. - -Benchmark: We can use the AI Benchmark for this task as it evaluates inference performance on edge devices, which suits our scenario. +#### Example Use Case -Metrics: We will measure the following metrics: +Consider a scenario where we want to evaluate the inference performance of an object detection model on a specific edge device. The task is to perform real-time object detection on video streams, detecting and identifying objects such as vehicles, pedestrians, and traffic signs. We can use the AI Benchmark for this task as it evaluates inference performance on edge devices, which suits our scenario. We will measure the following metrics: * Inference time to process each video frame * Latency to generate the bounding boxes for detected objects @@ -400,10 +377,14 @@ MLPerf Tiny uses [EEMBCs EnergyRunner benchmark harness](https://github.com/eemb Baseline submissions are critical for contextualizing results and as a reference point to help participants get started. The baseline submission should prioritize simplicity and readability over state-of-the-art performance. The keyword spotting baseline uses a standard [STM microcontroller](https://www.st.com/en/microcontrollers-microprocessors.html) as its hardware and [TensorFlow Lite for Microcontrollers](https://www.tensorflow.org/lite/microcontrollers) (@david2021tensorflow) as its inference framework. #### Modular Design -MLPerf Tiny aims to support the benchmarking of any component of the ML system stack without sacrificing the comparability of two results. -Its modular design allows components to be swapped out for comparison or improvement. The reference implementations, shown in green and orange in @fig-ml-perf, act as the baseline for results. -In the closed division, the hardware can be swapped out while the model and dataset are fixed, leading to apples-to-apples comparisons between two devices. -However, the open division allows users to showcase their contributions and competitive advantage elsewhere in the stack by modifying a reference implementation (e.g. the model). In short, MLPerf Tiny offers a flexible and modular way to assess and enhance TinyML applications, making it easier to compare and improve different aspects of the technology. + +MLPerf Tiny [@banbury2020benchmarking] supports the benchmarking of any component of the ML system stack without sacrificing the comparability of two results. Its modular design allows components to be swapped out for comparison or improvement. The reference implementations, shown in green and orange in @fig-ml-perf, act as the baseline for results. + +In the closed division, the hardware can be swapped out while the model and dataset remain fixed, leading to apples-to-apples comparisons between two devices. This approach ensures that the performance differences observed can be attributed solely to the hardware changes, providing valuable insights into the capabilities of different devices. + +On the other hand, the open division allows users to showcase their contributions and competitive advantages elsewhere in the stack by modifying a reference implementation (e.g., the model). This flexibility enables participants to demonstrate their innovative solutions and optimizations across various components of the ML system. + +In short, MLPerf Tiny offers a flexible and modular way to assess and enhance TinyML applications. 
By allowing the benchmarking of individual components or the entire system, MLPerf Tiny makes it easier to compare and improve different aspects of the technology, ultimately driving advancements in TinyML.

![MLPerf Tiny modular design. Credit: @mattson2020mlperf.](images/png/mlperf_tiny.png){#fig-ml-perf}

While benchmarking provides a structured methodology for performance evaluation in complex domains like artificial intelligence and computing, the process also poses several challenges. If not properly addressed, these challenges can undermine the credibility and accuracy of benchmarking results. Some of the predominant difficulties faced in benchmarking include the following:

-* Incomplete problem coverage—Benchmark tasks may not fully represent the problem space. For instance, common image classification datasets like [CIFAR-10](https://www.cs.toronto.edu/kriz/cifar.html) have limited diversity in image types. Algorithms tuned for such benchmarks may fail to generalize well to real-world datasets.
-* Statistical insignificance - Benchmarks must have enough trials and data samples to produce statistically significant results. For example, benchmarking an OCR model on only a few text scans may not adequately capture its true error rates.
-* Limited reproducibility—Varying hardware, software versions, codebases, and other factors can reduce the reproducibility of benchmark results. MLPerf addresses this by providing reference implementations and environment specifications.
-* Misalignment with end goals - Benchmarks focusing only on speed or accuracy metrics may misalign real-world objectives like cost and power efficiency. Benchmarks must reflect all critical performance axes.
-* Rapid staleness—Due to the rapid pace of advancements in AI and computing, benchmarks and their datasets can quickly become outdated. Maintaining up-to-date benchmarks is thus a persistent challenge.
+* **Incomplete problem coverage:** Benchmark tasks may not fully represent the problem space. For instance, common image classification datasets like [CIFAR-10](https://www.cs.toronto.edu/kriz/cifar.html) have limited diversity in image types. Algorithms tuned for such benchmarks may fail to generalize well to real-world datasets.
+* **Statistical insignificance:** Benchmarks must have enough trials and data samples to produce statistically significant results. For example, benchmarking an OCR model on only a few text scans may not adequately capture its true error rates (a short measurement sketch after this list illustrates the repeated-trials idea).
+* **Limited reproducibility:** Varying hardware, software versions, codebases, and other factors can reduce the reproducibility of benchmark results. MLPerf addresses this by providing reference implementations and environment specifications.
+* **Misalignment with end goals:** Benchmarks focusing only on speed or accuracy metrics may misalign with real-world objectives like cost and power efficiency. Benchmarks must reflect all critical performance axes.
+* **Rapid staleness:** Due to the rapid pace of advancements in AI and computing, benchmarks and their datasets can quickly become outdated. Maintaining up-to-date benchmarks is thus a persistent challenge.

-But of all these, the most important challenge is benchmark engineering.
+However, of all these challenges, the most significant and problematic ones are the "hardware lottery" and benchmark engineering.
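To make the statistical-significance and reproducibility points above concrete, the sketch below shows one way to report an inference-latency measurement with a warm-up phase, repeated trials, and a confidence interval instead of a single number. It is only an illustrative sketch, not part of MLPerf or any other suite; `run_inference` is a hypothetical stand-in for whatever model invocation is being measured, and the warm-up and trial counts are arbitrary choices.

```python
import statistics
import time


def measure_latency(run_inference, warmup=10, trials=100):
    """Time repeated calls to `run_inference` and summarize the results.

    A warm-up phase absorbs one-time costs (caching, JIT compilation,
    frequency scaling) so they do not skew the reported numbers.
    """
    for _ in range(warmup):
        run_inference()

    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

    mean = statistics.mean(samples)
    # 95% confidence interval for the mean, assuming roughly normal samples.
    ci95 = 1.96 * statistics.stdev(samples) / (len(samples) ** 0.5)
    return {
        "mean_ms": mean,
        "ci95_ms": ci95,
        "p50_ms": statistics.median(samples),
        "p99_ms": statistics.quantiles(samples, n=100)[98],
    }


if __name__ == "__main__":
    # Placeholder workload; replace with a real model call (e.g., invoking a
    # TFLite interpreter) when benchmarking an actual system.
    stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
    print(f"latency: {stats['mean_ms']:.2f} ± {stats['ci95_ms']:.2f} ms "
          f"(p50 {stats['p50_ms']:.2f} ms, p99 {stats['p99_ms']:.2f} ms)")
```

Reporting the spread alongside the mean, and recording the hardware and software versions used for the run, makes it much easier to judge whether a difference between two results is real or just run-to-run noise.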
#### Hardware Lottery

-The ["hardware lottery"](https://arxiv.org/abs/2009.06489) in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models. @fig-hardware-lottery demonstrates the performance of different models on different hardware: notice how (follow the big yellow arrow) the Mobilenet V3 Large model (in green) has the lowest latency among all models when run unquantized on the Pixel4 CPU. At the same time, it performs the worst on Pixel4 DSP Qualcomm Snapdragon 855. Unfortunately, the hardware used is often omitted from papers or only briefly mentioned, making reproducing results difficult, if possible.
+The ["hardware lottery"](https://arxiv.org/abs/2009.06489) in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models.
+
+@fig-hardware-lottery demonstrates the performance of different models on different hardware: notice how (follow the big yellow arrow) the MobileNet V3 Large model (in green) has the lowest latency among all models when run unquantized on the Pixel 4 CPU. At the same time, it performs the worst on the Pixel 4's Qualcomm Snapdragon 855 DSP. Unfortunately, the hardware used is often omitted from papers or only briefly mentioned, making results difficult, if not impossible, to reproduce.

![Hardware Lottery.](images/png/hardware_lottery.png){#fig-hardware-lottery}

@@ -429,7 +412,7 @@ For instance, certain machine learning models may be designed and optimized to t

The "hardware lottery" can introduce challenges and biases in benchmarking machine learning systems, as the model's performance is not solely dependent on the model's architecture or algorithm but also on the compatibility and synergies with the underlying hardware. This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends.

-This has additional impacts on hardware benchmarks. Modern popular models are often optimized specifically for GPUs as they are the most common hardware for training and inference. This puts other hardware architectures at a disadvantage when compared against GPUs since so much engineering effort has been put into optimizing for that type of hardware architecture. This can cause a feedback loop where GPUs look the best on benchmarks, so new models are optimized for GPUs, so GPUs win the benchmarks, and so on. Without careful design of benchmarks, the field of machine learning can fall into a local minimum and not explore new and promising types of ML hardware architecture.
+This has additional impacts on hardware benchmarks. Modern popular models are often optimized specifically for GPUs as they are the most common hardware for training and inference. This puts other hardware architectures at a disadvantage when compared against GPUs since so much engineering effort has been put into optimizing for that type of hardware architecture. This can cause a feedback loop where GPUs look the best on benchmarks, so new models are optimized for GPUs, so GPUs win the benchmarks, and so on. Without careful design of benchmarks, machine learning can fall into a local minimum and not explore new and promising types of ML hardware architecture. #### Benchmark Engineering @@ -671,7 +654,7 @@ Several approaches can be taken to improve data quality. These methods include a * **Active Learning:** This is a semi-supervised learning approach where the model actively queries a human oracle to label the most informative samples [@coleman2022similarity]. This ensures that the model is trained on the most relevant data. * Dimensionality Reduction: Techniques like PCA can reduce the number of features in a dataset, thereby reducing complexity and training time. -There are many other methods in the wild. But the goal is the same. Refining the dataset and ensuring it is of the highest quality can reduce the training time required for models to converge. However, achieving this requires developing and implementing sophisticated methods, algorithms, and techniques that can clean, preprocess, and augment data while retaining the most informative samples. This is an ongoing challenge that will require continued research and innovation in the field of machine learning. +There are many other methods in the wild. But the goal is the same. Refining the dataset and ensuring it is of the highest quality can reduce the training time required for models to converge. However, achieving this requires developing and implementing sophisticated methods, algorithms, and techniques that can clean, preprocess, and augment data while retaining the most informative samples. This is an ongoing challenge that will require continued research and innovation in machine learning. ## The Trifecta