figure references - part 1 (first 10 chapters) #158

Merged
8 changes: 8 additions & 0 deletions contents/efficient_ai/efficient_ai.bib
@@ -61,6 +61,14 @@ @article{lecun1989optimal
year = {1989}
}

@article{schizas2022tinyml,
author = {Schizas, Nikolaos and Karras, Aristeidis and Karras, Christos and Sioutas, Spyros},
journal = {Future Internet},
title = {TinyML for Ultra-Low Power AI and Large Scale IoT Deployments: A Systematic Review},
doi = {10.3390/fi14120363},
year = {2022}
}

@misc{han2016deep,
archiveprefix = {arXiv},
author = {Han, Song and Mao, Huizi and Dally, William J.},
20 changes: 11 additions & 9 deletions contents/efficient_ai/efficient_ai.qmd
@@ -36,9 +36,9 @@ Training models can consume a significant amount of energy, sometimes equivalent

## The Need for Efficient AI

Efficiency takes on different connotations based on where AI computations occur. Let's take a brief moment to revisit and differentiate between Cloud, Edge, and TinyML in terms of efficiency.
Efficiency takes on different connotations based on where AI computations occur. Let's take a brief moment to revisit and differentiate between Cloud, Edge, and TinyML in terms of efficiency. @fig-platforms provides a big-picture comparison of the three platforms.

![Cloud, Mobile and TinyML.](https://www.mdpi.com/futureinternet/futureinternet-14-00363/article_deploy/html/images/futureinternet-14-00363-g001-550.jpg){#fig-platforms}
![Cloud, Mobile and TinyML. Credit: @schizas2022tinyml.](https://www.mdpi.com/futureinternet/futureinternet-14-00363/article_deploy/html/images/futureinternet-14-00363-g001-550.jpg){#fig-platforms}

For cloud AI, traditional AI models often ran in large-scale data centers equipped with powerful GPUs and TPUs [@barroso2019datacenter]. Here, efficiency pertains to optimizing computational resources, reducing costs, and ensuring timely data processing and return. However, relying on the cloud introduced latency, especially when dealing with large data streams that needed to be uploaded, processed, and then downloaded.

@@ -62,13 +62,13 @@ Choosing the right model architecture is as crucial as optimizing it. In recent

Model compression methods are essential for bringing deep learning models to devices with limited resources. These techniques reduce the size, energy consumption, and computational demands of models without a significant loss in accuracy. At a high level, they can be grouped into the following fundamental approaches:

**Pruning**: This is akin to trimming the branches of a tree. This was first thought of in the [Optimal Brain Damage](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf) paper [@lecun1989optimal]. This was later popularized in the context of deep learning by @han2016deep. In pruning, certain weights or even entire neurons are removed from the network, based on specific criteria. This can significantly reduce the model size. There are various strategies, like weight pruning, neuron pruning, and structured pruning. We will explore these in more detail in @sec-pruning. In the example in @fig-pruning, removing some of the nodes in the inner layers reduces the numbers of edges between the nodes and, in turn, the size of the model.
**Pruning**: This is akin to trimming the branches of a tree. This was first proposed in the [Optimal Brain Damage](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf) paper [@lecun1989optimal]. It was later popularized in the context of deep learning by @han2016deep. In pruning, certain weights or even entire neurons are removed from the network based on specific criteria. This can significantly reduce the model size. There are various strategies, like weight pruning, neuron pruning, and structured pruning. We will explore these in more detail in @sec-pruning. @fig-pruning is an example of neural network pruning: removing some of the nodes in the inner layers (based on specific criteria) reduces the number of edges between the nodes and, in turn, the size of the model.

![Pruning applies different criteria that determine which nodes and/or weights can be removed without having significant impact on the model's performance.](images/jpg/pruning.jpeg){#fig-pruning}
![Neural Network Pruning.](images/jpg/pruning.jpeg){#fig-pruning}
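
To make the pruning idea concrete, here is a minimal sketch of magnitude-based weight pruning written in plain NumPy. The function name, the 50% sparsity target, and the random layer are illustrative assumptions, not code from the book or from any framework.

```python
# Minimal sketch of magnitude-based weight pruning (illustrative only).
import numpy as np

def prune_weights(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is removed."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep only the larger weights
    return weights * mask

rng = np.random.default_rng(0)
layer = rng.normal(size=(128, 128)).astype(np.float32)
pruned = prune_weights(layer, sparsity=0.5)
print(f"sparsity achieved: {np.mean(pruned == 0):.2%}")
```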

**Quantization**: Quantization is the process of constraining an input from a large set to output in a smaller set. In deep learning, this primarily means reducing the number of bits that represent the weights and biases of the model. For example, using 16-bit or 8-bit representations instead of 32-bit can reduce model size and speed up computations, with a minor trade-off in accuracy. We will explore these in more detail in @sec-quant. @fig-quantization shows an example of quantization by rounding to the closest number. The conversion from 32-bit floating point to 16-bit reduces memory usage by 50%, and going from 32-bit to an 8-bit integer reduces it by 75%. While the loss in numeric precision, and consequently in model performance, is usually minor, the gain in memory efficiency is significant.

![One method of quantization involves rounding to the nearest representable number. Quantization helps save on memory while minimizing performance loss.](images/jpg/quantization.jpeg){#fig-quantization}
![Different forms of quantization.](images/jpg/quantization.jpeg){#fig-quantization}
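
As a concrete illustration of the memory figures above, the sketch below performs symmetric 8-bit quantization by rounding, in plain NumPy. The scaling scheme, function name, and example tensor are illustrative assumptions rather than the exact mechanism used by any particular framework.

```python
# Minimal sketch of symmetric int8 quantization by rounding (illustrative only).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 using a single symmetric scale factor."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

weights = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q_weights, scale = quantize_int8(weights)
reconstructed = q_weights.astype(np.float32) * scale

print(f"float32 size: {weights.nbytes / 1024:.0f} KiB")    # 256 KiB
print(f"int8 size:    {q_weights.nbytes / 1024:.0f} KiB")  # 64 KiB, i.e. 75% smaller
print(f"max abs error: {np.max(np.abs(weights - reconstructed)):.4f}")
```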

**Knowledge Distillation**: Knowledge distillation involves training a smaller model (student) to replicate the behavior of a larger model (teacher). The idea is to transfer the knowledge from the cumbersome model to the lightweight one, so the smaller model attains performance close to its larger counterpart but with significantly fewer parameters. We will explore knowledge distillation in more detail in @sec-kd.
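
To show what "transferring knowledge" looks like in practice, the following sketch computes a typical distillation loss that blends the teacher's temperature-softened output distribution with the ordinary hard-label cross-entropy. The temperature, weighting factor, and random logits are assumptions for illustration only.

```python
# Minimal sketch of a knowledge-distillation loss (illustrative only).
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft-target loss (teacher) with a hard-label cross-entropy loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Cross-entropy between teacher and student soft distributions, scaled by
    # T**2 so the soft-target term keeps a comparable gradient magnitude.
    soft_loss = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-9), axis=-1)) * T**2
    p_hard = softmax(student_logits)
    hard_loss = -np.mean(np.log(p_hard[np.arange(len(labels)), labels] + 1e-9))
    return alpha * soft_loss + (1 - alpha) * hard_loss

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10))      # logits from a large "teacher" model
student = rng.normal(size=(8, 10))      # logits from a small "student" model
labels = rng.integers(0, 10, size=8)    # ground-truth class indices
print(f"combined loss: {distillation_loss(student, teacher, labels):.3f}")
```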

@@ -80,7 +80,7 @@ Model compression methods are very important for bringing deep learning models t

[Edge TPUs](https://cloud.google.com/edge-tpu) are a smaller, power-efficient version of Google's TPUs, tailored for edge devices. They provide fast on-device ML inferencing for TensorFlow Lite models. Edge TPUs allow for low-latency, high-efficiency inference on edge devices like smartphones, IoT devices, and embedded systems. This means AI capabilities can be deployed in real-time applications without needing to communicate with a central server, thus saving bandwidth and reducing latency. Consider the table in @fig-edge-tpu-perf. It shows the performance differences between running different models on CPUs versus on a Coral USB accelerator. The Coral USB accelerator is an accessory from Google's Coral AI platform that lets developers connect Edge TPUs to Linux computers. Running inference on the Edge TPUs was 70 to 100 times faster than on CPUs.

![Many applications require very high-performance inference, which can be achieved with on device accelerators such as Edge TPUs. Source: [TensorFlow Blog](https://blog.tensorflow.org/2019/03/build-ai-that-works-offline-with-coral.html)](images/png/tflite_edge_tpu_perf.png){#fig-edge-tpu-perf}
![Accelerator vs CPU performance comparison. Credit: [TensorFlow Blog.](https://blog.tensorflow.org/2019/03/build-ai-that-works-offline-with-coral.html)](images/png/tflite_edge_tpu_perf.png){#fig-edge-tpu-perf}
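
For a sense of what on-device inference looks like in code, below is a hedged sketch of invoking a TensorFlow Lite model through the Edge TPU delegate on a Coral accelerator. The model filename is a placeholder, and the snippet assumes a Linux host with the Coral runtime (`tflite_runtime` and `libedgetpu`) installed; the numbers in the figure come from Google's benchmark, not from this snippet.

```python
# Minimal sketch of TFLite inference through the Edge TPU delegate (illustrative only).
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",               # placeholder model file
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed a dummy input of the expected shape/dtype and run one inference.
dummy_input = np.zeros(input_details["shape"], dtype=input_details["dtype"])
interpreter.set_tensor(input_details["index"], dummy_input)
interpreter.invoke()
print(interpreter.get_tensor(output_details["index"]).shape)
```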

**NN Accelerators**: Fixed-function neural network accelerators are hardware accelerators designed explicitly for neural network computations. These can be standalone chips or part of a larger system-on-chip (SoC) solution. By optimizing the hardware for the specific operations that neural networks require, such as matrix multiplications and convolutions, NN accelerators can achieve faster inference times and lower power consumption compared to general-purpose CPUs and GPUs. They are especially beneficial in TinyML devices with power or thermal constraints, such as smartwatches, micro-drones, or robots.

@@ -104,7 +104,9 @@ There are also several other numerical formats that fall into an exotic class. A

By retaining the 8-bit exponent of FP32, BF16 offers a similar dynamic range, which is crucial for deep learning tasks where certain operations can result in very large or very small numbers. At the same time, by truncating the mantissa, BF16 reduces memory and computational requirements compared to FP32. BF16 has emerged as a promising middle ground in the landscape of numerical formats for deep learning, providing an efficient and effective alternative to the more traditional FP32 and FP16 formats.

![Three floating-point formats. Source: [Google blog](google.com)](https://storage.googleapis.com/gweb-cloudblog-publish/images/Three_floating-point_formats.max-624x261.png){#fig-fp-formats}
@fig-float-point-formats shows three different floating-point formats: Float32, Float16, and BFloat16.

![Three floating-point formats.](images/jpg/three_float_types.jpeg){#fig-float-point-formats width=90%}
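
To make the BF16 trade-off concrete, the sketch below converts float32 values to a bfloat16 bit pattern by keeping only the upper 16 bits of the IEEE-754 representation (real hardware typically rounds rather than truncates). The helper names are illustrative assumptions and are not taken from any library.

```python
# Minimal sketch of float32 -> bfloat16 truncation (illustrative only).
# BF16 keeps the sign bit, the 8 exponent bits, and 7 mantissa bits of FP32.
import numpy as np

def float32_to_bfloat16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 to a bfloat16 bit pattern stored in uint16."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)          # drop the low 16 mantissa bits

def bfloat16_bits_to_float32(b: np.ndarray) -> np.ndarray:
    """Reinterpret the bfloat16 bit pattern as float32 for inspection."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.1415926, 1e-20, 6.02e23], dtype=np.float32)
bf16 = float32_to_bfloat16_bits(x)
# Range is preserved (no underflow/overflow), only precision is reduced.
print(bfloat16_bits_to_float32(bf16))
```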

**Integer**: These are integer representations using 8, 4, and 2 bits. They are often used during the inference phase of neural networks, where the weights and activations of the model are quantized to these lower precisions. Integer representations are deterministic and offer significant speed and memory advantages over floating-point representations. For many inference tasks, especially on edge devices, the slight loss in accuracy due to quantization is often acceptable given the efficiency gains. An extreme form of integer numerics appears in binary neural networks (BNNs), where weights and activations are constrained to one of two values: +1 or -1.
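
As a small illustration of the extreme end of integer numerics, the sketch below binarizes a weight matrix to +1/-1 and bit-packs it so each weight occupies a single bit. The shapes and names are illustrative assumptions.

```python
# Minimal sketch of weight binarization and bit-packing (illustrative only).
import numpy as np

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
w_bin = np.where(w >= 0, 1, -1).astype(np.int8)    # sign binarization to +1/-1
packed = np.packbits(w_bin.reshape(-1) > 0)        # 1 bit per weight

print(f"float32: {w.nbytes} bytes, binarized and packed: {packed.nbytes} bytes")  # 16384 vs 512
```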

@@ -163,9 +165,9 @@ Moreover, the optimal model choice isn't always universal but often depends on t

Another important consideration is the relationship between model complexity and its practical benefits. Take voice-activated assistants, such as "Alexa" or "OK Google," as an example. While a complex model might demonstrate a marginally superior understanding of user speech, if it's slower to respond than a simpler counterpart, the user experience could be compromised. Thus, adding layers or parameters doesn't always equate to better real-world outcomes.

Furthermore, while benchmark datasets, such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], Google Speech Commands [@warden2018speech], etc. provide a standardized performance metric, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. So if a model was trained on pictures of stoves found in wealth countries only, it will fail to recognize stoves from poorer regions.
Furthermore, while benchmark datasets, such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], Google Speech Commands [@warden2018speech], etc. provide a standardized performance metric, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. Stoves have different shapes and technological levels across regions and income levels, so a model trained only on pictures of stoves found in wealthy countries may perform well on a benchmark yet fail to recognize stoves from poorer regions in real-world applications.

![Objects, such as stoves, have different shapes and technological levels in differen regions. A model that is not trained on diverse datasets might perform well on a benchmark but fail in real-world applications. Source: Dollar Street stove images.](https://pbs.twimg.com/media/DmUyPSSW0AAChGa.jpg){#fig-stoves}
![Different types of stoves. Credit: Dollar Street stove images.](https://pbs.twimg.com/media/DmUyPSSW0AAChGa.jpg){#fig-stoves}

In essence, a thorough comparative analysis transcends numerical metrics. It's a holistic assessment, intertwined with real-world applications, costs, and the intricate subtleties that each model brings to the table. This is why it becomes important to have standard benchmarks and metrics that are widely established and adopted by the community.

23 changes: 22 additions & 1 deletion contents/optimizations/optimizations.bib
@@ -462,6 +462,7 @@ @article{DBLP:journals/corr/abs-1909-05840
biburl = {https://dblp.org/rec/journals/corr/abs-1909-05840.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{koren2009matrix,
title={Matrix factorization techniques for recommender systems},
author={Koren, Yehuda and Bell, Robert and Volinsky, Chris},
@@ -471,4 +472,24 @@ @article{koren2009matrix
pages={30--37},
year={2009},
publisher={IEEE}
}
}

@article{annette2020,
title={ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models},
author={Wess, Matthias and Ivanov, Matvey and Unger, Christoph and Nookala, Anvesh},
journal={IEEE Access},
doi={10.1109/ACCESS.2020.3047259},
year={2020},
publisher={IEEE}
}

@inproceedings{alexnet2012,
author = {Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E},
booktitle = {Advances in Neural Information Processing Systems},
editor = {F. Pereira and C.J. Burges and L. Bottou and K.Q. Weinberger},
publisher = {Curran Associates, Inc.},
title = {ImageNet Classification with Deep Convolutional Neural Networks},
url = {https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf},
volume = {25},
year = {2012}
}