Expanded the algorithms section and added references
profvjreddi committed Nov 30, 2023
1 parent c7be42a commit 20fe3b7
Showing 2 changed files with 78 additions and 9 deletions.
30 changes: 30 additions & 0 deletions references.bib
@@ -16,6 +16,13 @@ @inproceedings{abadi2016deep
pages = {308--318},
}

@article{ruder2016overview,
  title = {An overview of gradient descent optimization algorithms},
  author = {Ruder, Sebastian},
  journal = {arXiv preprint arXiv:1609.04747},
  year = {2016},
}

@inproceedings{abadi2016tensorflow,
title = {{TensorFlow}: a system for {Large-Scale} machine learning},
author = {Abadi, Mart{\'\i}n and Barham, Paul and Chen, Jianmin and Chen, Zhifeng and Davis, Andy and Dean, Jeffrey and Devin, Matthieu and Ghemawat, Sanjay and Irving, Geoffrey and Isard, Michael and others},
@@ -33,6 +40,29 @@ @inproceedings{adolf2016fathom
organization = {IEEE},
}

@inproceedings{DBLP:journals/corr/KingmaB14,
author = {Diederik P. Kingma and
Jimmy Ba},
editor = {Yoshua Bengio and
Yann LeCun},
title = {Adam: {A} Method for Stochastic Optimization},
booktitle = {3rd International Conference on Learning Representations, {ICLR} 2015,
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings},
year = {2015},
url = {http://arxiv.org/abs/1412.6980},
timestamp = {Thu, 25 Jul 2019 14:25:37 +0200},
biburl = {https://dblp.org/rec/journals/corr/KingmaB14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{dahl2023benchmarking,
  title = {Benchmarking Neural Network Training Algorithms},
  author = {Dahl, George E and Schneider, Frank and Nado, Zachary and Agarwal, Naman and Sastry, Chandramouli Shama and Hennig, Philipp and Medapati, Sourabh and Eschenhagen, Runa and Kasimbeg, Priya and Suo, Daniel and others},
  journal = {arXiv preprint arXiv:2306.07179},
  year = {2023},
}

@article{afib,
title = {Mobile Photoplethysmographic Technology to Detect Atrial Fibrillation},
author = {Yutao Guo and Hao Wang and Hui Zhang and Tong Liu and Zhaoguang Liang and Yunlong Xia and Li Yan and Yunli Xing and Haili Shi and Shuyan Li and Yanxia Liu and Fan Liu and Mei Feng and Yundai Chen and Gregory Y.H. Lip and null null},
57 changes: 48 additions & 9 deletions training.qmd
@@ -413,20 +413,59 @@ A good approach is to keep the validation set use minimal - hyperparameters can

Care should be taken not to overfit when assessing performance on the validation set. Tradeoffs are needed to build models that perform well on the overall population rather than being overly tuned to the validation samples.

## Optimization Algorithms

Stochastic gradient descent (SGD) is a simple yet powerful optimization algorithm commonly used to train machine learning models. SGD estimates the gradient of the loss function with respect to the model parameters from a single training example and then updates the parameters in the direction that reduces the loss. While the basic update is straightforward, finding a set of parameters that minimizes the loss across the entire dataset can be difficult because the loss landscape of a neural network is nonconvex.
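
To make the update rule concrete, here is a minimal NumPy sketch of a single-example SGD step for a linear model with squared-error loss. The names (`sgd_step`, `w`, `lr`) and the toy data are illustrative, not from any particular library.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.05):
    """One SGD update for a linear model with squared-error loss.

    w  : parameter vector, shape (d,)
    x  : a single training example, shape (d,)
    y  : scalar target
    lr : learning rate (step size)
    """
    y_hat = x @ w                 # model prediction
    grad = 2.0 * (y_hat - y) * x  # gradient of (y_hat - y)^2 with respect to w
    return w - lr * grad          # step in the direction that reduces the loss

# Toy usage: recover w* = [2, -3] from noisy samples, one example at a time.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)
    y = 2.0 * x[0] - 3.0 * x[1] + 0.01 * rng.normal()
    w = sgd_step(w, x, y, lr=0.05)
print(w)  # approximately [2, -3]
```

With too large a learning rate the same loop diverges, and with a much smaller one it converges very slowly, which is exactly the tuning difficulty discussed next.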

While conceptually straightforward, SGD suffers from a few shortcomings. First, choosing a proper learning rate can be difficult: too small and progress is very slow, too large and the parameters may oscillate and fail to converge. Second, SGD treats all parameters equally and independently, which may not be ideal in all cases. Finally, vanilla SGD uses only first-order gradient information, which results in slow progress on ill-conditioned problems.

### Optimizations

To address these shortcomings, various optimizations of vanilla SGD have been proposed over the years. These optimizers aim to improve the efficiency and convergence speed of training by adjusting learning rates, incorporating momentum, and implementing adaptive strategies, among other techniques. @ruder2016overview gives an excellent overview of the different optimizers. Briefly, several commonly used SGD optimization techniques include:

**Momentum:** Accumulates a velocity vector in directions of persistent gradient across iterations. This dampens oscillations and maintains progress along consistent directions, helping to accelerate convergence.

**Nesterov Accelerated Gradient (NAG):** A variant of momentum that computes gradients at the "look ahead" position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.

**Adagrad:** An adaptive learning rate algorithm that maintains a per-parameter learning rate, scaled down in proportion to the historical sum of squared gradients for that parameter. This helps eliminate the need to manually tune learning rates.

**Adadelta:** A modification of Adagrad that restricts the window of accumulated past gradients, thus reducing the aggressive decay of learning rates.

**RMSProp:** Divides the learning rate by an exponentially decaying average of squared gradients. This has a normalizing effect similar to Adagrad but, because it does not accumulate gradients over the entire history, it avoids the rapid decay of learning rates.

**Adam:** Combines momentum with RMSProp-style adaptive learning rates, keeping exponentially decaying averages of both past gradients and past squared gradients [@DBLP:journals/corr/KingmaB14]. It displays very fast initial progress and automatically tunes step sizes.
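
To make these update rules concrete, the sketch below implements the momentum and Adam updates in NumPy, following the descriptions above. The helper names (`momentum_step`, `adam_step`) and the default hyperparameter values are illustrative choices, not a library API.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity vector along persistent gradient directions."""
    v = beta * v + grad   # exponentially weighted accumulation of gradients
    w = w - lr * v
    return w, v

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    m = beta1 * m + (1 - beta1) * grad       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early iterations
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w, with momentum.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, v, 2.0 * w, lr=0.01, beta=0.9)
print(w)  # approximately [0, 0]; adam_step can be swapped in the same way
```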

Of these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, often outperforming vanilla SGD in both training speed and final performance. Other optimizers may be better suited in some cases, particularly for simpler models.

### Trade-offs

The table below summarizes the pros and cons of the main optimization algorithms for neural network training:

| Algorithm | Pros | Cons |
|---|---|---|
| Momentum | Faster convergence due to acceleration along persistent gradient directions; less oscillation than vanilla SGD | Requires tuning of the momentum parameter |
| Nesterov Accelerated Gradient (NAG) | Faster than standard momentum in some cases; anticipatory updates help prevent overshooting | More complex to understand intuitively |
| Adagrad | Eliminates the need to manually tune learning rates; performs well on sparse gradients | Learning rate may decay too quickly on dense gradients |
| Adadelta | Less aggressive learning rate decay than Adagrad | Still sensitive to the initial learning rate value |
| RMSProp | Automatically adjusts learning rates; works well in practice | Base learning rate and decay rate still need to be chosen |
| Adam | Combines momentum and adaptive learning rates; efficient and fast convergence | Slightly worse generalization performance in some cases |
| AMSGrad | Improvement on Adam that addresses its generalization issue | Not as extensively used/tested as Adam |

### Benchmarking Algorithms

No single method is best for all problem types, so comprehensive benchmarking is needed to identify the most effective optimizer for a given dataset and model. The performance of algorithms like Adam, RMSProp, and Momentum varies with factors such as batch size, learning rate schedule, model architecture, data distribution, and regularization. These variations underline the importance of evaluating each optimizer under diverse conditions.

Take Adam, for example, which often excels in computer vision tasks, while RMSProp may generalize better in certain natural language processing tasks. Momentum's strength lies in its acceleration when gradient directions are consistent, whereas Adagrad's adaptive learning rates are better suited to sparse-gradient problems.

This wide array of interactions among optimizers, models, and data demonstrates the difficulty of declaring a single, universally superior algorithm. Each optimizer has unique strengths, making it crucial to evaluate a range of methods empirically and discover the conditions under which each performs best.

A comprehensive benchmarking approach should assess not just the speed of convergence but also generalization error, stability, sensitivity to hyperparameters, and computational efficiency. This entails monitoring training and validation learning curves across multiple runs and comparing optimizers on a variety of datasets and models to understand their strengths and weaknesses.
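
As a small-scale illustration of this kind of comparison (not the AlgoPerf harness itself), the PyTorch sketch below trains the same small model with several optimizers and records their training-loss curves. The architecture, synthetic dataset, learning rates, and epoch count are placeholder choices; a real benchmark would also tune hyperparameters, track validation metrics, and repeat runs across seeds.

```python
import torch
import torch.nn as nn

def make_model():
    # Re-seed so every optimizer starts from an identical initialization.
    torch.manual_seed(0)
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Synthetic regression data standing in for a real dataset.
torch.manual_seed(1)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)

optimizers = {
    "sgd":      lambda params: torch.optim.SGD(params, lr=0.01),
    "momentum": lambda params: torch.optim.SGD(params, lr=0.01, momentum=0.9),
    "rmsprop":  lambda params: torch.optim.RMSprop(params, lr=0.001),
    "adam":     lambda params: torch.optim.Adam(params, lr=0.001),
}

loss_curves = {}
for name, make_optimizer in optimizers.items():
    model, loss_fn = make_model(), nn.MSELoss()
    optimizer = make_optimizer(model.parameters())
    curve = []
    for epoch in range(100):          # full-batch updates for simplicity
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        curve.append(loss.item())
    loss_curves[name] = curve         # learning curve for later plotting/comparison
    print(f"{name:9s} final training loss: {curve[-1]:.4f}")
```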

AlgoPerf, introduced by @dahl2023benchmarking, addresses the need for a robust benchmarking system. This platform evaluates optimizer performance using criteria such as training loss curves, generalization error, sensitivity to hyperparameters, and computational efficiency. AlgoPerf tests various optimization methods, including Adam, LAMB, and Adafactor, across different model types like CNNs and RNNs/LSTMs on established datasets. It utilizes containerization and automatic metric collection to minimize inconsistencies and allows for controlled experiments across thousands of configurations, providing a reliable basis for comparing different optimizers.

The insights gained from AlgoPerf and similar benchmarks are invaluable for guiding the optimal choice or tuning of optimizers. By enabling reproducible evaluations, these benchmarks contribute to a deeper understanding of each optimizer's performance, paving the way for future innovations and accelerated progress in the field.

## Hyperparameter Tuning

