Expanded the algorithms section and added references
profvjreddi committed Nov 30, 2023
1 parent c7be42a commit 20fe3b7
Showing 2 changed files with 78 additions and 9 deletions.
30 changes: 30 additions & 0 deletions references.bib
@@ -16,6 +16,13 @@ @inproceedings{abadi2016deep
pages = {308--318},
}

@article{ruder2016overview,
  title = {An overview of gradient descent optimization algorithms},
  author = {Ruder, Sebastian},
  journal = {arXiv preprint arXiv:1609.04747},
  year = {2016},
}

@inproceedings{abadi2016tensorflow,
title = {{TensorFlow}: a system for {Large-Scale} machine learning},
author = {Abadi, Mart{\'\i}n and Barham, Paul and Chen, Jianmin and Chen, Zhifeng and Davis, Andy and Dean, Jeffrey and Devin, Matthieu and Ghemawat, Sanjay and Irving, Geoffrey and Isard, Michael and others},
@@ -33,6 +40,29 @@ @inproceedings{adolf2016fathom
organization = {IEEE},
}

@inproceedings{DBLP:journals/corr/KingmaB14,
author = {Diederik P. Kingma and
Jimmy Ba},
editor = {Yoshua Bengio and
Yann LeCun},
title = {Adam: {A} Method for Stochastic Optimization},
booktitle = {3rd International Conference on Learning Representations, {ICLR} 2015,
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings},
year = {2015},
url = {http://arxiv.org/abs/1412.6980},
timestamp = {Thu, 25 Jul 2019 14:25:37 +0200},
biburl = {https://dblp.org/rec/journals/corr/KingmaB14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{dahl2023benchmarking,
  title = {Benchmarking Neural Network Training Algorithms},
  author = {Dahl, George E and Schneider, Frank and Nado, Zachary and Agarwal, Naman and Sastry, Chandramouli Shama and Hennig, Philipp and Medapati, Sourabh and Eschenhagen, Runa and Kasimbeg, Priya and Suo, Daniel and others},
  journal = {arXiv preprint arXiv:2306.07179},
  year = {2023},
}

@article{afib,
title = {Mobile Photoplethysmographic Technology to Detect Atrial Fibrillation},
author = {Yutao Guo and Hao Wang and Hui Zhang and Tong Liu and Zhaoguang Liang and Yunlong Xia and Li Yan and Yunli Xing and Haili Shi and Shuyan Li and Yanxia Liu and Fan Liu and Mei Feng and Yundai Chen and Gregory Y.H. Lip and null null},
57 changes: 48 additions & 9 deletions training.qmd
@@ -413,20 +413,59 @@ A good approach is to keep the validation set use minimal - hyperparameters can

Care should be taken not to overfit when assessing performance on the validation set. Tradeoffs are needed to build models that perform well on the overall population rather than being overly tuned to the validation samples.

## Optimization Algorithms

Stochastic gradient descent (SGD) is a simple yet powerful optimization algorithm commonly used to train machine learning models. SGD estimates the gradient of the loss function with respect to the model parameters from a single training example and then updates the parameters in the direction that reduces the loss. While the basic update is straightforward, finding a set of parameters that minimizes the loss across the entire dataset can be difficult because the loss landscape of a neural network is nonconvex.
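
To make the update rule concrete, here is a minimal NumPy sketch of a single-example SGD step for a linear model with squared-error loss. The names (`sgd_step`, `w`, `lr`) and the toy data are illustrative, not from any particular library.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.05):
    """One SGD update for a linear model with squared-error loss.

    w  : parameter vector, shape (d,)
    x  : a single training example, shape (d,)
    y  : scalar target
    lr : learning rate (step size)
    """
    y_hat = x @ w                 # model prediction
    grad = 2.0 * (y_hat - y) * x  # gradient of (y_hat - y)^2 with respect to w
    return w - lr * grad          # step in the direction that reduces the loss

# Toy usage: recover w* = [2, -3] from noisy samples, one example at a time.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)
    y = 2.0 * x[0] - 3.0 * x[1] + 0.01 * rng.normal()
    w = sgd_step(w, x, y, lr=0.05)
print(w)  # approximately [2, -3]
```

With too large a learning rate the same loop diverges, and with a much smaller one it converges very slowly, which is exactly the tuning difficulty discussed next.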

While conceptually straightforward, SGD suffers from a few shortcomings. First, choosing a proper learning rate can be difficult: too small and progress is very slow, too large and the parameters may oscillate and fail to converge. Second, SGD treats all parameters equally and independently, which may not be ideal in all cases. Finally, vanilla SGD uses only first-order gradient information, which results in slow progress on ill-conditioned problems.

### Optimizations

To address these shortcomings, various optimizations of vanilla SGD have been proposed over the years. These optimizers aim to improve the efficiency and convergence speed of training by adjusting learning rates, incorporating momentum, and implementing adaptive strategies, among other techniques. @ruder2016overview gives an excellent overview of the different optimizers. Briefly, several commonly used SGD optimization techniques include:

**Momentum:** Accumulates a velocity vector in directions of persistent gradient across iterations. This dampens oscillations and maintains progress along consistent directions, helping to accelerate convergence.

**Nesterov Accelerated Gradient (NAG):** A variant of momentum that computes gradients at the "look ahead" position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.

**Adagrad:** An adaptive learning rate algorithm that maintains a per-parameter learning rate, scaled down in proportion to the historical sum of squared gradients for that parameter. This helps eliminate the need to manually tune learning rates.

**Adadelta:** A modification of Adagrad that restricts the window of accumulated past gradients, thus reducing the aggressive decay of learning rates.

**RMSProp:** Divides the learning rate by an exponentially decaying average of squared gradients. This has a normalizing effect similar to Adagrad but, because it does not accumulate gradients over the entire history, it avoids the rapid decay of learning rates.

**Adam:** Combines momentum with RMSProp-style adaptive learning rates, keeping exponentially decaying averages of both past gradients and past squared gradients [@DBLP:journals/corr/KingmaB14]. It displays very fast initial progress and automatically tunes step sizes.
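
To make these update rules concrete, the sketch below implements the momentum and Adam updates in NumPy, following the descriptions above. The helper names (`momentum_step`, `adam_step`) and the default hyperparameter values are illustrative choices, not a library API.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity vector along persistent gradient directions."""
    v = beta * v + grad   # exponentially weighted accumulation of gradients
    w = w - lr * v
    return w, v

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum (first moment) plus RMSProp-style scaling (second moment)."""
    m = beta1 * m + (1 - beta1) * grad       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early iterations
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w, with momentum.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, v, 2.0 * w, lr=0.01, beta=0.9)
print(w)  # approximately [0, 0]; adam_step can be swapped in the same way
```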

Of these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, often outperforming vanilla SGD in both training speed and final performance. Other optimizers may be better suited in some cases, particularly for simpler models.

### Trade-offs

The table below summarizes the pros and cons of the main optimization algorithms for neural network training:

| Algorithm | Pros | Cons |
|---|---|---|
| Momentum | Faster convergence due to acceleration along persistent gradient directions; less oscillation than vanilla SGD | Requires tuning of the momentum parameter |
| Nesterov Accelerated Gradient (NAG) | Faster than standard momentum in some cases; anticipatory updates help prevent overshooting | More complex to understand intuitively |
| Adagrad | Eliminates the need to manually tune learning rates; performs well on sparse gradients | Learning rate may decay too quickly on dense gradients |
| Adadelta | Less aggressive learning rate decay than Adagrad | Still sensitive to the initial learning rate value |
| RMSProp | Automatically adjusts learning rates; works well in practice | Base learning rate and decay rate still need to be chosen |
| Adam | Combines momentum and adaptive learning rates; efficient and fast convergence | Slightly worse generalization performance in some cases |
| AMSGrad | Improvement on Adam that addresses its generalization issue | Not as extensively used/tested as Adam |

### Benchmarking Algorithms

No single method is best for all problem types, so comprehensive benchmarking is needed to identify the most effective optimizer for a given dataset and model. The performance of algorithms like Adam, RMSProp, and Momentum varies with factors such as batch size, learning rate schedule, model architecture, data distribution, and regularization. These variations underline the importance of evaluating each optimizer under diverse conditions.

Take Adam, for example, which often excels in computer vision tasks, while RMSProp may generalize better in certain natural language processing tasks. Momentum's strength lies in its acceleration when gradient directions are consistent, whereas Adagrad's adaptive learning rates are better suited to sparse-gradient problems.

This wide array of interactions among optimizers, models, and data demonstrates the difficulty of declaring a single, universally superior algorithm. Each optimizer has unique strengths, making it crucial to evaluate a range of methods empirically and discover the conditions under which each performs best.

A comprehensive benchmarking approach should assess not just the speed of convergence but also generalization error, stability, sensitivity to hyperparameters, and computational efficiency. This entails monitoring training and validation learning curves across multiple runs and comparing optimizers on a variety of datasets and models to understand their strengths and weaknesses.
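
As a small-scale illustration of this kind of comparison (not the AlgoPerf harness itself), the PyTorch sketch below trains the same small model with several optimizers and records their training-loss curves. The architecture, synthetic dataset, learning rates, and epoch count are placeholder choices; a real benchmark would also tune hyperparameters, track validation metrics, and repeat runs across seeds.

```python
import torch
import torch.nn as nn

def make_model():
    # Re-seed so every optimizer starts from an identical initialization.
    torch.manual_seed(0)
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Synthetic regression data standing in for a real dataset.
torch.manual_seed(1)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)

optimizers = {
    "sgd":      lambda params: torch.optim.SGD(params, lr=0.01),
    "momentum": lambda params: torch.optim.SGD(params, lr=0.01, momentum=0.9),
    "rmsprop":  lambda params: torch.optim.RMSprop(params, lr=0.001),
    "adam":     lambda params: torch.optim.Adam(params, lr=0.001),
}

loss_curves = {}
for name, make_optimizer in optimizers.items():
    model, loss_fn = make_model(), nn.MSELoss()
    optimizer = make_optimizer(model.parameters())
    curve = []
    for epoch in range(100):          # full-batch updates for simplicity
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        curve.append(loss.item())
    loss_curves[name] = curve         # learning curve for later plotting/comparison
    print(f"{name:9s} final training loss: {curve[-1]:.4f}")
```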

AlgoPerf, introduced by @dahl2023benchmarking, addresses the need for a robust benchmarking system. This platform evaluates optimizer performance using criteria such as training loss curves, generalization error, sensitivity to hyperparameters, and computational efficiency. AlgoPerf tests various optimization methods, including Adam, LAMB, and Adafactor, across different model types like CNNs and RNNs/LSTMs on established datasets. It utilizes containerization and automatic metric collection to minimize inconsistencies and allows for controlled experiments across thousands of configurations, providing a reliable basis for comparing different optimizers.

The insights gained from AlgoPerf and similar benchmarks are invaluable for guiding the optimal choice or tuning of optimizers. By enabling reproducible evaluations, these benchmarks contribute to a deeper understanding of each optimizer's performance, paving the way for future innovations and accelerated progress in the field.

## Hyperparameter Tuning

