<h1 id="papers-that-use-sparsity-in-deep-learning">Papers that use sparsity in deep learning</h1>
<p>This is a list of papers curated for the paper “Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks”.</p>
<p>The following list is automatically generated from <code>sparsity.bib</code>. To contribute to this list, please open a Pull Request that adds new BibTeX entries to that file.</p>
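<p>For example, a new entry added to <code>sparsity.bib</code> could look like the sketch below; the key, authors, title, and URL are illustrative placeholders, not an actual entry from the file.</p>
<pre><code>% Illustrative placeholder entry; replace every field with the real paper's metadata.
@article{2021-example-sparsity,
  author = {Doe, Jane and Smith, John},
  title  = {An Illustrative Entry on Neural Network Pruning},
  year   = {2021},
  url    = {https://arxiv.org/abs/XXXX.XXXXX}
}</code></pre>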
<h2 id="papers" class="unnumbered">Papers</h2>
<div id="refs" class="references">
<div id="ref-achille2019critical">
<p>Achille, Alessandro, Matteo Rovere, and Stefano Soatto. 2019. “Critical Learning Periods in Deep Neural Networks.” <a href="http://arxiv.org/abs/1711.08856">http://arxiv.org/abs/1711.08856</a>.</p>
</div>
<div id="ref-2020-afghan">
<p>Afghan, Sher, and Uwe Naumann. 2020. “Interval Adjoint Significance Analysis for Neural Networks.” In <em>International Conference on Computational Science</em>, 365–78. Springer.</p>
</div>
<div id="ref-2016-aghasi">
<p>Aghasi, Alireza, Afshin Abdi, Nam Nguyen, and Justin Romberg. 2017. “Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee.” <a href="http://arxiv.org/abs/1611.05162">http://arxiv.org/abs/1611.05162</a>.</p>
</div>
<div id="ref-ahmad2019dense">
<p>Ahmad, Subutai, and Luiz Scheinkman. 2019. “How Can We Be so Dense? The Benefits of Using Highly Sparse Representations.” <a href="http://arxiv.org/abs/1903.11257">http://arxiv.org/abs/1903.11257</a>.</p>
</div>
<div id="ref-2017-aji">
<p>Aji, Alham Fikri, and Kenneth Heafield. 2017. “Sparse Communication for Distributed Gradient Descent.” In <em>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</em>, 440–45. <a href="http://arxiv.org/abs/1704.05021">http://arxiv.org/abs/1704.05021</a>.</p>
</div>
<div id="ref-2016-albericio">
<p>Albericio, J., P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. 2016. “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.” In <em>2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>, 1–13. <a href="https://doi.org/10.1109/ISCA.2016.11">https://doi.org/10.1109/ISCA.2016.11</a>.</p>
</div>
<div id="ref-alistarh2017qsgd">
<p>Alistarh, Dan, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding.” <a href="http://arxiv.org/abs/1610.02132">http://arxiv.org/abs/1610.02132</a>.</p>
</div>
<div id="ref-2018-alistarh">
<p>Alistarh, Dan, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. “The Convergence of Sparsified Gradient Methods.” In <em>Advances in Neural Information Processing Systems</em>, 5973–83. <a href="http://arxiv.org/abs/1809.10505">http://arxiv.org/abs/1809.10505</a>.</p>
</div>
<div id="ref-allenzhu2019convergence">
<p>Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2019. “A Convergence Theory for Deep Learning via over-Parameterization.” <a href="http://arxiv.org/abs/1811.03962">http://arxiv.org/abs/1811.03962</a>.</p>
</div>
<div id="ref-almahairi2016dynamic">
<p>Almahairi, Amjad, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. “Dynamic Capacity Networks.” <a href="http://arxiv.org/abs/1511.07838">http://arxiv.org/abs/1511.07838</a>.</p>
</div>
<div id="ref-2017-alvarez">
<p>Alvarez, Jose M., and Mathieu Salzmann. 2017. “Compression-Aware Training of Deep Networks.” <a href="http://arxiv.org/abs/1711.02638">http://arxiv.org/abs/1711.02638</a>.</p>
</div>
<div id="ref-2016-alwani">
<p>Alwani, Manoj, Han Chen, Michael Ferdman, and Peter Milder. 2016. “Fused-Layer CNN Accelerators.” In <em>The 49th Annual IEEE/ACM International Symposium on Microarchitecture</em>, 22. IEEE Press.</p>
</div>
<div id="ref-1998-amari">
<p>Amari, Shun-ichi. 1998. “Natural Gradient Works Efficiently in Learning.” <em>Neural Computation</em> 10 (2): 251–76. <a href="https://doi.org/10.1162/089976698300017746">https://doi.org/10.1162/089976698300017746</a>.</p>
</div>
<div id="ref-2015-anwar">
<p>Anwar, Sajid, Kyuyeon Hwang, and Wonyong Sung. 2017. “Structured Pruning of Deep Convolutional Neural Networks.” <em>ACM Journal on Emerging Technologies in Computing Systems (JETC)</em> 13 (3): 1–18.</p>
</div>
<div id="ref-2020-atashgahi">
<p>Atashgahi, Zahra, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2020. “Quick and Robust Feature Selection: The Strength of Energy-Efficient Sparse Training for Autoencoders.” <a href="http://arxiv.org/abs/2012.00560">http://arxiv.org/abs/2012.00560</a>.</p>
</div>
<div id="ref-2020-azarian">
<p>Azarian, Kambiz, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. 2020. “Learned Threshold Pruning.” <a href="http://arxiv.org/abs/2003.00075">http://arxiv.org/abs/2003.00075</a>.</p>
</div>
<div id="ref-2016-ba">
<p>Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations.”</p>
</div>
<div id="ref-2016-ba-layernorm">
<p>Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. “Layer Normalization.” <a href="http://arxiv.org/abs/1607.06450">http://arxiv.org/abs/1607.06450</a>.</p>
</div>
<div id="ref-2020-baalen">
<p>Baalen, Mart van, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. 2020. “Bayesian Bits: Unifying Quantization and Pruning.” <a href="http://arxiv.org/abs/2005.07093">http://arxiv.org/abs/2005.07093</a>.</p>
</div>
<div id="ref-2013-baldi">
<p>Baldi, Pierre, and Peter J Sadowski. 2013. “Understanding Dropout.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 26:2814–22. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">https://proceedings.neurips.cc/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf</a>.</p>
</div>
<div id="ref-2019-bartoldson">
<p>Bartoldson, Brian R., Ari S. Morcos, Adrian Barbu, and Gordon Erlebacher. 2020. “The Generalization-Stability Tradeoff in Neural Network Pruning.” <a href="http://arxiv.org/abs/1906.03728">http://arxiv.org/abs/1906.03728</a>.</p>
</div>
<div id="ref-2020-basu">
<p>Basu, Debraj, Deepesh Data, Can Karakus, and Suhas N Diggavi. 2020. “Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations.” <em>IEEE Journal on Selected Areas in Information Theory</em> 1 (1): 217–26. <a href="http://arxiv.org/abs/1906.02367">http://arxiv.org/abs/1906.02367</a>.</p>
</div>
<div id="ref-2018-baykal">
<p>Baykal, Cenk, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. 2018. “Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds.” <em>arXiv Preprint arXiv:1804.05345</em>.</p>
</div>
<div id="ref-ista">
<p>Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” <em>SIAM J. Img. Sci.</em> 2 (1): 183–202. <a href="https://doi.org/10.1137/080716542">https://doi.org/10.1137/080716542</a>.</p>
</div>
<div id="ref-2018-bellec">
<p>Bellec, Guillaume, David Kappel, Wolfgang Maass, and Robert Legenstein. 2018. “Deep Rewiring: Training Very Sparse Deep Networks.” <a href="http://arxiv.org/abs/1711.05136">http://arxiv.org/abs/1711.05136</a>.</p>
</div>
<div id="ref-beltagy2020longformer">
<p>Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. “Longformer: The Long-Document Transformer.” <a href="http://arxiv.org/abs/2004.05150">http://arxiv.org/abs/2004.05150</a>.</p>
</div>
<div id="ref-bengio2016conditional">
<p>Bengio, Emmanuel, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. “Conditional Computation in Neural Networks for Faster Models.” <a href="http://arxiv.org/abs/1511.06297">http://arxiv.org/abs/1511.06297</a>.</p>
</div>
<div id="ref-2013-bengio">
<p>Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. 2013. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.” <a href="http://arxiv.org/abs/1308.3432">http://arxiv.org/abs/1308.3432</a>.</p>
</div>
<div id="ref-bennun2019modular">
<p>Ben-Nun, Tal, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. 2019. “A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.” <a href="http://arxiv.org/abs/1901.10183">http://arxiv.org/abs/1901.10183</a>.</p>
</div>
<div id="ref-bennun2018demystifying">
<p>Ben-Nun, Tal, and Torsten Hoefler. 2018. “Demystifying Parallel and Distributed Deep Learning: An in-Depth Concurrency Analysis.” <a href="http://arxiv.org/abs/1802.09941">http://arxiv.org/abs/1802.09941</a>.</p>
</div>
<div id="ref-betzel2017modular">
<p>Betzel, Richard F, John D Medaglia, Lia Papadopoulos, Graham L Baum, Ruben Gur, Raquel Gur, David Roalf, Theodore D Satterthwaite, and Danielle S Bassett. 2017. “The Modular Organization of Human Anatomical Brain Networks: Accounting for the Cost of Wiring.” <em>Network Neuroscience</em> 1 (1): 42–68.</p>
</div>
<div id="ref-2018-bianco">
<p>Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. “Benchmark Analysis of Representative Deep Neural Network Architectures.” <em>IEEE Access</em> 6: 64270–7. <a href="https://doi.org/10.1109/access.2018.2877890">https://doi.org/10.1109/access.2018.2877890</a>.</p>
</div>
<div id="ref-2020-blalock">
<p>Blalock, Davis, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. “What Is the State of Neural Network Pruning?” <a href="http://arxiv.org/abs/2003.03033">http://arxiv.org/abs/2003.03033</a>.</p>
</div>
<div id="ref-2017-bourely">
<p>Bourely, Alfred, John Patrick Boueri, and Krzysztof Choromonski. 2017. “Sparse Neural Networks Topologies.” <a href="http://arxiv.org/abs/1706.05683">http://arxiv.org/abs/1706.05683</a>.</p>
</div>
<div id="ref-gpt-3">
<p>Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In <em>Advances in Neural Information Processing Systems</em>. <a href="http://arxiv.org/abs/2005.14165">http://arxiv.org/abs/2005.14165</a>.</p>
</div>
<div id="ref-brutzkus2017sgd">
<p>Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2017. “SGD Learns over-Parameterized Networks That Provably Generalize on Linearly Separable Data.” <a href="http://arxiv.org/abs/1710.10174">http://arxiv.org/abs/1710.10174</a>.</p>
</div>
<div id="ref-713928">
<p>Burrascano, P. 1993. “A Pruning Technique Maximizing Generalization.” In <em>Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan)</em>, 1:347–50 vol.1. <a href="https://doi.org/10.1109/IJCNN.1993.713928">https://doi.org/10.1109/IJCNN.1993.713928</a>.</p>
</div>
<div id="ref-2018-carreira">
<p>Carreira-Perpinan, M. A., and Y. Idelbayev. 2018. “‘Learning-Compression’ Algorithms for Neural Net Pruning.” In <em>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, 8532–41. <a href="https://doi.org/10.1109/CVPR.2018.00890">https://doi.org/10.1109/CVPR.2018.00890</a>.</p>
</div>
<div id="ref-1997-castellano">
<p>Castellano, G., A. M. Fanelli, and M. Pelillo. 1997. “An Iterative Pruning Algorithm for Feedforward Neural Networks.” <em>IEEE Transactions on Neural Networks</em> 8 (3): 519–31. <a href="https://doi.org/10.1109/72.572092">https://doi.org/10.1109/72.572092</a>.</p>
</div>
<div id="ref-2000-castellano">
<p>Castellano, Giovanna, and Anna Maria Fanelli. 2000. “Variable Selection Using Neural-Network Models.” <em>Neurocomputing</em> 31 (1-4): 1–13.</p>
</div>
<div id="ref-2000-chandrasekaran">
<p>Chandrasekaran, Hema, Hung-Han Chen, and Michael T. Manry. 2000. “Pruning of Basis Functions in Nonlinear Approximators.” <em>Neurocomputing</em> 34 (1): 29–53. <a href="https://doi.org/10.1016/S0925-2312(00)00311-8">https://doi.org/10.1016/S0925-2312(00)00311-8</a>.</p>
</div>
<div id="ref-2017-changpinyo">
<p>Changpinyo, Soravit, Mark Sandler, and Andrey Zhmoginov. 2017. “The Power of Sparsity in Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1702.06257">http://arxiv.org/abs/1702.06257</a>.</p>
</div>
<div id="ref-2020-chao">
<p>Chao, Shih-Kang, Zhanyu Wang, Yue Xing, and Guang Cheng. 2020. “Directional Pruning of Deep Neural Networks.” <a href="http://arxiv.org/abs/2006.09358">http://arxiv.org/abs/2006.09358</a>.</p>
</div>
<div id="ref-1988-chauvin">
<p>Chauvin, Yves. 1989. “A Back-Propagation Algorithm with Optimal Use of Hidden Units.” In <em>Advances in Neural Information Processing Systems 1</em>, 519–26. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-im2col">
<p>Chellapilla, Kumar, Sidd Puri, and Patrice Simard. 2006. “High Performance Convolutional Neural Networks for Document Processing.” In.</p>
</div>
<div id="ref-2018-chen">
<p>Chen, Chia-Yu, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2017. “AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training.” In <em>32nd AAAI Conference on Artificial Intelligence</em>, 2827–35. <a href="http://arxiv.org/abs/1712.02679">http://arxiv.org/abs/1712.02679</a>.</p>
</div>
<div id="ref-2020-chen-rl">
<p>Chen, Jianda, Shangyu Chen, and Sinno Jialin Pan. 2020. “Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning.” <em>Advances in Neural Information Processing Systems</em> 33.</p>
</div>
<div id="ref-2020-chen">
<p>Chen, Tianlong, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. “The Lottery Ticket Hypothesis for Pre-Trained BERT Networks.” <a href="http://arxiv.org/abs/2007.12223">http://arxiv.org/abs/2007.12223</a>.</p>
</div>
<div id="ref-2016-chen">
<p>Chen, Y., T. Krishna, J. S. Emer, and V. Sze. 2017. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.” <em>IEEE Journal of Solid-State Circuits</em> 52 (1): 127–38. <a href="https://doi.org/10.1109/JSSC.2016.2616357">https://doi.org/10.1109/JSSC.2016.2616357</a>.</p>
</div>
<div id="ref-2019-chen">
<p>Chen, Yu-Hsin, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. “Eyeriss V2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices.” <a href="http://arxiv.org/abs/1807.07928">http://arxiv.org/abs/1807.07928</a>.</p>
</div>
<div id="ref-cheng2020survey">
<p>Cheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2020. “A Survey of Model Compression and Acceleration for Deep Neural Networks.” <a href="http://arxiv.org/abs/1710.09282">http://arxiv.org/abs/1710.09282</a>.</p>
</div>
<div id="ref-2019-abdellatif">
<p>Chérief-Abdellatif, Badr-Eddine. 2019. “Convergence Rates of Variational Inference in Sparse Deep Learning.” <a href="http://arxiv.org/abs/1908.04847">http://arxiv.org/abs/1908.04847</a>.</p>
</div>
<div id="ref-2014-chetlur">
<p>Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: Efficient Primitives for Deep Learning.” <a href="http://arxiv.org/abs/1410.0759">http://arxiv.org/abs/1410.0759</a>.</p>
</div>
<div id="ref-child2019generating">
<p>Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. “Generating Long Sequences with Sparse Transformers.” <a href="http://arxiv.org/abs/1904.10509">http://arxiv.org/abs/1904.10509</a>.</p>
</div>
<div id="ref-2020-cho">
<p>Cho, Minsu, Ameya Joshi, and Chinmay Hegde. 2020. “ESPN: Extremely Sparse Pruned Networks.” <a href="http://arxiv.org/abs/2006.15741">http://arxiv.org/abs/2006.15741</a>.</p>
</div>
<div id="ref-choudhary2020comprehensive">
<p>Choudhary, Tejalal, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. “A Comprehensive Survey on Model Compression and Acceleration.” <em>Artificial Intelligence Review</em>, 1–43.</p>
</div>
<div id="ref-1996-cibas">
<p>Cibas, Tautvydas, Françoise Fogelman Soulié, Patrick Gallinari, and Sarunas Raudys. 1996. “Variable Selection with Neural Networks.” <em>Neurocomputing</em> 12 (2): 223–48. <a href="https://doi.org/10.1016/0925-2312(95)00121-2">https://doi.org/10.1016/0925-2312(95)00121-2</a>.</p>
</div>
<div id="ref-2016-cohen">
<p>Cohen, Joseph Paul, Henry Z. Lo, and Wei Ding. 2017. “RandomOut: Using a Convolutional Gradient Norm to Rescue Convolutional Filters.” <a href="http://arxiv.org/abs/1602.05931">http://arxiv.org/abs/1602.05931</a>.</p>
</div>
<div id="ref-2014-collins">
<p>Collins, Maxwell D., and Pushmeet Kohli. 2014. “Memory Bounded Deep Convolutional Networks.” <em>CoRR</em> abs/1412.1442. <a href="http://arxiv.org/abs/1412.1442">http://arxiv.org/abs/1412.1442</a>.</p>
</div>
<div id="ref-2019-correia">
<p>Correia, Gonçalo M, Vlad Niculae, and André FT Martins. 2019. “Adaptively Sparse Transformers.” In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>. <a href="http://arxiv.org/abs/1909.00015">http://arxiv.org/abs/1909.00015</a>.</p>
</div>
<div id="ref-cosentino2019search">
<p>Cosentino, Justin, Federico Zaiter, Dan Pei, and Jun Zhu. 2019. “The Search for Sparse, Robust Neural Networks.” <a href="http://arxiv.org/abs/1912.02386">http://arxiv.org/abs/1912.02386</a>.</p>
</div>
<div id="ref-2019-cui">
<p>Cui, Baiyun, Yingming Li, Ming Chen, and Zhongfei Zhang. 2019. “Fine-Tune BERT with Sparse Self-Attention Mechanism.” In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>, 3539–44.</p>
</div>
<div id="ref-2018-dai">
<p>Dai, Bin, Chen Zhu, and David Wipf. 2018. “Compressing Neural Networks Using the Variational Information Bottleneck.” <a href="http://arxiv.org/abs/1802.10399">http://arxiv.org/abs/1802.10399</a>.</p>
</div>
<div id="ref-2019-dai">
<p>Dai, Xiaoliang, Hongxu Yin, and Niraj K. Jha. 2018. “NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm.” <a href="http://arxiv.org/abs/1711.02017">http://arxiv.org/abs/1711.02017</a>.</p>
</div>
<div id="ref-2020-dascoli">
<p>d’Ascoli, Stéphane, Levent Sagun, Joan Bruna, and Giulio Biroli. 2020. “Finding the Needle in the Haystack with Convolutions: On the Benefits of Architectural Bias.” <a href="http://arxiv.org/abs/1906.06766">http://arxiv.org/abs/1906.06766</a>.</p>
</div>
<div id="ref-dave2020hardware">
<p>Dave, Shail, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. 2020. “Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights.” <a href="http://arxiv.org/abs/2007.00864">http://arxiv.org/abs/2007.00864</a>.</p>
</div>
<div id="ref-2020-davies">
<p>Davies, Peter, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. 2020. “Distributed Variance Reduction with Optimal Communication.” <a href="http://arxiv.org/abs/2002.09268">http://arxiv.org/abs/2002.09268</a>.</p>
</div>
<div id="ref-9043731">
<p>Deng, L., G. Li, S. Han, L. Shi, and Y. Xie. 2020. “Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey.” <em>Proceedings of the IEEE</em> 108 (4): 485–532. <a href="https://doi.org/10.1109/JPROC.2020.2976475">https://doi.org/10.1109/JPROC.2020.2976475</a>.</p>
</div>
<div id="ref-denil2014predicting">
<p>Denil, Misha, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. 2014. “Predicting Parameters in Deep Learning.” <a href="http://arxiv.org/abs/1306.0543">http://arxiv.org/abs/1306.0543</a>.</p>
</div>
<div id="ref-2014-denton">
<p>Denton, Emily L, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. “Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation.” <em>Advances in Neural Information Processing Systems</em> 27: 1269–77.</p>
</div>
<div id="ref-2019-dettmers">
<p>Dettmers, Tim, and Luke Zettlemoyer. 2019. “Sparse Networks from Scratch: Faster Training Without Losing Performance.” <a href="http://arxiv.org/abs/1907.04840">http://arxiv.org/abs/1907.04840</a>.</p>
</div>
<div id="ref-de2017ultrastructural">
<p>De Vivo, Luisa, Michele Bellesi, William Marshall, Eric A Bushong, Mark H Ellisman, Giulio Tononi, and Chiara Cirelli. 2017. “Ultrastructural Evidence for Synaptic Scaling Across the Wake/Sleep Cycle.” <em>Science</em> 355 (6324): 507–10.</p>
</div>
<div id="ref-2019-devlin">
<p>Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</em>, 4171–86.</p>
</div>
<div id="ref-2018-dey">
<p>Dey, S., K. Huang, P. A. Beerel, and K. M. Chugg. 2019. “Pre-Defined Sparse Neural Networks with Hardware Acceleration.” <em>IEEE Journal on Emerging and Selected Topics in Circuits and Systems</em> 9 (2): 332–45. <a href="https://doi.org/10.1109/JETCAS.2019.2910864">https://doi.org/10.1109/JETCAS.2019.2910864</a>.</p>
</div>
<div id="ref-diering2017homer1a">
<p>Diering, Graham H, Raja S Nirujogi, Richard H Roth, Paul F Worley, Akhilesh Pandey, and Richard L Huganir. 2017. “Homer1a Drives Homeostatic Scaling-down of Excitatory Synapses During Sleep.” <em>Science</em> 355 (6324): 511–15.</p>
</div>
<div id="ref-2019-ding-c-sgd">
<p>Ding, Xiaohan, Guiguang Ding, Yuchen Guo, and Jungong Han. 2019. “Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure.” <a href="http://arxiv.org/abs/1904.03837">http://arxiv.org/abs/1904.03837</a>.</p>
</div>
<div id="ref-2019-ding">
<p>Ding, Xiaohan, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. 2019. “Global Sparse Momentum SGD for Pruning Very Deep Neural Networks.” <a href="http://arxiv.org/abs/1909.12778">http://arxiv.org/abs/1909.12778</a>.</p>
</div>
<div id="ref-2005-dolan">
<p>Dolan, William B, and Chris Brockett. 2005. “Automatically Constructing a Corpus of Sentential Paraphrases.” In <em>Proceedings of the Third International Workshop on Paraphrasing (IWP2005)</em>.</p>
</div>
<div id="ref-domingos2020model">
<p>Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” <a href="http://arxiv.org/abs/2012.00152">http://arxiv.org/abs/2012.00152</a>.</p>
</div>
<div id="ref-2019-dong">
<p>Dong, Xiao, Lei Liu, Guangli Li, Jiansong Li, Peng Zhao, Xueying Wang, and Xiaobing Feng. 2019. “Exploiting the Input Sparsity to Accelerate Deep Neural Networks: Poster.” In <em>Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019</em>, 401–2. <a href="https://doi.org/10.1145/3293883.3295713">https://doi.org/10.1145/3293883.3295713</a>.</p>
</div>
<div id="ref-2017-dong">
<p>Dong, Xin, Shangyu Chen, and Sinno Jialin Pan. 2017. “Learning to Prune Deep Neural Networks via Layer-Wise Optimal Brain Surgeon.” <a href="http://arxiv.org/abs/1705.07565">http://arxiv.org/abs/1705.07565</a>.</p>
</div>
<div id="ref-2020-dosovitskiy">
<p>Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In <em>Proceedings of the Ninth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/2010.11929">http://arxiv.org/abs/2010.11929</a>.</p>
</div>
<div id="ref-2016-dryden">
<p>Dryden, Nikoli, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. 2016. “Communication Quantization for Data-Parallel Training of Deep Neural Networks.” In <em>2nd Workshop on Machine Learning in HPC Environments (MLHPC)</em>, 1–8.</p>
</div>
<div id="ref-du2019gradient">
<p>Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. “Gradient Descent Provably Optimizes over-Parameterized Neural Networks.” <a href="http://arxiv.org/abs/1810.02054">http://arxiv.org/abs/1810.02054</a>.</p>
</div>
<div id="ref-2020-dutta">
<p>Dutta, Aritra, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. 2020. “On the Discrepancy Between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning.” In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 34:3817–24. 04. <a href="http://arxiv.org/abs/1911.08250">http://arxiv.org/abs/1911.08250</a>.</p>
</div>
<div id="ref-2020-elsen">
<p>Elsen, Erich, Marat Dukhan, Trevor Gale, and Karen Simonyan. 2019. “Fast Sparse Convnets.” <a href="http://arxiv.org/abs/1911.09723">http://arxiv.org/abs/1911.09723</a>.</p>
</div>
<div id="ref-elsken2019neural">
<p>Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. 2019. “Neural Architecture Search: A Survey.” <a href="http://arxiv.org/abs/1808.05377">http://arxiv.org/abs/1808.05377</a>.</p>
</div>
<div id="ref-1995-engelbrecht">
<p>Engelbrecht, Andries Petrus, Ian Cloete, and Jacek M Zurada. 1995. “Determining the Significance of Input Parameters Using Sensitivity Analysis.” In <em>International Workshop on Artificial Neural Networks</em>, 382–88. Springer.</p>
</div>
<div id="ref-2001-engelbrecht">
<p>Engelbrecht, A. P. 2001. “A New Pruning Heuristic Based on Variance Analysis of Sensitivity Information.” <em>IEEE Transactions on Neural Networks</em> 12 (6): 1386–99. <a href="https://doi.org/10.1109/72.963775">https://doi.org/10.1109/72.963775</a>.</p>
</div>
<div id="ref-1996-engelbrecht">
<p>Engelbrecht, A. P., and I. Cloete. 1996. “A Sensitivity Analysis Algorithm for Pruning Feedforward Neural Networks.” In <em>Proceedings of International Conference on Neural Networks (ICNN’96)</em>, 2:1274–8 vol.2. <a href="https://doi.org/10.1109/ICNN.1996.549081">https://doi.org/10.1109/ICNN.1996.549081</a>.</p>
</div>
<div id="ref-2020-evci">
<p>Evci, Utku, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. “Rigging the Lottery: Making All Tickets Winners.” <a href="http://arxiv.org/abs/1911.11134">http://arxiv.org/abs/1911.11134</a>.</p>
</div>
<div id="ref-2020-evci-gradient-flow">
<p>Evci, Utku, Yani A. Ioannou, Cem Keskin, and Yann Dauphin. 2020. “Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win.” <a href="http://arxiv.org/abs/2010.03533">http://arxiv.org/abs/2010.03533</a>.</p>
</div>
<div id="ref-2020-evci-difficult">
<p>Evci, Utku, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. 2020. “The Difficulty of Training Sparse Neural Networks.” <a href="http://arxiv.org/abs/1906.10732">http://arxiv.org/abs/1906.10732</a>.</p>
</div>
<div id="ref-2020-fan">
<p>Fan, Angela, Edouard Grave, and Armand Joulin. 2020. “Reducing Transformer Depth on Demand with Structured Dropout.” In <em>Proceedings of the Eighth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1909.11556">http://arxiv.org/abs/1909.11556</a>.</p>
</div>
<div id="ref-2021-fedus">
<p>Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” <a href="http://arxiv.org/abs/2101.03961">http://arxiv.org/abs/2101.03961</a>.</p>
</div>
<div id="ref-1992-finnoff">
<p>Finnoff, William, Ferdinand Hergert, and Hans Georg Zimmermann. 1993. “Improving Model Selection by Nonconvergent Methods.” <em>Neural Networks</em> 6 (6): 771–83.</p>
</div>
<div id="ref-1998-fletcher">
<p>Fletcher, L., V. Katkovnik, F. E. Steffens, and A. P. Engelbrecht. 1998. “Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network.” In <em>1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227)</em>, 2:1608–12 vol.2. <a href="https://doi.org/10.1109/IJCNN.1998.686018">https://doi.org/10.1109/IJCNN.1998.686018</a>.</p>
</div>
<div id="ref-2019-frankle">
<p>Frankle, Jonathan, and Michael Carbin. 2019. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” <a href="http://arxiv.org/abs/1803.03635">http://arxiv.org/abs/1803.03635</a>.</p>
</div>
<div id="ref-2020-frankle-linear">
<p>Frankle, Jonathan, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2020a. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” <a href="http://arxiv.org/abs/1912.05671">http://arxiv.org/abs/1912.05671</a>.</p>
</div>
<div id="ref-2019-frankle-b">
<p>———. 2020b. “Stabilizing the Lottery Ticket Hypothesis.” <a href="http://arxiv.org/abs/1903.01611">http://arxiv.org/abs/1903.01611</a>.</p>
</div>
<div id="ref-2020-frankle-missing">
<p>———. 2021. “Pruning Neural Networks at Initialization: Why Are We Missing the Mark?” <a href="http://arxiv.org/abs/2009.08576">http://arxiv.org/abs/2009.08576</a>.</p>
</div>
<div id="ref-2020-frankle-early">
<p>Frankle, Jonathan, David J. Schwab, and Ari S. Morcos. 2020. “The Early Phase of Neural Network Training.” <a href="http://arxiv.org/abs/2002.10365">http://arxiv.org/abs/2002.10365</a>.</p>
</div>
<div id="ref-sparse_group_lasso">
<p>Friedman, J., T. Hastie, and R. Tibshirani. 2010. “A Note on the Group Lasso and a Sparse Group Lasso.” <a href="http://arxiv.org/abs/1001.0736">http://arxiv.org/abs/1001.0736</a>.</p>
</div>
<div id="ref-karl_hierarchical_models">
<p>Friston, K.J. 2008. “Hierarchical Models in the Brain.” <em>PLOS Computational Biology</em> 4 (11): e1000211. <a href="https://doi.org/10.1371/journal.pcbi.1000211">https://doi.org/10.1371/journal.pcbi.1000211</a>.</p>
</div>
<div id="ref-2019-gaier">
<p>Gaier, Adam, and David Ha. 2019. “Weight Agnostic Neural Networks.” <a href="http://arxiv.org/abs/1906.04358">http://arxiv.org/abs/1906.04358</a>.</p>
</div>
<div id="ref-2016-gal">
<p>Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In <em>Proceedings of the 33rd International Conference on Machine Learning</em>, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:1050–9. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. <a href="http://proceedings.mlr.press/v48/gal16.html">http://proceedings.mlr.press/v48/gal16.html</a>.</p>
</div>
<div id="ref-2017-gal">
<p>Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:3581–90. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2017/file/84ddfb34126fc3a48ee38d7044e87276-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/84ddfb34126fc3a48ee38d7044e87276-Paper.pdf</a>.</p>
</div>
<div id="ref-2019-gale">
<p>Gale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of Sparsity in Deep Neural Networks.” <a href="http://arxiv.org/abs/1902.09574">http://arxiv.org/abs/1902.09574</a>.</p>
</div>
<div id="ref-2020-gale">
<p>Gale, Trevor, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. “Sparse GPU Kernels for Deep Learning.” <a href="http://arxiv.org/abs/2006.10901">http://arxiv.org/abs/2006.10901</a>.</p>
</div>
<div id="ref-2020-ganesh">
<p>Ganesh, Prakhar, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. “Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.” <a href="http://arxiv.org/abs/2002.11985">http://arxiv.org/abs/2002.11985</a>.</p>
</div>
<div id="ref-ge2011note">
<p>Ge, Dongdong, Xiaoye Jiang, and Yinyu Ye. 2011. “A Note on the Complexity of Lp Minimization.” <em>Mathematical Programming</em> 129 (2): 285–99.</p>
</div>
<div id="ref-2019-georgiadis">
<p>Georgiadis, Georgios. 2019. “Accelerating Convolutional Neural Networks via Activation Map Compression.” In <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>, 7085–95.</p>
</div>
<div id="ref-2018-ghiasi">
<p>Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V Le. 2018. “DropBlock: A Regularization Method for Convolutional Networks.” In <em>Advances in Neural Information Processing Systems</em>, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 31:10727–37. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2018/file/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Paper.pdf">https://proceedings.neurips.cc/paper/2018/file/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Paper.pdf</a>.</p>
</div>
<div id="ref-1995-ghosh">
<p>Ghosh, Joydeep, and Kagan Tumer. 1994. “Structural Adaptation and Generalization in Supervised Feed-Forward Networks.” <em>J. Artif. Neural Netw.</em> 1 (4): 431–58.</p>
</div>
<div id="ref-2010-glorot-init">
<p>Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In <em>AISTATS</em>, edited by Yee Whye Teh and D. Mike Titterington, 9:249–56. JMLR Proceedings. JMLR.org. <a href="http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GlorotB10">http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GlorotB10</a>.</p>
</div>
<div id="ref-2011-glorot">
<p>Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011a. “Deep Sparse Rectifier Neural Networks.” In <em>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</em>, 315–23.</p>
</div>
<div id="ref-glorot2011deep">
<p>———. 2011b. “Deep Sparse Rectifier Neural Networks.” In <em>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</em>, 315–23.</p>
</div>
<div id="ref-2019-golub">
<p>Golub, Maximilian, Guy Lemieux, and Mieszko Lis. 2019. “Full Deep Neural Network Training on a Pruned Weight Budget.” <a href="http://arxiv.org/abs/1806.06949">http://arxiv.org/abs/1806.06949</a>.</p>
</div>
<div id="ref-2019-gomez">
<p>Gomez, Aidan N., Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. 2019. “Learning Sparse Networks Using Targeted Dropout.” <a href="http://arxiv.org/abs/1905.13678">http://arxiv.org/abs/1905.13678</a>.</p>
</div>
<div id="ref-2020-gondimalla">
<p>Gondimalla, Ashish, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. “SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks.” In <em>Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture</em>, 151–65. MICRO ’52. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3352460.3358291">https://doi.org/10.1145/3352460.3358291</a>.</p>
</div>
<div id="ref-goodfellow2014generative">
<p>Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” <a href="http://arxiv.org/abs/1406.2661">http://arxiv.org/abs/1406.2661</a>.</p>
</div>
<div id="ref-2014-goodfellow">
<p>Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In <em>Advances in Neural Information Processing Systems</em>, 2672–80. <a href="http://arxiv.org/abs/1406.2661">http://arxiv.org/abs/1406.2661</a>.</p>
</div>
<div id="ref-gopalakrishnan2018combating">
<p>Gopalakrishnan, Soorya, Zhinus Marzi, Upamanyu Madhow, and Ramtin Pedarsani. 2018. “Combating Adversarial Attacks Using Sparse Representations.” <a href="http://arxiv.org/abs/1803.03880">http://arxiv.org/abs/1803.03880</a>.</p>
</div>
<div id="ref-2018-gordon">
<p>Gordon, Ariel, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. 2018. “MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks.” In <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>, 1586–95.</p>
</div>
<div id="ref-2020-gordon">
<p>Gordon, Mitchell A., Kevin Duh, and Nicholas Andrews. 2020. “Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning.” In <em>Proceedings of the 5th Workshop on Representation Learning for NLP</em>, 143–55. <a href="http://arxiv.org/abs/2002.08307">http://arxiv.org/abs/2002.08307</a>.</p>
</div>
<div id="ref-2020-groenquist">
<p>Grönquist, Peter, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, and Torsten Hoefler. 2020. “Deep Learning for Post-Processing Ensemble Weather Forecasts.” <a href="http://arxiv.org/abs/2005.08748">http://arxiv.org/abs/2005.08748</a>.</p>
</div>
<div id="ref-UsingAdvancedMPI">
<p>Gropp, William, Torsten Hoefler, Rajeev Thakur, and E. Lusk. 2014. <em>Using Advanced MPI: Modern Features of the Message-Passing Interface</em>. Cambridge, MA: MIT Press.</p>
</div>
<div id="ref-gropp-datatype-performance">
<p>Gropp, William, Torsten Hoefler, Rajeev Thakur, and Jesper Larsson Träff. 2011. “Performance Expectations and Guidelines for MPI Derived Datatypes.” In <em>Recent Advances in the Message Passing Interface (EuroMPI’11)</em>, 6960:150–59. Santorini, Greece: Springer.</p>
</div>
<div id="ref-mdl">
<p>Grunwald, Peter. 2004. “A Tutorial Introduction to the Minimum Description Length Principle.” <a href="http://arxiv.org/abs/math/0406077">http://arxiv.org/abs/math/0406077</a>.</p>
</div>
<div id="ref-2007-grunwald">
<p>Grünwald, Peter D. 2007. <em>The Minimum Description Length Principle</em>. MIT Press.</p>
</div>
<div id="ref-grunwald2007minimum">
<p>Grünwald, Peter D, and Abhijit Grunwald. 2007. <em>The Minimum Description Length Principle</em>. MIT Press.</p>
</div>
<div id="ref-2018-gudovskiy">
<p>Gudovskiy, Denis, Alec Hodgkinson, and Luca Rigazio. 2018. “DNN Feature Map Compression Using Learned Representation over GF(2).” In <em>Proceedings of the European Conference on Computer Vision (ECCV)</em>.</p>
</div>
<div id="ref-2020-guerra">
<p>Guerra, Luis, Bohan Zhuang, Ian Reid, and Tom Drummond. 2020. “Automatic Pruning for Quantized Neural Networks.” <a href="http://arxiv.org/abs/2002.00523">http://arxiv.org/abs/2002.00523</a>.</p>
</div>
<div id="ref-2020-guo">
<p>Guo, Demi, Alexander M. Rush, and Yoon Kim. 2020. “Parameter-Efficient Transfer Learning with Diff Pruning.” <a href="http://arxiv.org/abs/2012.07463">http://arxiv.org/abs/2012.07463</a>.</p>
</div>
<div id="ref-2019-guo">
<p>Guo, Fu-Ming, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019. “Reweighted Proximal Pruning for Large-Scale Language Representation.” <a href="http://arxiv.org/abs/1909.12486">http://arxiv.org/abs/1909.12486</a>.</p>
</div>
<div id="ref-guo2019startransformer">
<p>Guo, Qipeng, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. “Star-Transformer.” In <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</em>, 1315–25. <a href="http://arxiv.org/abs/1902.09113">http://arxiv.org/abs/1902.09113</a>.</p>
</div>
<div id="ref-2016-guo">
<p>Guo, Yiwen, Anbang Yao, and Yurong Chen. 2016. “Dynamic Network Surgery for Efficient DNNs.” <a href="http://arxiv.org/abs/1608.04493">http://arxiv.org/abs/1608.04493</a>.</p>
</div>
<div id="ref-guo2018sparse">
<p>Guo, Yiwen, Chao Zhang, Changshui Zhang, and Yurong Chen. 2018. “Sparse DNNs with Improved Adversarial Robustness.” In <em>Advances in Neural Information Processing Systems</em>, 242–51.</p>
</div>
<div id="ref-2020-gupta">
<p>Gupta, Manish, and Puneet Agrawal. 2020. “Compression of Deep Learning Models for Text: A Survey.” <a href="http://arxiv.org/abs/2008.05221">http://arxiv.org/abs/2008.05221</a>.</p>
</div>
<div id="ref-2019-gupta">
<p>Gupta, Udit, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 2019. “MASR: A Modular Accelerator for Sparse RNNs.” <a href="http://arxiv.org/abs/1908.08976">http://arxiv.org/abs/1908.08976</a>.</p>
</div>
<div id="ref-1993-hagiwara">
<p>Hagiwara, Masafumi. 1993. “Removal of Hidden Units and Weights for Back Propagation Networks.” In <em>Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan)</em>, 1:351–54. IEEE.</p>
</div>
<div id="ref-1994-hagiwara">
<p>———. 1994. “A Simple and Effective Method for Removal of Hidden Units and Weights.” <em>Neurocomputing</em> 6 (2): 207–18. <a href="https://doi.org/10.1016/0925-2312(94)90055-8">https://doi.org/10.1016/0925-2312(94)90055-8</a>.</p>
</div>
<div id="ref-2012-han">
<p>Han, Hong-Gui, and Jun-Fei Qiao. 2013. “A Structure Optimisation Algorithm for Feedforward Neural Network Construction.” <em>Neurocomputing</em> 99: 347–57.</p>
</div>
<div id="ref-2016-han-ese">
<p>Han, Song, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, et al. 2017. “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.” <a href="http://arxiv.org/abs/1612.00694">http://arxiv.org/abs/1612.00694</a>.</p>
</div>
<div id="ref-2016-han-eie">
<p>Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.” <a href="http://arxiv.org/abs/1602.01528">http://arxiv.org/abs/1602.01528</a>.</p>
</div>
<div id="ref-2015-han">
<p>Han, Song, Huizi Mao, and William J. Dally. 2016. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” <a href="http://arxiv.org/abs/1510.00149">http://arxiv.org/abs/1510.00149</a>.</p>
</div>
<div id="ref-2017-han">
<p>Han, Song, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, et al. 2017. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks.” <a href="http://arxiv.org/abs/1607.04381">http://arxiv.org/abs/1607.04381</a>.</p>
</div>
<div id="ref-2015-han-learning">
<p>Han, Song, Jeff Pool, John Tran, and William Dally. 2015. “Learning Both Weights and Connections for Efficient Neural Network.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:1135–43. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf">https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf</a>.</p>
</div>
<div id="ref-1994-hansen">
<p>Hansen, Lars Kai, and others. 1994. “Controlled Growth of Cascade Correlation Nets.” In <em>International Conference on Artificial Neural Networks</em>, 797–800. Springer.</p>
</div>
<div id="ref-NIPS1988_1c9ac015">
<p>Hanson, Stephen, and Lorien Pratt. 1989. “Comparing Biases for Minimal Network Construction with Back-Propagation.” In <em>Advances in Neural Information Processing Systems</em>, edited by D. Touretzky, 1:177–85. Morgan-Kaufmann. <a href="https://proceedings.neurips.cc/paper/1988/file/1c9ac0159c94d8d0cbedc973445af2da-Paper.pdf">https://proceedings.neurips.cc/paper/1988/file/1c9ac0159c94d8d0cbedc973445af2da-Paper.pdf</a>.</p>
</div>
<div id="ref-1992-hassibi">
<p>Hassibi, Babak, and David G. Stork. 1992. “Second Order Derivatives for Network Pruning: Optimal Brain Surgeon.” In <em>Advances in Neural Information Processing Systems 5, [NIPS Conference]</em>, 164–71. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2017-hawkins">
<p>Hawkins, J. 2017. “Special Report: Can We Copy the Brain? What Intelligent Machines Need to Learn from the Neocortex.” <em>IEEE Spectrum</em> 54 (6): 34–71. <a href="https://doi.org/10.1109/MSPEC.2017.7934229">https://doi.org/10.1109/MSPEC.2017.7934229</a>.</p>
</div>
<div id="ref-2020-hayou">
<p>Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. “Pruning Untrained Neural Networks: Principles and Analysis.” <a href="http://arxiv.org/abs/2002.08797">http://arxiv.org/abs/2002.08797</a>.</p>
</div>
<div id="ref-2015-he-init">
<p>He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.” <a href="http://arxiv.org/abs/1502.01852">http://arxiv.org/abs/1502.01852</a>.</p>
</div>
<div id="ref-2017-he-mask">
<p>He, K., G. Gkioxari, P. Dollár, and R. Girshick. 2017. “Mask R-CNN.” In <em>2017 IEEE International Conference on Computer Vision (ICCV)</em>, 2980–8. <a href="https://doi.org/10.1109/ICCV.2017.322">https://doi.org/10.1109/ICCV.2017.322</a>.</p>
</div>
<div id="ref-2016-he">
<p>He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” In <em>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 770–78.</p>
</div>
<div id="ref-2019-he-fpgm">
<p>He, Yang, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. “Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration.” <a href="http://arxiv.org/abs/1811.00250">http://arxiv.org/abs/1811.00250</a>.</p>
</div>
<div id="ref-2019-he">
<p>He, Yihui, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2019. “AMC: AutoML for Model Compression and Acceleration on Mobile Devices.” <a href="http://arxiv.org/abs/1802.03494">http://arxiv.org/abs/1802.03494</a>.</p>
</div>
<div id="ref-2017-he">
<p>He, Yihui, Xiangyu Zhang, and Jian Sun. 2017. “Channel Pruning for Accelerating Very Deep Neural Networks.” <a href="http://arxiv.org/abs/1707.06168">http://arxiv.org/abs/1707.06168</a>.</p>
</div>
<div id="ref-hebb-organization-of-behavior-1949">
<p>Hebb, Donald O. 1949. <em>The Organization of Behavior: A Neuropsychological Theory</em>. New York: Wiley.</p>
</div>
<div id="ref-2020-hegde">
<p>Hegde, Kartik, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. “ExTensor: An Accelerator for Sparse Tensor Algebra.” In <em>Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture</em>, 319–33. MICRO ’52. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3352460.3358275">https://doi.org/10.1145/3352460.3358275</a>.</p>
</div>
<div id="ref-2019-hendrycks-imagenetc">
<p>Hendrycks, Dan, and Thomas Dietterich. 2019. “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations.” In <em>Proceedings of the Seventh International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1903.12261">http://arxiv.org/abs/1903.12261</a>.</p>
</div>
<div id="ref-2019-hendrycks-imageneta">
<p>Hendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2019. “Natural Adversarial Examples.” <a href="http://arxiv.org/abs/1907.07174">http://arxiv.org/abs/1907.07174</a>.</p>
</div>
<div id="ref-Herculano-Houzel19008">
<p>Herculano-Houzel, Suzana, Bruno Mota, Peiyan Wong, and Jon H. Kaas. 2010. “Connectivity-Driven White Matter Scaling and Folding in Primate Cerebral Cortex.” <em>Proceedings of the National Academy of Sciences</em> 107 (44): 19008–13. <a href="https://doi.org/10.1073/pnas.1012590107">https://doi.org/10.1073/pnas.1012590107</a>.</p>
</div>
<div id="ref-2017-hill">
<p>Hill, P., A. Jain, M. Hill, B. Zamirai, C. Hsu, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars. 2017. “DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-Compute Data Fission.” In <em>2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</em>, 786–99.</p>
</div>
<div id="ref-2012-hinton">
<p>Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” <a href="http://arxiv.org/abs/1207.0580">http://arxiv.org/abs/1207.0580</a>.</p>
</div>
<div id="ref-1993-hinton">
<p>Hinton, Geoffrey E, and Drew Van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In <em>Proceedings of the Sixth Annual Conference on Computational Learning Theory</em>, 5–13.</p>
</div>
<div id="ref-hinton2015distilling">
<p>Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” <a href="http://arxiv.org/abs/1503.02531">http://arxiv.org/abs/1503.02531</a>.</p>
</div>
<div id="ref-benchmarking">
<p>Hoefler, Torsten, and Roberto Belli. 2015. “Scientific Benchmarking of Parallel Computing Systems.” In, 73:1–73:12. Austin, TX, USA: ACM.</p>
</div>
<div id="ref-2019-hooker">
<p>Hooker, Sara, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. 2019. “What Do Compressed Deep Neural Networks Forget?” <a href="http://arxiv.org/abs/1911.05248">http://arxiv.org/abs/1911.05248</a>.</p>
</div>
<div id="ref-2020-hooker">
<p>Hooker, Sara, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. 2020. “Characterising Bias in Compressed Models.” <a href="http://arxiv.org/abs/2010.03058">http://arxiv.org/abs/2010.03058</a>.</p>
</div>
<div id="ref-howard2017mobilenets">
<p>Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” <a href="http://arxiv.org/abs/1704.04861">http://arxiv.org/abs/1704.04861</a>.</p>
</div>
<div id="ref-hoyer2004non">
<p>Hoyer, Patrik O. 2004. “Non-Negative Matrix Factorization with Sparseness Constraints.” <em>Journal of Machine Learning Research</em> 5 (Nov): 1457–69.</p>
</div>
<div id="ref-2016-hu">
<p>Hu, Hengyuan, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. “Network Trimming: A Data-Driven Neuron Pruning Approach Towards Efficient Deep Architectures.” <a href="http://arxiv.org/abs/1607.03250">http://arxiv.org/abs/1607.03250</a>.</p>
</div>
<div id="ref-2020-hu">
<p>Hu, Yuwei, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. “FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems.” In <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>. SC ’20. Atlanta, Georgia: IEEE Press.</p>
</div>
<div id="ref-2016-huang">
<p>Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In <em>Computer Vision – ECCV 2016</em>, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Cham: Springer International Publishing.</p>
</div>
<div id="ref-2018-huang">
<p>Huang, Zehao, and Naiyan Wang. 2018. “Data-Driven Sparse Structure Selection for Deep Neural Networks.” <a href="http://arxiv.org/abs/1707.01213">http://arxiv.org/abs/1707.01213</a>.</p>
</div>
<div id="ref-2019-huang">
<p>Huang, Ziyue, Wang Yilei, Ke Yi, and others. 2019. “Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation.” In <em>Advances in Neural Information Processing Systems</em>, 6371–81.</p>
</div>
<div id="ref-hubara-bin">
<p>Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. “Binarized Neural Networks.” In <em>Proceedings of the 30th International Conference on Neural Information Processing Systems</em>, 4114–22. NIPS’16. Red Hook, NY, USA: Curran Associates Inc.</p>
</div>
<div id="ref-iandola2016squeezenet">
<p>Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.” <a href="http://arxiv.org/abs/1602.07360">http://arxiv.org/abs/1602.07360</a>.</p>
</div>
<div id="ref-2015-ioffe">
<p>Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” <a href="http://arxiv.org/abs/1502.03167">http://arxiv.org/abs/1502.03167</a>.</p>
</div>
<div id="ref-ivanov2020data">
<p>Ivanov, Andrei, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. “Data Movement Is All You Need: A Case Study on Optimizing Transformers.” <a href="http://arxiv.org/abs/2007.00072">http://arxiv.org/abs/2007.00072</a>.</p>
</div>
<div id="ref-2019-ivkin">
<p>Ivkin, Nikita, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, and others. 2019. “Communication-Efficient Distributed SGD with Sketching.” In <em>Advances in Neural Information Processing Systems</em>, 13144–54. <a href="http://arxiv.org/abs/1903.04488">http://arxiv.org/abs/1903.04488</a>.</p>
</div>
<div id="ref-jacobs1991adaptive">
<p>Jacobs, Robert A, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. “Adaptive Mixtures of Local Experts.” <em>Neural Computation</em> 3 (1): 79–87.</p>
</div>
<div id="ref-jan2019iwslt">
<p>Jan, Niehues, Roldano Cattoni, Stuker Sebastian, Matteo Negri, Marco Turchi, Salesky Elizabeth, Sanabria Ramon, Barrault Loic, Specia Lucia, and Marcello Federico. 2019. “The IWSLT 2019 Evaluation Campaign.” In <em>16th International Workshop on Spoken Language Translation 2019</em>.</p>
</div>
<div id="ref-1989-janowsky">
<p>Janowsky, Steven A. 1989. “Pruning Versus Clipping in Neural Networks.” <em>Physical Review A</em> 39 (12): 6600.</p>
</div>
<div id="ref-2020-jayakumar">
<p>Jayakumar, Siddhant, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. 2020. “Top-KAST: Top-K Always Sparse Training.” <em>Advances in Neural Information Processing Systems</em> 33.</p>
</div>
<div id="ref-2018-jiang">
<p>Jiang, Peng, and Gagan Agrawal. 2018. “A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication.” In <em>Advances in Neural Information Processing Systems</em>, 2525–36.</p>
</div>
<div id="ref-2019-jin">
<p>Jin, Sian, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, and Franck Cappello. 2019. “DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression.” In <em>Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing</em>, 159–70. HPDC ’19. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3307681.3326608">https://doi.org/10.1145/3307681.3326608</a>.</p>
</div>
<div id="ref-2016-jin">
<p>Jin, Xiaojie, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. 2016. “Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods.” <a href="http://arxiv.org/abs/1607.05423">http://arxiv.org/abs/1607.05423</a>.</p>
</div>
<div id="ref-jones2006cognitive">
<p>Jones, Sari, Lars Nyberg, Johan Sandblom, Anna Stigsdotter Neely, Martin Ingvar, Karl Magnus Petersson, and Lars Bäckman. 2006. “Cognitive and Neural Plasticity in Aging: General and Task-Specific Limitations.” <em>Neuroscience & Biobehavioral Reviews</em> 30 (6): 864–71.</p>
</div>
<div id="ref-jordan1994hierarchical">
<p>Jordan, Michael I, and Robert A Jacobs. 1994. “Hierarchical Mixtures of Experts and the EM Algorithm.” <em>Neural Computation</em> 6 (2): 181–214.</p>
</div>
<div id="ref-2020-jorge">
<p>Jorge, Pau de, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania. 2020. “Progressive Skeletonization: Trimming More Fat from a Network at Initialization.” <a href="http://arxiv.org/abs/2006.09081">http://arxiv.org/abs/2006.09081</a>.</p>
</div>
<div id="ref-kalchbrenner2018efficient">
<p>Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” <a href="http://arxiv.org/abs/1802.08435">http://arxiv.org/abs/1802.08435</a>.</p>
</div>
<div id="ref-1991-kameyama">
<p>Kameyama, K., and Y. Kosugi. 1991. “Automatic Fusion and Splitting of Artificial Neural Elements in Optimizing the Network Size.” In <em>Conference Proceedings 1991 IEEE International Conference on Systems, Man, and Cybernetics</em>, 1633–38 vol. 3. <a href="https://doi.org/10.1109/ICSMC.1991.169926">https://doi.org/10.1109/ICSMC.1991.169926</a>.</p>
</div>
<div id="ref-2020-kang">
<p>Kang, Minsoo, and Bohyung Han. 2020. “Operation-Aware Soft Channel Pruning Using Differentiable Masks.” <a href="http://arxiv.org/abs/2007.03938">http://arxiv.org/abs/2007.03938</a>.</p>
</div>
<div id="ref-1993-kanjilal">
<p>Kanjilal, P. P., P. K. Dey, and D. N. Banerjee. 1993. “Reduced-Size Neural Networks Through Singular Value Decomposition and Subset Selection.” <em>Electronics Letters</em> 29 (17): 1516–8. <a href="https://doi.org/10.1049/el:19931010">https://doi.org/10.1049/el:19931010</a>.</p>
</div>
<div id="ref-kaplan2020scaling">
<p>Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” <a href="http://arxiv.org/abs/2001.08361">http://arxiv.org/abs/2001.08361</a>.</p>
</div>
<div id="ref-2019-karimireddy">
<p>Karimireddy, Sai Praneeth, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. 2019. “Error Feedback Fixes SignSGD and Other Gradient Compression Schemes.” In <em>Proceedings of the Thirty-Sixth International Conference on Machine Learning</em>, 3252–61. <a href="http://arxiv.org/abs/1901.09847">http://arxiv.org/abs/1901.09847</a>.</p>
</div>
<div id="ref-1990-karnin">
<p>Karnin, E. D. 1990. “A Simple Procedure for Pruning Back-Propagation Trained Neural Networks.” <em>IEEE Transactions on Neural Networks</em> 1 (2): 239–42. <a href="https://doi.org/10.1109/72.80236">https://doi.org/10.1109/72.80236</a>.</p>
</div>
<div id="ref-Kerr14063">
<p>Kerr, Jason N. D., David Greenberg, and Fritjof Helmchen. 2005. “Imaging Input and Output of Neocortical Networks in Vivo.” <em>Proceedings of the National Academy of Sciences</em> 102 (39): 14063–8. <a href="https://doi.org/10.1073/pnas.0506029102">https://doi.org/10.1073/pnas.0506029102</a>.</p>
</div>
<div id="ref-2017-kim">
<p>Kim, D., J. Ahn, and S. Yoo. 2018. “ZeNA: Zero-Aware Neural Network Accelerator.” <em>IEEE Design & Test</em> 35 (1): 39–46. <a href="https://doi.org/10.1109/MDAT.2017.2741463">https://doi.org/10.1109/MDAT.2017.2741463</a>.</p>
</div>
<div id="ref-2015-kingma">
<p>Kingma, Diederik P, Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:2575–83. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf">https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf</a>.</p>
</div>
<div id="ref-2013-kingma">
<p>Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” <a href="http://arxiv.org/abs/1312.6114">http://arxiv.org/abs/1312.6114</a>.</p>
</div>
<div id="ref-2019-kodryan">
<p>Kodryan, Maxim, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. 2019. “Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks.” In <em>Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</em>, 40–48.</p>
</div>
<div id="ref-2018-konecny">
<p>Konečný, Jakub, and Peter Richtárik. 2018. “Randomized Distributed Mean Estimation: Accuracy Vs. Communication.” <em>Frontiers in Applied Mathematics and Statistics</em> 4: 62. <a href="http://arxiv.org/abs/1611.07555">http://arxiv.org/abs/1611.07555</a>.</p>
</div>
<div id="ref-10.5555/2986916.2987033">
<p>Krogh, Anders, and John A. Hertz. 1991. “A Simple Weight Decay Can Improve Generalization.” In <em>Proceedings of the 4th International Conference on Neural Information Processing Systems</em>, 950–57. NIPS’91. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2017-krueger">
<p>Krueger, David, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. 2017. “Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2018-kung">
<p>Kung, H. T., Bradley McDanel, and Sai Qian Zhang. 2018. “Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization.” <a href="http://arxiv.org/abs/1811.04770">http://arxiv.org/abs/1811.04770</a>.</p>
</div>
<div id="ref-2019-kunstner">
<p>Kunstner, Frederik, Philipp Hennig, and Lukas Balles. 2019. “Limitations of the Empirical Fisher Approximation for Natural Gradient Descent.” In <em>Advances in Neural Information Processing Systems</em>, 4156–67.</p>
</div>
<div id="ref-2020-kurtz">
<p>Kurtz, Mark, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. 2020. “Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks.” In <em>International Conference on Machine Learning</em>, 5533–43. PMLR.</p>
</div>
<div id="ref-2020-kusupati">
<p>Kusupati, Aditya, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. 2020. “Soft Threshold Weight Reparameterization for Learnable Sparsity.” <a href="http://arxiv.org/abs/2002.03231">http://arxiv.org/abs/2002.03231</a>.</p>
</div>
<div id="ref-2019-kuzmin">
<p>Kuzmin, Andrey, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, and Max Welling. 2019. “Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1912.09802">http://arxiv.org/abs/1912.09802</a>.</p>
</div>
<div id="ref-kwiatkowski2019natural">
<p>Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, et al. 2019. “Natural Questions: A Benchmark for Question Answering Research.” <em>Transactions of the Association for Computational Linguistics</em> 7: 453–66.</p>
</div>
<div id="ref-2019-lample">
<p>Lample, Guillaume, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. “Large Memory Layers with Product Keys.” <a href="http://arxiv.org/abs/1907.05242">http://arxiv.org/abs/1907.05242</a>.</p>
</div>
<div id="ref-2017-larsson">
<p>Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. 2017. “FractalNet: Ultra-Deep Neural Networks Without Residuals.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2006-lauret">
<p>Lauret, Philippe, Eric Fock, and Thierry Alex Mara. 2006. “A Node Pruning Algorithm Based on a Fourier Amplitude Sensitivity Test Method.” <em>IEEE Transactions on Neural Networks</em> 17 (2): 273–93.</p>
</div>
<div id="ref-7780804">
<p>Lavin, A., and S. Gray. 2016. “Fast Algorithms for Convolutional Neural Networks.” In <em>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 4013–21. <a href="https://doi.org/10.1109/CVPR.2016.435">https://doi.org/10.1109/CVPR.2016.435</a>.</p>
</div>
<div id="ref-2015-lebedev">
<p>Lebedev, Vadim, and Victor Lempitsky. 2015. “Fast Convnets Using Group-Wise Brain Damage.” <a href="http://arxiv.org/abs/1506.02515">http://arxiv.org/abs/1506.02515</a>.</p>
</div>
<div id="ref-1990-lecun">
<p>Le Cun, Yann, John S. Denker, and Sara A. Solla. 1990. “Optimal Brain Damage.” In <em>Advances in Neural Information Processing Systems 2</em>, 598–605. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2019-lee-init">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. 2020. “A Signal Propagation Perspective for Pruning Neural Networks at Initialization.” <a href="http://arxiv.org/abs/1906.06307">http://arxiv.org/abs/1906.06307</a>.</p>
</div>
<div id="ref-2019-lee">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity.” <a href="http://arxiv.org/abs/1810.02340">http://arxiv.org/abs/1810.02340</a>.</p>
</div>
<div id="ref-2020-lee">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, Philip H. S. Torr, and Martin Jaggi. 2020. “Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training.” <a href="http://arxiv.org/abs/2003.11316">http://arxiv.org/abs/2003.11316</a>.</p>
</div>
<div id="ref-lepikhin2020gshard">
<p>Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” <a href="http://arxiv.org/abs/2006.16668">http://arxiv.org/abs/2006.16668</a>.</p>
</div>
<div id="ref-2017-li">
<p>Li, Hao, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. “Pruning Filters for Efficient Convnets.” <a href="http://arxiv.org/abs/1608.08710">http://arxiv.org/abs/1608.08710</a>.</p>
</div>
<div id="ref-2019-li">
<p>Li, J., S. Jiang, S. Gong, J. Wu, J. Yan, G. Yan, and X. Li. 2019. “SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules.” <em>IEEE Transactions on Computers</em> 68 (11): 1663–77. <a href="https://doi.org/10.1109/TC.2019.2924215">https://doi.org/10.1109/TC.2019.2924215</a>.</p>
</div>
<div id="ref-2020-li-sac">
<p>Li, Xiaoya, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020. “SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection.” <a href="http://arxiv.org/abs/2003.09833">http://arxiv.org/abs/2003.09833</a>.</p>
</div>
<div id="ref-li2020explaining">
<p>Li, Yuanzhi, Colin Wei, and Tengyu Ma. 2020. “Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks.” <a href="http://arxiv.org/abs/1907.04595">http://arxiv.org/abs/1907.04595</a>.</p>
</div>
<div id="ref-2020-li-bits">
<p>Li, Yunqiang, Silvia Laura Pintea, and Jan van Gemert. 2021. “Less Bits Is More: How Pruning Deep Binary Networks Increases Weight Capacity.” <a href="https://openreview.net/forum?id=Hy8JM_Fvt5N">https://openreview.net/forum?id=Hy8JM_Fvt5N</a>.</p>
</div>
<div id="ref-2020-li">
<p>Li, Zhuohan, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” <a href="http://arxiv.org/abs/2002.11794">http://arxiv.org/abs/2002.11794</a>.</p>
</div>
<div id="ref-2019-lieberwein">
<p>Liebenwein, Lucas, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. 2020. “Provable Filter Pruning for Efficient Neural Networks.” <a href="http://arxiv.org/abs/1911.07412">http://arxiv.org/abs/1911.07412</a>.</p>
</div>
<div id="ref-lillicrap2019continuous">
<p>Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2019. “Continuous Control with Deep Reinforcement Learning.” <a href="http://arxiv.org/abs/1509.02971">http://arxiv.org/abs/1509.02971</a>.</p>
</div>
<div id="ref-lillicrap2020backpropagation">
<p>Lillicrap, Timothy P, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. 2020. “Backpropagation and the Brain.” <em>Nature Reviews Neuroscience</em>, 1–12.</p>
</div>
<div id="ref-2019-lim">
<p>Lim, Hyeontaek, David Andersen, and Michael Kaminsky. 2019. “3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning.” In <em>Proceedings of the Conference on Systems and Machine Learning</em>. <a href="http://arxiv.org/abs/1802.07389">http://arxiv.org/abs/1802.07389</a>.</p>
</div>
<div id="ref-2017-lin">
<p>Lin, Ji, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. “Runtime Neural Pruning.” In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:2181–91. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf</a>.</p>
</div>
<div id="ref-2020-lin">
<p>Lin, Tao, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020. “Dynamic Model Pruning with Feedback.” <a href="http://arxiv.org/abs/2006.07253">http://arxiv.org/abs/2006.07253</a>.</p>
</div>
<div id="ref-2018-lin">
<p>Lin, Yujun, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2018. “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.” In <em>Proceedings of the Sixth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1712.01887">http://arxiv.org/abs/1712.01887</a>.</p>
</div>
<div id="ref-2020-lin-tformer">
<p>Lin, Zi, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. 2020. “Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior.” In <em>Findings of the Association for Computational Linguistics: EMNLP 2020</em>, 719–30. <a href="http://arxiv.org/abs/2010.01791">http://arxiv.org/abs/2010.01791</a>.</p>
</div>
<div id="ref-lison2019open">
<p>Lison, Pierre, Jörg Tiedemann, Milen Kouylekov, and others. 2019. “OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora.” In <em>LREC 2018, Eleventh International Conference on Language Resources and Evaluation</em>. European Language Resources Association (ELRA).</p>
</div>
<div id="ref-2015-liu">
<p>Liu, Baoyuan, Min Wang, H. Foroosh, M. Tappen, and M. Penksy. 2015. “Sparse Convolutional Neural Networks.” In <em>2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 806–14. <a href="https://doi.org/10.1109/CVPR.2015.7298681">https://doi.org/10.1109/CVPR.2015.7298681</a>.</p>
</div>
<div id="ref-liu2018dynamic">
<p>Liu, Lanlan, and Jia Deng. 2018. “Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-Offs by Selective Execution.” <a href="http://arxiv.org/abs/1701.00299">http://arxiv.org/abs/1701.00299</a>.</p>
</div>
<div id="ref-2019-liu-dynamic">
<p>Liu, Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. 2019. “Dynamic Sparse Graph for Efficient Deep Learning.” <a href="http://arxiv.org/abs/1810.00859">http://arxiv.org/abs/1810.00859</a>.</p>
</div>
<div id="ref-2020-liu">
<p>Liu, Tianlin, and Friedemann Zenke. 2020. “Finding Trainable Sparse Networks Through Neural Tangent Transfer.” <a href="http://arxiv.org/abs/2006.08228">http://arxiv.org/abs/2006.08228</a>.</p>
</div>
<div id="ref-2018-liu-winograd">
<p>Liu, Xingyu, Jeff Pool, Song Han, and William J. Dally. 2018. “Efficient Sparse-Winograd Convolutional Neural Networks.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2019-liu">
<p>Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.</p>
</div>
<div id="ref-2017-liu">
<p>Liu, Zhuang, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. “Learning Efficient Convolutional Networks Through Network Slimming.” <a href="http://arxiv.org/abs/1708.06519">http://arxiv.org/abs/1708.06519</a>.</p>
</div>
<div id="ref-2018-liu">
<p>Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019. “Rethinking the Value of Network Pruning.” <a href="http://arxiv.org/abs/1810.05270">http://arxiv.org/abs/1810.05270</a>.</p>
</div>
<div id="ref-2015-liu-celeba">
<p>Liu, Ziwei, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. “Deep Learning Face Attributes in the Wild.” In <em>Proceedings of the IEEE International Conference on Computer Vision</em>, 3730–38. <a href="http://arxiv.org/abs/1411.7766">http://arxiv.org/abs/1411.7766</a>.</p>
</div>
<div id="ref-2018-lobacheva">
<p>Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2018. “Bayesian Sparsification of Gated Recurrent Neural Networks.” <a href="http://arxiv.org/abs/1812.05692">http://arxiv.org/abs/1812.05692</a>.</p>
</div>
<div id="ref-2019-loshchilov">
<p>Loshchilov, Ilya, and Frank Hutter. 2019. “Decoupled Weight Decay Regularization.” In <em>Proceedings of the Seventh International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1711.05101">http://arxiv.org/abs/1711.05101</a>.</p>
</div>
<div id="ref-2017-louizos-bayes">
<p>Louizos, Christos, Karen Ullrich, and Max Welling. 2017. “Bayesian Compression for Deep Learning.” <a href="http://arxiv.org/abs/1705.08665">http://arxiv.org/abs/1705.08665</a>.</p>
</div>
<div id="ref-2018-louizos">
<p>Louizos, Christos, Max Welling, and Diederik P. Kingma. 2018. “Learning Sparse Neural Networks Through <span class="math inline"><em>L</em><sub>0</sub></span> Regularization.” <a href="http://arxiv.org/abs/1712.01312">http://arxiv.org/abs/1712.01312</a>.</p>
</div>
<div id="ref-2019-luo">
<p>Luo, Jian-Hao, and Jianxin Wu. 2019. “AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference.” <a href="http://arxiv.org/abs/1805.08941">http://arxiv.org/abs/1805.08941</a>.</p>
</div>
<div id="ref-2017-luo">
<p>Luo, Jian-Hao, Jianxin Wu, and Weiyao Lin. 2017. “ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression.” <a href="http://arxiv.org/abs/1707.06342">http://arxiv.org/abs/1707.06342</a>.</p>
</div>
<div id="ref-ly2017tutorial">
<p>Ly, Alexander, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. 2017. “A Tutorial on Fisher Information.” <a href="http://arxiv.org/abs/1705.01064">http://arxiv.org/abs/1705.01064</a>.</p>
</div>
<div id="ref-2019-lym">
<p>Lym, Sangkug, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. “PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration.” <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>, November. <a href="https://doi.org/10.1145/3295500.3356156">https://doi.org/10.1145/3295500.3356156</a>.</p>
</div>
<div id="ref-2020-madaan">
<p>Madaan, Divyam, Jinwoo Shin, and Sung Ju Hwang. 2020. “Adversarial Neural Pruning with Latent Vulnerability Suppression.” <a href="http://arxiv.org/abs/1908.04355">http://arxiv.org/abs/1908.04355</a>.</p>
</div>
<div id="ref-2017-maddison">
<p>Maddison, Chris J., Andriy Mnih, and Yee Whye Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-makhzani2015winnertakeall">
<p>Makhzani, Alireza, and Brendan Frey. 2015. “Winner-Take-All Autoencoders.” <a href="http://arxiv.org/abs/1409.2752">http://arxiv.org/abs/1409.2752</a>.</p>
</div>
<div id="ref-2020-malach">
<p>Malach, Eran, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. 2020. “Proving the Lottery Ticket Hypothesis: Pruning Is All You Need.” <a href="http://arxiv.org/abs/2002.00585">http://arxiv.org/abs/2002.00585</a>.</p>
</div>
<div id="ref-2018-malaviya">
<p>Malaviya, Chaitanya, Pedro Ferreira, and André FT Martins. 2018. “Sparse and Constrained Attention for Neural Machine Translation.” In <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>. <a href="http://arxiv.org/abs/1805.08241">http://arxiv.org/abs/1805.08241</a>.</p>
</div>
<div id="ref-2018-mallya">
<p>Mallya, Arun, and Svetlana Lazebnik. 2018. “PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning.” <a href="http://arxiv.org/abs/1711.05769">http://arxiv.org/abs/1711.05769</a>.</p>
</div>
<div id="ref-2018-manessi">
<p>Manessi, Franco, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schettini. 2018. “Automated Pruning for Deep Neural Network Compression.” <em>2018 24th International Conference on Pattern Recognition (ICPR)</em>, August. <a href="https://doi.org/10.1109/icpr.2018.8546129">https://doi.org/10.1109/icpr.2018.8546129</a>.</p>
</div>
<div id="ref-2017-mao">
<p>Mao, Huizi, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. 2017. “Exploring the Regularity of Sparse Structure in Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1705.08922">http://arxiv.org/abs/1705.08922</a>.</p>
</div>
<div id="ref-2015-mariet">
<p>Mariet, Zelda, and Suvrit Sra. 2017. “Diversity Networks: Neural Network Compression Using Determinantal Point Processes.” <a href="http://arxiv.org/abs/1511.05077">http://arxiv.org/abs/1511.05077</a>.</p>
</div>
<div id="ref-2015-martens">
<p>Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” <a href="http://arxiv.org/abs/1503.05671">http://arxiv.org/abs/1503.05671</a>.</p>
</div>
<div id="ref-2016-martins">
<p>Martins, Andre, and Ramon Astudillo. 2016. “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification.” In <em>International Conference on Machine Learning</em>, 1614–23. <a href="http://arxiv.org/abs/1602.02068">http://arxiv.org/abs/1602.02068</a>.</p>
</div>
<div id="ref-2019-mattson">
<p>Mattson, Peter, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, et al. 2020. “MLPerf Training Benchmark.” <a href="http://arxiv.org/abs/1910.01500">http://arxiv.org/abs/1910.01500</a>.</p>
</div>
<div id="ref-mccandlish2018empirical">
<p>McCandlish, Sam, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. “An Empirical Model of Large-Batch Training.” <a href="http://arxiv.org/abs/1812.06162">http://arxiv.org/abs/1812.06162</a>.</p>
</div>
<div id="ref-2019-mccarley">
<p>McCarley, J. S., Rishav Chakravarti, and Avirup Sil. 2020. “Structured Pruning of a BERT-Based Question Answering Model.” <a href="http://arxiv.org/abs/1910.06360">http://arxiv.org/abs/1910.06360</a>.</p>
</div>
<div id="ref-2019-mehta">
<p>Mehta, Rahul. 2019. “Sparse Transfer Learning via Winning Lottery Tickets.” <a href="http://arxiv.org/abs/1905.07785">http://arxiv.org/abs/1905.07785</a>.</p>
</div>
<div id="ref-2020-meng">
<p>Meng, Fanxu, Hao Cheng, Ke Li, Huixiang Luo, Xiaowei Guo, Guangming Lu, and Xing Sun. 2020. “Pruning Filter in Filter.” <a href="http://arxiv.org/abs/2009.14410">http://arxiv.org/abs/2009.14410</a>.</p>
</div>
<div id="ref-mhaskar2016deep">
<p>Mhaskar, Hrushikesh, and Tomaso Poggio. 2016. “Deep Vs. Shallow Networks: An Approximation Theory Perspective.” <a href="http://arxiv.org/abs/1608.03287">http://arxiv.org/abs/1608.03287</a>.</p>
</div>
<div id="ref-2019-michel">
<p>Michel, Paul, Omer Levy, and Graham Neubig. 2019. “Are Sixteen Heads Really Better Than One?” <a href="http://arxiv.org/abs/1905.10650">http://arxiv.org/abs/1905.10650</a>.</p>
</div>
<div id="ref-millidge2020predictive">
<p>Millidge, Beren, Alexander Tschantz, and Christopher L. Buckley. 2020. “Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs.” <a href="http://arxiv.org/abs/2006.04182">http://arxiv.org/abs/2006.04182</a>.</p>
</div>
<div id="ref-2017-mishra">
<p>Mishra, Asit K., Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. 2017. “WRPN: Wide Reduced-Precision Networks.” <a href="http://arxiv.org/abs/1709.01134">http://arxiv.org/abs/1709.01134</a>.</p>
</div>
<div id="ref-2018-mittal">
<p>Mittal, Deepak, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1801.10447">http://arxiv.org/abs/1801.10447</a>.</p>
</div>
<div id="ref-2018-miyato">
<p>Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. “Spectral Normalization for Generative Adversarial Networks.” In <em>Proceedings of the Sixth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1802.05957">http://arxiv.org/abs/1802.05957</a>.</p>
</div>
<div id="ref-2018-mocanu">
<p>Mocanu, Decebal Constantin, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. “Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science.” <em>Nature Communications</em> 9 (1): 1–12.</p>
</div>
<div id="ref-2016-molchanov-ard">
<p>Molchanov, Dmitry, Arseniy Ashuha, and Dmitry Vetrov. 2016. “Dropout-Based Automatic Relevance Determination.” In <em>Bayesian Deep Learning Workshop, NIPS</em>.</p>
</div>
<div id="ref-2017-molchanov">
<p>Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” <a href="http://arxiv.org/abs/1701.05369">http://arxiv.org/abs/1701.05369</a>.</p>
</div>
<div id="ref-2019-molchanov">
<p>Molchanov, Pavlo, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. “Importance Estimation for Neural Network Pruning.” <a href="http://arxiv.org/abs/1906.10771">http://arxiv.org/abs/1906.10771</a>.</p>
</div>
<div id="ref-2016-molchanov">
<p>Molchanov, Pavlo, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. “Pruning Convolutional Neural Networks for Resource Efficient Inference.” <a href="http://arxiv.org/abs/1611.06440">http://arxiv.org/abs/1611.06440</a>.</p>
</div>
<div id="ref-1991-moody">
<p>Moody, John E. 1991. “Note on Generalization, Regularization and Architecture Selection in Nonlinear Learning Systems.” In <em>Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop</em>, 1–10. IEEE.</p>
</div>
<div id="ref-2020-morcos">
<p>Morcos, Ari S., Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. “One Ticket to Win Them All: Generalizing Lottery Ticket Initializations Across Datasets and Optimizers.” <a href="http://arxiv.org/abs/1906.02773">http://arxiv.org/abs/1906.02773</a>.</p>
</div>
<div id="ref-2019-mostafa">
<p>Mostafa, Hesham, and Xin Wang. 2019. “Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization.” <a href="http://arxiv.org/abs/1902.05967">http://arxiv.org/abs/1902.05967</a>.</p>
</div>
<div id="ref-1988-mozer">
<p>Mozer, Michael C, and Paul Smolensky. 1988. “Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment.” <em>Advances in Neural Information Processing Systems</em> 1: 107–15.</p>
</div>
<div id="ref-2011-mrazova">
<p>Mrázová, I., and Z. Reitermanová. 2011. “A New Sensitivity-Based Pruning Technique for Feed-Forward Neural Networks That Improves Generalization.” In <em>The 2011 International Joint Conference on Neural Networks</em>, 2143–50. <a href="https://doi.org/10.1109/IJCNN.2011.6033493">https://doi.org/10.1109/IJCNN.2011.6033493</a>.</p>
</div>
<div id="ref-2006-mukherjee">
<p>Mukherjee, Sayan, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. “Learning Theory: Stability Is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization.” <em>Advances in Computational Mathematics</em> 25 (1-3): 161–93.</p>
</div>
<div id="ref-2020-mussay">
<p>Mussay, Ben, Daniel Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. 2020. “Data-Independent Structured Pruning of Neural Networks via Coresets.” <a href="http://arxiv.org/abs/2008.08316">http://arxiv.org/abs/2008.08316</a>.</p>
</div>
<div id="ref-2017-narang">
<p>Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017. “Exploring Sparsity in Recurrent Neural Networks.” <a href="http://arxiv.org/abs/1704.05119">http://arxiv.org/abs/1704.05119</a>.</p>
</div>
<div id="ref-2008-narasimha">
<p>Narasimha, Pramod L., Walter H. Delashmit, Michael T. Manry, Jiang Li, and Francisco Maldonado. 2008. “An Integrated Growing-Pruning Method for Feedforward Network Training.” <em>Neurocomputing</em> 71 (13): 2831–47. <a href="https://doi.org/10.1016/j.neucom.2007.08.026">https://doi.org/10.1016/j.neucom.2007.08.026</a>.</p>
</div>
<div id="ref-2017-neklyudov">
<p>Neklyudov, Kirill, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Structured Bayesian Pruning via Log-Normal Multiplicative Noise.” <a href="http://arxiv.org/abs/1705.07283">http://arxiv.org/abs/1705.07283</a>.</p>
</div>
<div id="ref-2020-neyshabur">
<p>Neyshabur, Behnam. 2020. “Towards Learning Convolutions from Scratch.” <a href="http://arxiv.org/abs/2007.13657">http://arxiv.org/abs/2007.13657</a>.</p>
</div>
<div id="ref-neyshabur2018understanding">
<p>Neyshabur, Behnam, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks.” <a href="http://arxiv.org/abs/1805.12076">http://arxiv.org/abs/1805.12076</a>.</p>
</div>
<div id="ref-2010-ngiam">
<p>Ngiam, J., Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng. 2010. “Tiled Convolutional Neural Networks.” In <em>Advances in Neural Information Processing Systems 23</em>, 1279–87.</p>
</div>
<div id="ref-2017-niculae">
<p>Niculae, Vlad, and Mathieu Blondel. 2017. “A Regularized Framework for Sparse and Structured Neural Attention.” In <em>Advances in Neural Information Processing Systems</em>, 3338–48. <a href="http://arxiv.org/abs/1705.07704">http://arxiv.org/abs/1705.07704</a>.</p>
</div>
<div id="ref-2009-nilsson">
<p>Nilsson, Nils J. 2009. <em>The Quest for Artificial Intelligence: A History of Ideas and Achievements</em>. Cambridge University Press.</p>
</div>
<div id="ref-2020-niu">
<p>Niu, Yue, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. 2020. “Reuse Kernels or Activations? A Flexible Dataflow for Low-Latency Spectral CNN Acceleration.” In <em>Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</em>, 266–76. FPGA ’20. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3373087.3375302">https://doi.org/10.1145/3373087.3375302</a>.</p>
</div>
<div id="ref-2019-niu">
<p>Niu, Yue, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal Kannan, Yanzhi Wang, and Viktor Prasanna. 2019. “SPEC2: SPECtral Sparse CNN Accelerator on FPGAs.” <a href="http://arxiv.org/abs/1910.11103">http://arxiv.org/abs/1910.11103</a>.</p>
</div>
<div id="ref-noh2015learning">
<p>Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. 2015. “Learning Deconvolution Network for Semantic Segmentation.” <a href="http://arxiv.org/abs/1505.04366">http://arxiv.org/abs/1505.04366</a>.</p>
</div>
<div id="ref-1992-nowlan">
<p>Nowlan, Steven J, and Geoffrey E Hinton. 1992. “Simplifying Neural Networks by Soft Weight-Sharing.” <em>Neural Computation</em> 4 (4): 473–93.</p>
</div>
<div id="ref-a100">
<p>NVIDIA. 2020. “NVIDIA A100 Tensor Core GPU Architecture.”</p>
</div>
<div id="ref-1996-olshausen">
<p>Olshausen, Bruno A, and David J Field. 1996. “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images.” <em>Nature</em> 381 (6583): 607–9.</p>
</div>
<div id="ref-2020-orseau">
<p>Orseau, Laurent, Marcus Hutter, and Omar Rivasplata. 2020. “Logarithmic Pruning Is All You Need.” <a href="http://arxiv.org/abs/2006.12156">http://arxiv.org/abs/2006.12156</a>.</p>
</div>
<div id="ref-2019-osawa">
<p>Osawa, Kazuki, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2019. “Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks.” <em>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, June. <a href="https://doi.org/10.1109/cvpr.2019.01264">https://doi.org/10.1109/cvpr.2019.01264</a>.</p>
</div>
<div id="ref-2016-pan">
<p>Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks.” <a href="http://arxiv.org/abs/1606.07326">http://arxiv.org/abs/1606.07326</a>.</p>
</div>
<div id="ref-2017-parashar">
<p>Parashar, Angshuman, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. “SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1708.04485">http://arxiv.org/abs/1708.04485</a>.</p>
</div>
<div id="ref-2016-park">
<p>Park, Jongsoo, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. “Faster CNNs with Direct Sparse Convolutions and Guided Pruning.” <a href="http://arxiv.org/abs/1608.01409">http://arxiv.org/abs/1608.01409</a>.</p>
</div>
<div id="ref-2018-parmar">
<p>Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. “Image Transformer.” In <em>International Conference on Machine Learning</em>, 4055–64. <a href="http://arxiv.org/abs/1802.05751">http://arxiv.org/abs/1802.05751</a>.</p>
</div>
<div id="ref-NIPS1995_3473decc">
<p>Pedersen, Morten, Lars Hansen, and Jan Larsen. 1996. “Pruning with Generalization Based Weight Saliencies: λOBD, λOBS.” In <em>Advances in Neural Information Processing Systems</em>, edited by D. Touretzky, M. C. Mozer, and M. Hasselmo, 8:521–27. MIT Press. <a href="https://proceedings.neurips.cc/paper/1995/file/3473decccb0509fb264818a7512a8b9b-Paper.pdf">https://proceedings.neurips.cc/paper/1995/file/3473decccb0509fb264818a7512a8b9b-Paper.pdf</a>.</p>
</div>
<div id="ref-2020-pensia">
<p>Pensia, Ankit, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. 2020. “Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization Is Sufficient.” <a href="http://arxiv.org/abs/2006.07990">http://arxiv.org/abs/2006.07990</a>.</p>
</div>
<div id="ref-plummer2020shapeshifter">
<p>Plummer, Bryan A., Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. 2020. “Shapeshifter Networks: Cross-Layer Parameter Sharing for Scalable and Effective Deep Learning.” <a href="http://arxiv.org/abs/2006.10598">http://arxiv.org/abs/2006.10598</a>.</p>
</div>
<div id="ref-2015-polyak">
<p>Polyak, A., and L. Wolf. 2015. “Channel-Level Acceleration of Deep Face Representations.” <em>IEEE Access</em> 3: 2163–75. <a href="https://doi.org/10.1109/ACCESS.2015.2494536">https://doi.org/10.1109/ACCESS.2015.2494536</a>.</p>
</div>
<div id="ref-10.1145/356616.356618">
<p>Pooch, Udo W., and Al Nieder. 1973. “A Survey of Indexing Techniques for Sparse Matrices.” <em>ACM Comput. Surv.</em> 5 (2): 109–33. <a href="https://doi.org/10.1145/356616.356618">https://doi.org/10.1145/356616.356618</a>.</p>
</div>
<div id="ref-2017-prabhu">
<p>Prabhu, Ameya, Girish Varma, and Anoop Namboodiri. 2018. “Deep Expander Networks: Efficient Deep Networks from Graph Theory.” <a href="http://arxiv.org/abs/1711.08757">http://arxiv.org/abs/1711.08757</a>.</p>
</div>
<div id="ref-2020-prasanna">
<p>Prasanna, Sai, Anna Rogers, and Anna Rumshisky. 2020. “When BERT Plays the Lottery, All Tickets Are Winning.” <a href="http://arxiv.org/abs/2005.00561">http://arxiv.org/abs/2005.00561</a>.</p>
</div>
<div id="ref-1997-prechelt">
<p>Prechelt, Lutz. 1997. “Connection Pruning with Static and Adaptive Pruning Schedules.” <em>Neurocomputing</em> 16 (1): 49–61. <a href="https://doi.org/10.1016/S0925-2312(96)00054-9">https://doi.org/10.1016/S0925-2312(96)00054-9</a>.</p>
</div>
<div id="ref-2020-qin">
<p>Qin, E., A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna. 2020. “SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training.” In <em>2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)</em>, 58–70. <a href="https://doi.org/10.1109/HPCA47549.2020.00015">https://doi.org/10.1109/HPCA47549.2020.00015</a>.</p>
</div>
<div id="ref-2020-raihan">
<p>Raihan, Md Aamir, and Tor M. Aamodt. 2020. “Sparse Weight Activation Training.” <a href="http://arxiv.org/abs/2001.01969">http://arxiv.org/abs/2001.01969</a>.</p>
</div>
<div id="ref-10.1145/3386263.3407651">
<p>Rakin, Adnan Siraj, Zhezhi He, Li Yang, Yanzhi Wang, Liqiang Wang, and Deliang Fan. 2020. “Robust Sparse Regularization: Defending Adversarial Attacks via Regularized Sparse Network.” In <em>Proceedings of the 2020 on Great Lakes Symposium on VLSI</em>, 125–30. GLSVLSI ’20. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3386263.3407651">https://doi.org/10.1145/3386263.3407651</a>.</p>
</div>
<div id="ref-ramanujan2020whats">
<p>Ramanujan, Vivek, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. “What’s Hidden in a Randomly Weighted Neural Network?” <a href="http://arxiv.org/abs/1911.13299">http://arxiv.org/abs/1911.13299</a>.</p>
</div>
<div id="ref-rasmussen2001occam">
<p>Rasmussen, Carl Edward, and Zoubin Ghahramani. 2001. “Occam’s Razor.” In <em>Advances in Neural Information Processing Systems</em>, 294–300.</p>
</div>
<div id="ref-2016-reagen">
<p>Reagen, B., P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Wei, and D. Brooks. 2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In <em>2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>, 267–78. <a href="https://doi.org/10.1109/ISCA.2016.32">https://doi.org/10.1109/ISCA.2016.32</a>.</p>
</div>
<div id="ref-1993-reed">
<p>Reed, R. 1993. “Pruning Algorithms-a Survey.” <em>IEEE Transactions on Neural Networks</em> 4 (5): 740–47. <a href="https://doi.org/10.1109/72.248452">https://doi.org/10.1109/72.248452</a>.</p>
</div>
<div id="ref-2020-renda">
<p>Renda, Alex, Jonathan Frankle, and Michael Carbin. 2020. “Comparing Rewinding and Fine-Tuning in Neural Network Pruning.” <a href="http://arxiv.org/abs/2003.02389">http://arxiv.org/abs/2003.02389</a>.</p>
</div>
<div id="ref-2019-renggli">
<p>Renggli, Cèdric, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, and Torsten Hoefler. 2019. “SparCML: High-Performance Sparse Communication for Machine Learning.” In <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>, 1–15. <a href="http://arxiv.org/abs/1802.08021">http://arxiv.org/abs/1802.08021</a>.</p>
</div>
<div id="ref-reuther2020survey">
<p>Reuther, Albert, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2020. “Survey of Machine Learning Accelerators.” <a href="http://arxiv.org/abs/2009.00993">http://arxiv.org/abs/2009.00993</a>.</p>
</div>
<div id="ref-2014-rezende">
<p>Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Variational Inference in Deep Latent Gaussian Models.” In <em>International Conference on Machine Learning</em>. Vol. 2.</p>
</div>
<div id="ref-2018-rhu">
<p>Rhu, Minsoo, Mike O’Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. “Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks.” In <em>2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)</em>, 78–91. IEEE.</p>
</div>
<div id="ref-2021-rogers">
<p>Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2021. “A Primer in BERTology: What We Know About How BERT Works.” <em>Transactions of the Association for Computational Linguistics</em> 8: 842–66. <a href="http://arxiv.org/abs/2002.12327">http://arxiv.org/abs/2002.12327</a>.</p>
</div>
<div id="ref-rosenbaum2017routing">
<p>Rosenbaum, Clemens, Tim Klinger, and Matthew Riemer. 2017. “Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning.” <a href="http://arxiv.org/abs/1711.01239">http://arxiv.org/abs/1711.01239</a>.</p>