<h1 id="papers-that-use-sparsity-in-deep-learning">Papers that use sparsity in deep learning</h1>
<p>This is a list of papers curated for the paper “Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks”.</p>
<p>The following list is automatically generated from <code>sparsity.bib</code>. To contribute to this list, please open a Pull Request that adds new BibTeX entries to that file.</p>
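<p>For example, a new entry added to <code>sparsity.bib</code> could look like the sketch below; the key, authors, title, and URL are illustrative placeholders, not an actual entry from the file.</p>
<pre><code>% Illustrative placeholder entry; replace every field with the real paper's metadata.
@article{2021-example-sparsity,
  author = {Doe, Jane and Smith, John},
  title  = {An Illustrative Entry on Neural Network Pruning},
  year   = {2021},
  url    = {https://arxiv.org/abs/XXXX.XXXXX}
}</code></pre>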
<h2 id="papers" class="unnumbered">Papers</h2>
<div id="refs" class="references">
<div id="ref-achille2019critical">
<p>Achille, Alessandro, Matteo Rovere, and Stefano Soatto. 2019. “Critical Learning Periods in Deep Neural Networks.” <a href="http://arxiv.org/abs/1711.08856">http://arxiv.org/abs/1711.08856</a>.</p>
</div>
<div id="ref-2020-afghan">
<p>Afghan, Sher, and Uwe Naumann. 2020. “Interval Adjoint Significance Analysis for Neural Networks.” In <em>International Conference on Computational Science</em>, 365–78. Springer.</p>
</div>
<div id="ref-2016-aghasi">
<p>Aghasi, Alireza, Afshin Abdi, Nam Nguyen, and Justin Romberg. 2017. “Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee.” <a href="http://arxiv.org/abs/1611.05162">http://arxiv.org/abs/1611.05162</a>.</p>
</div>
<div id="ref-ahmad2019dense">
<p>Ahmad, Subutai, and Luiz Scheinkman. 2019. “How Can We Be so Dense? The Benefits of Using Highly Sparse Representations.” <a href="http://arxiv.org/abs/1903.11257">http://arxiv.org/abs/1903.11257</a>.</p>
</div>
<div id="ref-2017-aji">
<p>Aji, Alham Fikri, and Kenneth Heafield. 2017. “Sparse Communication for Distributed Gradient Descent.” In <em>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</em>, 440–45. <a href="http://arxiv.org/abs/1704.05021">http://arxiv.org/abs/1704.05021</a>.</p>
</div>
<div id="ref-2016-albericio">
<p>Albericio, J., P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. 2016. “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.” In <em>2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>, 1–13. <a href="https://doi.org/10.1109/ISCA.2016.11">https://doi.org/10.1109/ISCA.2016.11</a>.</p>
</div>
<div id="ref-alistarh2017qsgd">
<p>Alistarh, Dan, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding.” <a href="http://arxiv.org/abs/1610.02132">http://arxiv.org/abs/1610.02132</a>.</p>
</div>
<div id="ref-2018-alistarh">
<p>Alistarh, Dan, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. “The Convergence of Sparsified Gradient Methods.” In <em>Advances in Neural Information Processing Systems</em>, 5973–83. <a href="http://arxiv.org/abs/1809.10505">http://arxiv.org/abs/1809.10505</a>.</p>
</div>
<div id="ref-allenzhu2019convergence">
<p>Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2019. “A Convergence Theory for Deep Learning via over-Parameterization.” <a href="http://arxiv.org/abs/1811.03962">http://arxiv.org/abs/1811.03962</a>.</p>
</div>
<div id="ref-almahairi2016dynamic">
<p>Almahairi, Amjad, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. “Dynamic Capacity Networks.” <a href="http://arxiv.org/abs/1511.07838">http://arxiv.org/abs/1511.07838</a>.</p>
</div>
<div id="ref-2017-alvarez">
<p>Alvarez, Jose M., and Mathieu Salzmann. 2017. “Compression-Aware Training of Deep Networks.” <a href="http://arxiv.org/abs/1711.02638">http://arxiv.org/abs/1711.02638</a>.</p>
</div>
<div id="ref-2016-alwani">
<p>Alwani, Manoj, Han Chen, Michael Ferdman, and Peter Milder. 2016. “Fused-Layer CNN Accelerators.” In <em>The 49th Annual IEEE/ACM International Symposium on Microarchitecture</em>, 22. IEEE Press.</p>
</div>
<div id="ref-1998-amari">
<p>Amari, Shun-ichi. 1998. “Natural Gradient Works Efficiently in Learning.” <em>Neural Computation</em> 10 (2): 251–76. <a href="https://doi.org/10.1162/089976698300017746">https://doi.org/10.1162/089976698300017746</a>.</p>
</div>
<div id="ref-2015-anwar">
<p>Anwar, Sajid, Kyuyeon Hwang, and Wonyong Sung. 2017. “Structured Pruning of Deep Convolutional Neural Networks.” <em>ACM Journal on Emerging Technologies in Computing Systems (JETC)</em> 13 (3): 1–18.</p>
</div>
<div id="ref-2020-atashgahi">
<p>Atashgahi, Zahra, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2020. “Quick and Robust Feature Selection: The Strength of Energy-Efficient Sparse Training for Autoencoders.” <a href="http://arxiv.org/abs/2012.00560">http://arxiv.org/abs/2012.00560</a>.</p>
</div>
<div id="ref-2020-azarian">
<p>Azarian, Kambiz, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. 2020. “Learned Threshold Pruning.” <a href="http://arxiv.org/abs/2003.00075">http://arxiv.org/abs/2003.00075</a>.</p>
</div>
<div id="ref-2016-ba">
<p>Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations.”</p>
</div>
<div id="ref-2016-ba-layernorm">
<p>Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. “Layer Normalization.” <a href="http://arxiv.org/abs/1607.06450">http://arxiv.org/abs/1607.06450</a>.</p>
</div>
<div id="ref-2020-baalen">
<p>Baalen, Mart van, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. 2020. “Bayesian Bits: Unifying Quantization and Pruning.” <a href="http://arxiv.org/abs/2005.07093">http://arxiv.org/abs/2005.07093</a>.</p>
</div>
<div id="ref-2013-baldi">
<p>Baldi, Pierre, and Peter J Sadowski. 2013. “Understanding Dropout.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 26:2814–22. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">https://proceedings.neurips.cc/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf</a>.</p>
</div>
<div id="ref-2019-bartoldson">
<p>Bartoldson, Brian R., Ari S. Morcos, Adrian Barbu, and Gordon Erlebacher. 2020. “The Generalization-Stability Tradeoff in Neural Network Pruning.” <a href="http://arxiv.org/abs/1906.03728">http://arxiv.org/abs/1906.03728</a>.</p>
</div>
<div id="ref-2020-basu">
<p>Basu, Debraj, Deepesh Data, Can Karakus, and Suhas N Diggavi. 2020. “Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations.” <em>IEEE Journal on Selected Areas in Information Theory</em> 1 (1): 217–26. <a href="http://arxiv.org/abs/1906.02367">http://arxiv.org/abs/1906.02367</a>.</p>
</div>
<div id="ref-2018-baykal">
<p>Baykal, Cenk, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. 2018. “Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds.” <em>arXiv Preprint arXiv:1804.05345</em>.</p>
</div>
<div id="ref-ista">
<p>Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” <em>SIAM J. Img. Sci.</em> 2 (1): 183–202. <a href="https://doi.org/10.1137/080716542">https://doi.org/10.1137/080716542</a>.</p>
</div>
<div id="ref-2018-bellec">
<p>Bellec, Guillaume, David Kappel, Wolfgang Maass, and Robert Legenstein. 2018. “Deep Rewiring: Training Very Sparse Deep Networks.” <a href="http://arxiv.org/abs/1711.05136">http://arxiv.org/abs/1711.05136</a>.</p>
</div>
<div id="ref-beltagy2020longformer">
<p>Beltagy, Iz, Matthew E. Peters, and Arman Cohan. 2020. “Longformer: The Long-Document Transformer.” <a href="http://arxiv.org/abs/2004.05150">http://arxiv.org/abs/2004.05150</a>.</p>
</div>
<div id="ref-bengio2016conditional">
<p>Bengio, Emmanuel, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. “Conditional Computation in Neural Networks for Faster Models.” <a href="http://arxiv.org/abs/1511.06297">http://arxiv.org/abs/1511.06297</a>.</p>
</div>
<div id="ref-2013-bengio">
<p>Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. 2013. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.” <a href="http://arxiv.org/abs/1308.3432">http://arxiv.org/abs/1308.3432</a>.</p>
</div>
<div id="ref-bennun2019modular">
<p>Ben-Nun, Tal, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. 2019. “A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.” <a href="http://arxiv.org/abs/1901.10183">http://arxiv.org/abs/1901.10183</a>.</p>
</div>
<div id="ref-bennun2018demystifying">
<p>Ben-Nun, Tal, and Torsten Hoefler. 2018. “Demystifying Parallel and Distributed Deep Learning: An in-Depth Concurrency Analysis.” <a href="http://arxiv.org/abs/1802.09941">http://arxiv.org/abs/1802.09941</a>.</p>
</div>
<div id="ref-betzel2017modular">
<p>Betzel, Richard F, John D Medaglia, Lia Papadopoulos, Graham L Baum, Ruben Gur, Raquel Gur, David Roalf, Theodore D Satterthwaite, and Danielle S Bassett. 2017. “The Modular Organization of Human Anatomical Brain Networks: Accounting for the Cost of Wiring.” <em>Network Neuroscience</em> 1 (1): 42–68.</p>
</div>
<div id="ref-2018-bianco">
<p>Bianco, Simone, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. “Benchmark Analysis of Representative Deep Neural Network Architectures.” <em>IEEE Access</em> 6: 64270–7. <a href="https://doi.org/10.1109/access.2018.2877890">https://doi.org/10.1109/access.2018.2877890</a>.</p>
</div>
<div id="ref-2020-blalock">
<p>Blalock, Davis, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. “What Is the State of Neural Network Pruning?” <a href="http://arxiv.org/abs/2003.03033">http://arxiv.org/abs/2003.03033</a>.</p>
</div>
<div id="ref-2017-bourely">
<p>Bourely, Alfred, John Patrick Boueri, and Krzysztof Choromonski. 2017. “Sparse Neural Networks Topologies.” <a href="http://arxiv.org/abs/1706.05683">http://arxiv.org/abs/1706.05683</a>.</p>
</div>
<div id="ref-gpt-3">
<p>Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In <em>Advances in Neural Information Processing Systems</em>. <a href="http://arxiv.org/abs/2005.14165">http://arxiv.org/abs/2005.14165</a>.</p>
</div>
<div id="ref-brutzkus2017sgd">
<p>Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2017. “SGD Learns over-Parameterized Networks That Provably Generalize on Linearly Separable Data.” <a href="http://arxiv.org/abs/1710.10174">http://arxiv.org/abs/1710.10174</a>.</p>
</div>
<div id="ref-713928">
<p>Burrascano, P. 1993. “A Pruning Technique Maximizing Generalization.” In <em>Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan)</em>, 1:347–50 vol.1. <a href="https://doi.org/10.1109/IJCNN.1993.713928">https://doi.org/10.1109/IJCNN.1993.713928</a>.</p>
</div>
<div id="ref-2018-carreira">
<p>Carreira-Perpinan, M. A., and Y. Idelbayev. 2018. “‘Learning-Compression’ Algorithms for Neural Net Pruning.” In <em>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, 8532–41. <a href="https://doi.org/10.1109/CVPR.2018.00890">https://doi.org/10.1109/CVPR.2018.00890</a>.</p>
</div>
<div id="ref-1997-castellano">
<p>Castellano, G., A. M. Fanelli, and M. Pelillo. 1997. “An Iterative Pruning Algorithm for Feedforward Neural Networks.” <em>IEEE Transactions on Neural Networks</em> 8 (3): 519–31. <a href="https://doi.org/10.1109/72.572092">https://doi.org/10.1109/72.572092</a>.</p>
</div>
<div id="ref-2000-castellano">
<p>Castellano, Giovanna, and Anna Maria Fanelli. 2000. “Variable Selection Using Neural-Network Models.” <em>Neurocomputing</em> 31 (1-4): 1–13.</p>
</div>
<div id="ref-2000-chandrasekaran">
<p>Chandrasekaran, Hema, Hung-Han Chen, and Michael T. Manry. 2000. “Pruning of Basis Functions in Nonlinear Approximators.” <em>Neurocomputing</em> 34 (1): 29–53. <a href="https://doi.org/10.1016/S0925-2312(00)00311-8">https://doi.org/10.1016/S0925-2312(00)00311-8</a>.</p>
</div>
<div id="ref-2017-changpinyo">
<p>Changpinyo, Soravit, Mark Sandler, and Andrey Zhmoginov. 2017. “The Power of Sparsity in Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1702.06257">http://arxiv.org/abs/1702.06257</a>.</p>
</div>
<div id="ref-2020-chao">
<p>Chao, Shih-Kang, Zhanyu Wang, Yue Xing, and Guang Cheng. 2020. “Directional Pruning of Deep Neural Networks.” <a href="http://arxiv.org/abs/2006.09358">http://arxiv.org/abs/2006.09358</a>.</p>
</div>
<div id="ref-1988-chauvin">
<p>Chauvin, Yves. 1989. “A Back-Propagation Algorithm with Optimal Use of Hidden Units.” In <em>Advances in Neural Information Processing Systems 1</em>, 519–26. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-im2col">
<p>Chellapilla, Kumar, Sidd Puri, and Patrice Simard. 2006. “High Performance Convolutional Neural Networks for Document Processing.” In.</p>
</div>
<div id="ref-2018-chen">
<p>Chen, Chia-Yu, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2017. “AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training.” In <em>32nd AAAI Conference on Artificial Intelligence</em>, 2827–35. <a href="http://arxiv.org/abs/1712.02679">http://arxiv.org/abs/1712.02679</a>.</p>
</div>
<div id="ref-2020-chen-rl">
<p>Chen, Jianda, Shangyu Chen, and Sinno Jialin Pan. 2020. “Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning.” <em>Advances in Neural Information Processing Systems</em> 33.</p>
</div>
<div id="ref-2020-chen">
<p>Chen, Tianlong, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. “The Lottery Ticket Hypothesis for Pre-Trained BERT Networks.” <a href="http://arxiv.org/abs/2007.12223">http://arxiv.org/abs/2007.12223</a>.</p>
</div>
<div id="ref-2016-chen">
<p>Chen, Y., T. Krishna, J. S. Emer, and V. Sze. 2017. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.” <em>IEEE Journal of Solid-State Circuits</em> 52 (1): 127–38. <a href="https://doi.org/10.1109/JSSC.2016.2616357">https://doi.org/10.1109/JSSC.2016.2616357</a>.</p>
</div>
<div id="ref-2019-chen">
<p>Chen, Yu-Hsin, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. “Eyeriss V2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices.” <a href="http://arxiv.org/abs/1807.07928">http://arxiv.org/abs/1807.07928</a>.</p>
</div>
<div id="ref-cheng2020survey">
<p>Cheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2020. “A Survey of Model Compression and Acceleration for Deep Neural Networks.” <a href="http://arxiv.org/abs/1710.09282">http://arxiv.org/abs/1710.09282</a>.</p>
</div>
<div id="ref-2019-abdellatif">
<p>Chérief-Abdellatif, Badr-Eddine. 2019. “Convergence Rates of Variational Inference in Sparse Deep Learning.” <a href="http://arxiv.org/abs/1908.04847">http://arxiv.org/abs/1908.04847</a>.</p>
</div>
<div id="ref-2014-chetlur">
<p>Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. “cuDNN: Efficient Primitives for Deep Learning.” <a href="http://arxiv.org/abs/1410.0759">http://arxiv.org/abs/1410.0759</a>.</p>
</div>
<div id="ref-child2019generating">
<p>Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. “Generating Long Sequences with Sparse Transformers.” <a href="http://arxiv.org/abs/1904.10509">http://arxiv.org/abs/1904.10509</a>.</p>
</div>
<div id="ref-2020-cho">
<p>Cho, Minsu, Ameya Joshi, and Chinmay Hegde. 2020. “ESPN: Extremely Sparse Pruned Networks.” <a href="http://arxiv.org/abs/2006.15741">http://arxiv.org/abs/2006.15741</a>.</p>
</div>
<div id="ref-choudhary2020comprehensive">
<p>Choudhary, Tejalal, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. “A Comprehensive Survey on Model Compression and Acceleration.” <em>Artificial Intelligence Review</em>, 1–43.</p>
</div>
<div id="ref-1996-cibas">
<p>Cibas, Tautvydas, Françoise Fogelman Soulié, Patrick Gallinari, and Sarunas Raudys. 1996. “Variable Selection with Neural Networks.” <em>Neurocomputing</em> 12 (2): 223–48. <a href="https://doi.org/10.1016/0925-2312(95)00121-2">https://doi.org/10.1016/0925-2312(95)00121-2</a>.</p>
</div>
<div id="ref-2016-cohen">
<p>Cohen, Joseph Paul, Henry Z. Lo, and Wei Ding. 2017. “RandomOut: Using a Convolutional Gradient Norm to Rescue Convolutional Filters.” <a href="http://arxiv.org/abs/1602.05931">http://arxiv.org/abs/1602.05931</a>.</p>
</div>
<div id="ref-2014-collins">
<p>Collins, Maxwell D., and Pushmeet Kohli. 2014. “Memory Bounded Deep Convolutional Networks.” <em>CoRR</em> abs/1412.1442. <a href="http://arxiv.org/abs/1412.1442">http://arxiv.org/abs/1412.1442</a>.</p>
</div>
<div id="ref-2019-correia">
<p>Correia, Gonçalo M, Vlad Niculae, and André FT Martins. 2019. “Adaptively Sparse Transformers.” In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>. <a href="http://arxiv.org/abs/1909.00015">http://arxiv.org/abs/1909.00015</a>.</p>
</div>
<div id="ref-cosentino2019search">
<p>Cosentino, Justin, Federico Zaiter, Dan Pei, and Jun Zhu. 2019. “The Search for Sparse, Robust Neural Networks.” <a href="http://arxiv.org/abs/1912.02386">http://arxiv.org/abs/1912.02386</a>.</p>
</div>
<div id="ref-2019-cui">
<p>Cui, Baiyun, Yingming Li, Ming Chen, and Zhongfei Zhang. 2019. “Fine-Tune BERT with Sparse Self-Attention Mechanism.” In <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>, 3539–44.</p>
</div>
<div id="ref-2018-dai">
<p>Dai, Bin, Chen Zhu, and David Wipf. 2018. “Compressing Neural Networks Using the Variational Information Bottleneck.” <a href="http://arxiv.org/abs/1802.10399">http://arxiv.org/abs/1802.10399</a>.</p>
</div>
<div id="ref-2019-dai">
<p>Dai, Xiaoliang, Hongxu Yin, and Niraj K. Jha. 2018. “NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm.” <a href="http://arxiv.org/abs/1711.02017">http://arxiv.org/abs/1711.02017</a>.</p>
</div>
<div id="ref-2020-dascoli">
<p>d’Ascoli, Stéphane, Levent Sagun, Joan Bruna, and Giulio Biroli. 2020. “Finding the Needle in the Haystack with Convolutions: On the Benefits of Architectural Bias.” <a href="http://arxiv.org/abs/1906.06766">http://arxiv.org/abs/1906.06766</a>.</p>
</div>
<div id="ref-dave2020hardware">
<p>Dave, Shail, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. 2020. “Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights.” <a href="http://arxiv.org/abs/2007.00864">http://arxiv.org/abs/2007.00864</a>.</p>
</div>
<div id="ref-2020-davies">
<p>Davies, Peter, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. 2020. “Distributed Variance Reduction with Optimal Communication.” <a href="http://arxiv.org/abs/2002.09268">http://arxiv.org/abs/2002.09268</a>.</p>
</div>
<div id="ref-9043731">
<p>Deng, L., G. Li, S. Han, L. Shi, and Y. Xie. 2020. “Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey.” <em>Proceedings of the IEEE</em> 108 (4): 485–532. <a href="https://doi.org/10.1109/JPROC.2020.2976475">https://doi.org/10.1109/JPROC.2020.2976475</a>.</p>
</div>
<div id="ref-denil2014predicting">
<p>Denil, Misha, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. 2014. “Predicting Parameters in Deep Learning.” <a href="http://arxiv.org/abs/1306.0543">http://arxiv.org/abs/1306.0543</a>.</p>
</div>
<div id="ref-2014-denton">
<p>Denton, Emily L, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. “Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation.” <em>Advances in Neural Information Processing Systems</em> 27: 1269–77.</p>
</div>
<div id="ref-2019-dettmers">
<p>Dettmers, Tim, and Luke Zettlemoyer. 2019. “Sparse Networks from Scratch: Faster Training Without Losing Performance.” <a href="http://arxiv.org/abs/1907.04840">http://arxiv.org/abs/1907.04840</a>.</p>
</div>
<div id="ref-de2017ultrastructural">
<p>De Vivo, Luisa, Michele Bellesi, William Marshall, Eric A Bushong, Mark H Ellisman, Giulio Tononi, and Chiara Cirelli. 2017. “Ultrastructural Evidence for Synaptic Scaling Across the Wake/Sleep Cycle.” <em>Science</em> 355 (6324): 507–10.</p>
</div>
<div id="ref-2019-devlin">
<p>Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</em>, 4171–86.</p>
</div>
<div id="ref-2018-dey">
<p>Dey, S., K. Huang, P. A. Beerel, and K. M. Chugg. 2019. “Pre-Defined Sparse Neural Networks with Hardware Acceleration.” <em>IEEE Journal on Emerging and Selected Topics in Circuits and Systems</em> 9 (2): 332–45. <a href="https://doi.org/10.1109/JETCAS.2019.2910864">https://doi.org/10.1109/JETCAS.2019.2910864</a>.</p>
</div>
<div id="ref-diering2017homer1a">
<p>Diering, Graham H, Raja S Nirujogi, Richard H Roth, Paul F Worley, Akhilesh Pandey, and Richard L Huganir. 2017. “Homer1a Drives Homeostatic Scaling-down of Excitatory Synapses During Sleep.” <em>Science</em> 355 (6324): 511–15.</p>
</div>
<div id="ref-2019-ding-c-sgd">
<p>Ding, Xiaohan, Guiguang Ding, Yuchen Guo, and Jungong Han. 2019. “Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure.” <a href="http://arxiv.org/abs/1904.03837">http://arxiv.org/abs/1904.03837</a>.</p>
</div>
<div id="ref-2019-ding">
<p>Ding, Xiaohan, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. 2019. “Global Sparse Momentum SGD for Pruning Very Deep Neural Networks.” <a href="http://arxiv.org/abs/1909.12778">http://arxiv.org/abs/1909.12778</a>.</p>
</div>
<div id="ref-2005-dolan">
<p>Dolan, William B, and Chris Brockett. 2005. “Automatically Constructing a Corpus of Sentential Paraphrases.” In <em>Proceedings of the Third International Workshop on Paraphrasing (IWP2005)</em>.</p>
</div>
<div id="ref-domingos2020model">
<p>Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” <a href="http://arxiv.org/abs/2012.00152">http://arxiv.org/abs/2012.00152</a>.</p>
</div>
<div id="ref-2019-dong">
<p>Dong, Xiao, Lei Liu, Guangli Li, Jiansong Li, Peng Zhao, Xueying Wang, and Xiaobing Feng. 2019. “Exploiting the Input Sparsity to Accelerate Deep Neural Networks: Poster.” In <em>Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019</em>, 401–2. <a href="https://doi.org/10.1145/3293883.3295713">https://doi.org/10.1145/3293883.3295713</a>.</p>
</div>
<div id="ref-2017-dong">
<p>Dong, Xin, Shangyu Chen, and Sinno Jialin Pan. 2017. “Learning to Prune Deep Neural Networks via Layer-Wise Optimal Brain Surgeon.” <a href="http://arxiv.org/abs/1705.07565">http://arxiv.org/abs/1705.07565</a>.</p>
</div>
<div id="ref-2020-dosovitskiy">
<p>Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In <em>Proceedings of the Ninth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/2010.11929">http://arxiv.org/abs/2010.11929</a>.</p>
</div>
<div id="ref-2016-dryden">
<p>Dryden, Nikoli, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. 2016. “Communication Quantization for Data-Parallel Training of Deep Neural Networks.” In <em>2nd Workshop on Machine Learning in HPC Environments (MLHPC)</em>, 1–8.</p>
</div>
<div id="ref-du2019gradient">
<p>Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. “Gradient Descent Provably Optimizes over-Parameterized Neural Networks.” <a href="http://arxiv.org/abs/1810.02054">http://arxiv.org/abs/1810.02054</a>.</p>
</div>
<div id="ref-2020-dutta">
<p>Dutta, Aritra, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. 2020. “On the Discrepancy Between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning.” In <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 34:3817–24. 04. <a href="http://arxiv.org/abs/1911.08250">http://arxiv.org/abs/1911.08250</a>.</p>
</div>
<div id="ref-2020-elsen">
<p>Elsen, Erich, Marat Dukhan, Trevor Gale, and Karen Simonyan. 2019. “Fast Sparse Convnets.” <a href="http://arxiv.org/abs/1911.09723">http://arxiv.org/abs/1911.09723</a>.</p>
</div>
<div id="ref-elsken2019neural">
<p>Elsken, Thomas, Jan Hendrik Metzen, and Frank Hutter. 2019. “Neural Architecture Search: A Survey.” <a href="http://arxiv.org/abs/1808.05377">http://arxiv.org/abs/1808.05377</a>.</p>
</div>
<div id="ref-1995-engelbrecht">
<p>Engelbrecht, Andries Petrus, Ian Cloete, and Jacek M Zurada. 1995. “Determining the Significance of Input Parameters Using Sensitivity Analysis.” In <em>International Workshop on Artificial Neural Networks</em>, 382–88. Springer.</p>
</div>
<div id="ref-2001-engelbrecht">
<p>Engelbrecht, A. P. 2001. “A New Pruning Heuristic Based on Variance Analysis of Sensitivity Information.” <em>IEEE Transactions on Neural Networks</em> 12 (6): 1386–99. <a href="https://doi.org/10.1109/72.963775">https://doi.org/10.1109/72.963775</a>.</p>
</div>
<div id="ref-1996-engelbrecht">
<p>Engelbrecht, A. P., and I. Cloete. 1996. “A Sensitivity Analysis Algorithm for Pruning Feedforward Neural Networks.” In <em>Proceedings of International Conference on Neural Networks (ICNN’96)</em>, 2:1274–8 vol.2. <a href="https://doi.org/10.1109/ICNN.1996.549081">https://doi.org/10.1109/ICNN.1996.549081</a>.</p>
</div>
<div id="ref-2020-evci">
<p>Evci, Utku, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. “Rigging the Lottery: Making All Tickets Winners.” <a href="http://arxiv.org/abs/1911.11134">http://arxiv.org/abs/1911.11134</a>.</p>
</div>
<div id="ref-2020-evci-gradient-flow">
<p>Evci, Utku, Yani A. Ioannou, Cem Keskin, and Yann Dauphin. 2020. “Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win.” <a href="http://arxiv.org/abs/2010.03533">http://arxiv.org/abs/2010.03533</a>.</p>
</div>
<div id="ref-2020-evci-difficult">
<p>Evci, Utku, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. 2020. “The Difficulty of Training Sparse Neural Networks.” <a href="http://arxiv.org/abs/1906.10732">http://arxiv.org/abs/1906.10732</a>.</p>
</div>
<div id="ref-2020-fan">
<p>Fan, Angela, Edouard Grave, and Armand Joulin. 2020. “Reducing Transformer Depth on Demand with Structured Dropout.” In <em>Proceedings of the Eighth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1909.11556">http://arxiv.org/abs/1909.11556</a>.</p>
</div>
<div id="ref-2021-fedus">
<p>Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” <a href="http://arxiv.org/abs/2101.03961">http://arxiv.org/abs/2101.03961</a>.</p>
</div>
<div id="ref-1992-finnoff">
<p>Finnoff, William, Ferdinand Hergert, and Hans Georg Zimmermann. 1993. “Improving Model Selection by Nonconvergent Methods.” <em>Neural Networks</em> 6 (6): 771–83.</p>
</div>
<div id="ref-1998-fletcher">
<p>Fletcher, L., V. Katkovnik, F. E. Steffens, and A. P. Engelbrecht. 1998. “Optimizing the Number of Hidden Nodes of a Feedforward Artificial Neural Network.” In <em>1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227)</em>, 2:1608–12 vol.2. <a href="https://doi.org/10.1109/IJCNN.1998.686018">https://doi.org/10.1109/IJCNN.1998.686018</a>.</p>
</div>
<div id="ref-2019-frankle">
<p>Frankle, Jonathan, and Michael Carbin. 2019. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.” <a href="http://arxiv.org/abs/1803.03635">http://arxiv.org/abs/1803.03635</a>.</p>
</div>
<div id="ref-2020-frankle-linear">
<p>Frankle, Jonathan, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2020a. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” <a href="http://arxiv.org/abs/1912.05671">http://arxiv.org/abs/1912.05671</a>.</p>
</div>
<div id="ref-2019-frankle-b">
<p>———. 2020b. “Stabilizing the Lottery Ticket Hypothesis.” <a href="http://arxiv.org/abs/1903.01611">http://arxiv.org/abs/1903.01611</a>.</p>
</div>
<div id="ref-2020-frankle-missing">
<p>———. 2021. “Pruning Neural Networks at Initialization: Why Are We Missing the Mark?” <a href="http://arxiv.org/abs/2009.08576">http://arxiv.org/abs/2009.08576</a>.</p>
</div>
<div id="ref-2020-frankle-early">
<p>Frankle, Jonathan, David J. Schwab, and Ari S. Morcos. 2020. “The Early Phase of Neural Network Training.” <a href="http://arxiv.org/abs/2002.10365">http://arxiv.org/abs/2002.10365</a>.</p>
</div>
<div id="ref-sparse_group_lasso">
<p>Friedman, J., T. Hastie, and R. Tibshirani. 2010. “A Note on the Group Lasso and a Sparse Group Lasso.” <a href="http://arxiv.org/abs/1001.0736">http://arxiv.org/abs/1001.0736</a>.</p>
</div>
<div id="ref-karl_hierarchical_models">
<p>Friston, K.J. 2008. “Hierarchical Models in the Brain.” <em>PLOS Computational Biology</em> 4 (11): e1000211. <a href="https://doi.org/10.1371/journal.pcbi.1000211">https://doi.org/10.1371/journal.pcbi.1000211</a>.</p>
</div>
<div id="ref-2019-gaier">
<p>Gaier, Adam, and David Ha. 2019. “Weight Agnostic Neural Networks.” <a href="http://arxiv.org/abs/1906.04358">http://arxiv.org/abs/1906.04358</a>.</p>
</div>
<div id="ref-2016-gal">
<p>Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In <em>Proceedings of the 33rd International Conference on Machine Learning</em>, edited by Maria Florina Balcan and Kilian Q. Weinberger, 48:1050–9. Proceedings of Machine Learning Research. New York, New York, USA: PMLR. <a href="http://proceedings.mlr.press/v48/gal16.html">http://proceedings.mlr.press/v48/gal16.html</a>.</p>
</div>
<div id="ref-2017-gal">
<p>Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:3581–90. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2017/file/84ddfb34126fc3a48ee38d7044e87276-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/84ddfb34126fc3a48ee38d7044e87276-Paper.pdf</a>.</p>
</div>
<div id="ref-2019-gale">
<p>Gale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of Sparsity in Deep Neural Networks.” <a href="http://arxiv.org/abs/1902.09574">http://arxiv.org/abs/1902.09574</a>.</p>
</div>
<div id="ref-2020-gale">
<p>Gale, Trevor, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. “Sparse GPU Kernels for Deep Learning.” <a href="http://arxiv.org/abs/2006.10901">http://arxiv.org/abs/2006.10901</a>.</p>
</div>
<div id="ref-2020-ganesh">
<p>Ganesh, Prakhar, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. “Compressing Large-Scale Transformer-Based Models: A Case Study on BERT.” <a href="http://arxiv.org/abs/2002.11985">http://arxiv.org/abs/2002.11985</a>.</p>
</div>
<div id="ref-ge2011note">
<p>Ge, Dongdong, Xiaoye Jiang, and Yinyu Ye. 2011. “A Note on the Complexity of Lp Minimization.” <em>Mathematical Programming</em> 129 (2): 285–99.</p>
</div>
<div id="ref-2019-georgiadis">
<p>Georgiadis, Georgios. 2019. “Accelerating Convolutional Neural Networks via Activation Map Compression.” In <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>, 7085–95.</p>
</div>
<div id="ref-2018-ghiasi">
<p>Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V Le. 2018. “DropBlock: A Regularization Method for Convolutional Networks.” In <em>Advances in Neural Information Processing Systems</em>, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 31:10727–37. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2018/file/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Paper.pdf">https://proceedings.neurips.cc/paper/2018/file/7edcfb2d8f6a659ef4cd1e6c9b6d7079-Paper.pdf</a>.</p>
</div>
<div id="ref-1995-ghosh">
<p>Ghosh, Joydeep, and Kagan Tumer. 1994. “Structural Adaptation and Generalization in Supervised Feed-Forward Networks.” <em>J. Artif. Neural Netw.</em> 1 (4): 431–58.</p>
</div>
<div id="ref-2010-glorot-init">
<p>Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In <em>AISTATS</em>, edited by Yee Whye Teh and D. Mike Titterington, 9:249–56. JMLR Proceedings. JMLR.org. <a href="http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GlorotB10">http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GlorotB10</a>.</p>
</div>
<div id="ref-2011-glorot">
<p>Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011a. “Deep Sparse Rectifier Neural Networks.” In <em>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</em>, 315–23.</p>
</div>
<div id="ref-glorot2011deep">
<p>———. 2011b. “Deep Sparse Rectifier Neural Networks.” In <em>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</em>, 315–23.</p>
</div>
<div id="ref-2019-golub">
<p>Golub, Maximilian, Guy Lemieux, and Mieszko Lis. 2019. “Full Deep Neural Network Training on a Pruned Weight Budget.” <a href="http://arxiv.org/abs/1806.06949">http://arxiv.org/abs/1806.06949</a>.</p>
</div>
<div id="ref-2019-gomez">
<p>Gomez, Aidan N., Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. 2019. “Learning Sparse Networks Using Targeted Dropout.” <a href="http://arxiv.org/abs/1905.13678">http://arxiv.org/abs/1905.13678</a>.</p>
</div>
<div id="ref-2020-gondimalla">
<p>Gondimalla, Ashish, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. “SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks.” In <em>Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture</em>, 151–65. MICRO ’52. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3352460.3358291">https://doi.org/10.1145/3352460.3358291</a>.</p>
</div>
<div id="ref-goodfellow2014generative">
<p>Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” <a href="http://arxiv.org/abs/1406.2661">http://arxiv.org/abs/1406.2661</a>.</p>
</div>
<div id="ref-2014-goodfellow">
<p>Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In <em>Advances in Neural Information Processing Systems</em>, 2672–80. <a href="http://arxiv.org/abs/1406.2661">http://arxiv.org/abs/1406.2661</a>.</p>
</div>
<div id="ref-gopalakrishnan2018combating">
<p>Gopalakrishnan, Soorya, Zhinus Marzi, Upamanyu Madhow, and Ramtin Pedarsani. 2018. “Combating Adversarial Attacks Using Sparse Representations.” <a href="http://arxiv.org/abs/1803.03880">http://arxiv.org/abs/1803.03880</a>.</p>
</div>
<div id="ref-2018-gordon">
<p>Gordon, Ariel, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. 2018. “MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks.” In <em>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</em>, 1586–95.</p>
</div>
<div id="ref-2020-gordon">
<p>Gordon, Mitchell A., Kevin Duh, and Nicholas Andrews. 2020. “Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning.” In <em>Proceedings of the 5th Workshop on Representation Learning for NLP</em>, 143–55. <a href="http://arxiv.org/abs/2002.08307">http://arxiv.org/abs/2002.08307</a>.</p>
</div>
<div id="ref-2020-groenquist">
<p>Grönquist, Peter, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, and Torsten Hoefler. 2020. “Deep Learning for Post-Processing Ensemble Weather Forecasts.” <a href="http://arxiv.org/abs/2005.08748">http://arxiv.org/abs/2005.08748</a>.</p>
</div>
<div id="ref-UsingAdvancedMPI">
<p>Gropp, William, Torsten Hoefler, Rajeev Thakur, and E. Lusk. 2014. <em>Using Advanced MPI: Modern Features of the Message-Passing Interface</em>. Cambridge, MA: MIT Press.</p>
</div>
<div id="ref-gropp-datatype-performance">
<p>Gropp, William, Torsten Hoefler, Rajeev Thakur, and Jesper Larsson Träff. 2011. “Performance Expectations and Guidelines for MPI Derived Datatypes.” In <em>Recent Advances in the Message Passing Interface (EuroMPI’11)</em>, 6960:150–59. Santorini, Greece: Springer.</p>
</div>
<div id="ref-mdl">
<p>Grunwald, Peter. 2004. “A Tutorial Introduction to the Minimum Description Length Principle.” <a href="http://arxiv.org/abs/math/0406077">http://arxiv.org/abs/math/0406077</a>.</p>
</div>
<div id="ref-2007-grunwald">
<p>Grünwald, Peter D. 2007. <em>The Minimum Description Length Principle</em>. MIT Press.</p>
</div>
<div id="ref-grunwald2007minimum">
<p>Grünwald, Peter D, and Abhijit Grunwald. 2007. <em>The Minimum Description Length Principle</em>. MIT Press.</p>
</div>
<div id="ref-2018-gudovskiy">
<p>Gudovskiy, Denis, Alec Hodgkinson, and Luca Rigazio. 2018. “DNN Feature Map Compression Using Learned Representation over GF(2).” In <em>Proceedings of the European Conference on Computer Vision (ECCV)</em>.</p>
</div>
<div id="ref-2020-guerra">
<p>Guerra, Luis, Bohan Zhuang, Ian Reid, and Tom Drummond. 2020. “Automatic Pruning for Quantized Neural Networks.” <a href="http://arxiv.org/abs/2002.00523">http://arxiv.org/abs/2002.00523</a>.</p>
</div>
<div id="ref-2020-guo">
<p>Guo, Demi, Alexander M. Rush, and Yoon Kim. 2020. “Parameter-Efficient Transfer Learning with Diff Pruning.” <a href="http://arxiv.org/abs/2012.07463">http://arxiv.org/abs/2012.07463</a>.</p>
</div>
<div id="ref-2019-guo">
<p>Guo, Fu-Ming, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019. “Reweighted Proximal Pruning for Large-Scale Language Representation.” <a href="http://arxiv.org/abs/1909.12486">http://arxiv.org/abs/1909.12486</a>.</p>
</div>
<div id="ref-guo2019startransformer">
<p>Guo, Qipeng, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. “Star-Transformer.” In <em>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</em>, 1315–25. <a href="http://arxiv.org/abs/1902.09113">http://arxiv.org/abs/1902.09113</a>.</p>
</div>
<div id="ref-2016-guo">
<p>Guo, Yiwen, Anbang Yao, and Yurong Chen. 2016. “Dynamic Network Surgery for Efficient DNNs.” <a href="http://arxiv.org/abs/1608.04493">http://arxiv.org/abs/1608.04493</a>.</p>
</div>
<div id="ref-guo2018sparse">
<p>Guo, Yiwen, Chao Zhang, Changshui Zhang, and Yurong Chen. 2018. “Sparse DNNs with Improved Adversarial Robustness.” In <em>Advances in Neural Information Processing Systems</em>, 242–51.</p>
</div>
<div id="ref-2020-gupta">
<p>Gupta, Manish, and Puneet Agrawal. 2020. “Compression of Deep Learning Models for Text: A Survey.” <a href="http://arxiv.org/abs/2008.05221">http://arxiv.org/abs/2008.05221</a>.</p>
</div>
<div id="ref-2019-gupta">
<p>Gupta, Udit, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 2019. “MASR: A Modular Accelerator for Sparse RNNs.” <a href="http://arxiv.org/abs/1908.08976">http://arxiv.org/abs/1908.08976</a>.</p>
</div>
<div id="ref-1993-hagiwara">
<p>Hagiwara, Masafumi. 1993. “Removal of Hidden Units and Weights for Back Propagation Networks.” In <em>Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan)</em>, 1:351–54. IEEE.</p>
</div>
<div id="ref-1994-hagiwara">
<p>———. 1994. “A Simple and Effective Method for Removal of Hidden Units and Weights.” <em>Neurocomputing</em> 6 (2): 207–18. <a href="https://doi.org/10.1016/0925-2312(94)90055-8">https://doi.org/10.1016/0925-2312(94)90055-8</a>.</p>
</div>
<div id="ref-2012-han">
<p>Han, Hong-Gui, and Jun-Fei Qiao. 2013. “A Structure Optimisation Algorithm for Feedforward Neural Network Construction.” <em>Neurocomputing</em> 99: 347–57.</p>
</div>
<div id="ref-2016-han-ese">
<p>Han, Song, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, et al. 2017. “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.” <a href="http://arxiv.org/abs/1612.00694">http://arxiv.org/abs/1612.00694</a>.</p>
</div>
<div id="ref-2016-han-eie">
<p>Han, Song, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.” <a href="http://arxiv.org/abs/1602.01528">http://arxiv.org/abs/1602.01528</a>.</p>
</div>
<div id="ref-2015-han">
<p>Han, Song, Huizi Mao, and William J. Dally. 2016. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” <a href="http://arxiv.org/abs/1510.00149">http://arxiv.org/abs/1510.00149</a>.</p>
</div>
<div id="ref-2017-han">
<p>Han, Song, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, et al. 2017. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks.” <a href="http://arxiv.org/abs/1607.04381">http://arxiv.org/abs/1607.04381</a>.</p>
</div>
<div id="ref-2015-han-learning">
<p>Han, Song, Jeff Pool, John Tran, and William Dally. 2015. “Learning Both Weights and Connections for Efficient Neural Network.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:1135–43. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf">https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf</a>.</p>
</div>
<div id="ref-1994-hansen">
<p>Hansen, Lars Kai, and others. 1994. “Controlled Growth of Cascade Correlation Nets.” In <em>International Conference on Artificial Neural Networks</em>, 797–800. Springer.</p>
</div>
<div id="ref-NIPS1988_1c9ac015">
<p>Hanson, Stephen, and Lorien Pratt. 1989. “Comparing Biases for Minimal Network Construction with Back-Propagation.” In <em>Advances in Neural Information Processing Systems</em>, edited by D. Touretzky, 1:177–85. Morgan-Kaufmann. <a href="https://proceedings.neurips.cc/paper/1988/file/1c9ac0159c94d8d0cbedc973445af2da-Paper.pdf">https://proceedings.neurips.cc/paper/1988/file/1c9ac0159c94d8d0cbedc973445af2da-Paper.pdf</a>.</p>
</div>
<div id="ref-1992-hassibi">
<p>Hassibi, Babak, and David G. Stork. 1992. “Second Order Derivatives for Network Pruning: Optimal Brain Surgeon.” In <em>Advances in Neural Information Processing Systems 5, [NIPS Conference]</em>, 164–71. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2017-hawkins">
<p>Hawkins, J. 2017. “Special Report: Can We Copy the Brain? What Intelligent Machines Need to Learn from the Neocortex.” <em>IEEE Spectrum</em> 54 (6): 34–71. <a href="https://doi.org/10.1109/MSPEC.2017.7934229">https://doi.org/10.1109/MSPEC.2017.7934229</a>.</p>
</div>
<div id="ref-2020-hayou">
<p>Hayou, Soufiane, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2020. “Pruning Untrained Neural Networks: Principles and Analysis.” <a href="http://arxiv.org/abs/2002.08797">http://arxiv.org/abs/2002.08797</a>.</p>
</div>
<div id="ref-2015-he-init">
<p>He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification.” <a href="http://arxiv.org/abs/1502.01852">http://arxiv.org/abs/1502.01852</a>.</p>
</div>
<div id="ref-2017-he-mask">
<p>He, K., G. Gkioxari, P. Dollár, and R. Girshick. 2017. “Mask R-CNN.” In <em>2017 IEEE International Conference on Computer Vision (ICCV)</em>, 2980–8. <a href="https://doi.org/10.1109/ICCV.2017.322">https://doi.org/10.1109/ICCV.2017.322</a>.</p>
</div>
<div id="ref-2016-he">
<p>He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” In <em>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 770–78.</p>
</div>
<div id="ref-2019-he-fpgm">
<p>He, Yang, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. “Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration.” <a href="http://arxiv.org/abs/1811.00250">http://arxiv.org/abs/1811.00250</a>.</p>
</div>
<div id="ref-2019-he">
<p>He, Yihui, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2019. “AMC: AutoML for Model Compression and Acceleration on Mobile Devices.” <a href="http://arxiv.org/abs/1802.03494">http://arxiv.org/abs/1802.03494</a>.</p>
</div>
<div id="ref-2017-he">
<p>He, Yihui, Xiangyu Zhang, and Jian Sun. 2017. “Channel Pruning for Accelerating Very Deep Neural Networks.” <a href="http://arxiv.org/abs/1707.06168">http://arxiv.org/abs/1707.06168</a>.</p>
</div>
<div id="ref-hebb-organization-of-behavior-1949">
<p>Hebb, Donald O. 1949. <em>The Organization of Behavior: A Neuropsychological Theory</em>. New York: Wiley.</p>
</div>
<div id="ref-2020-hegde">
<p>Hegde, Kartik, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. “ExTensor: An Accelerator for Sparse Tensor Algebra.” In <em>Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture</em>, 319–33. MICRO ’52. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3352460.3358275">https://doi.org/10.1145/3352460.3358275</a>.</p>
</div>
<div id="ref-2019-hendrycks-imagenetc">
<p>Hendrycks, Dan, and Thomas Dietterich. 2019. “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations.” In <em>Proceedings of the Seventh International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1903.12261">http://arxiv.org/abs/1903.12261</a>.</p>
</div>
<div id="ref-2019-hendrycks-imageneta">
<p>Hendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2019. “Natural Adversarial Examples.” <a href="http://arxiv.org/abs/1907.07174">http://arxiv.org/abs/1907.07174</a>.</p>
</div>
<div id="ref-Herculano-Houzel19008">
<p>Herculano-Houzel, Suzana, Bruno Mota, Peiyan Wong, and Jon H. Kaas. 2010. “Connectivity-Driven White Matter Scaling and Folding in Primate Cerebral Cortex.” <em>Proceedings of the National Academy of Sciences</em> 107 (44): 19008–13. <a href="https://doi.org/10.1073/pnas.1012590107">https://doi.org/10.1073/pnas.1012590107</a>.</p>
</div>
<div id="ref-2017-hill">
<p>Hill, P., A. Jain, M. Hill, B. Zamirai, C. Hsu, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars. 2017. “DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-Compute Data Fission.” In <em>2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</em>, 786–99.</p>
</div>
<div id="ref-2012-hinton">
<p>Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” <a href="http://arxiv.org/abs/1207.0580">http://arxiv.org/abs/1207.0580</a>.</p>
</div>
<div id="ref-1993-hinton">
<p>Hinton, Geoffrey E, and Drew Van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In <em>Proceedings of the Sixth Annual Conference on Computational Learning Theory</em>, 5–13.</p>
</div>
<div id="ref-hinton2015distilling">
<p>Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” <a href="http://arxiv.org/abs/1503.02531">http://arxiv.org/abs/1503.02531</a>.</p>
</div>
<div id="ref-benchmarking">
<p>Hoefler, Torsten, and Roberto Belli. 2015. “Scientific Benchmarking of Parallel Computing Systems.” In, 73:1–73:12. Austin, TX, USA: ACM.</p>
</div>
<div id="ref-2019-hooker">
<p>Hooker, Sara, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. 2019. “What Do Compressed Deep Neural Networks Forget?” <a href="http://arxiv.org/abs/1911.05248">http://arxiv.org/abs/1911.05248</a>.</p>
</div>
<div id="ref-2020-hooker">
<p>Hooker, Sara, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. 2020. “Characterising Bias in Compressed Models.” <a href="http://arxiv.org/abs/2010.03058">http://arxiv.org/abs/2010.03058</a>.</p>
</div>
<div id="ref-howard2017mobilenets">
<p>Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” <a href="http://arxiv.org/abs/1704.04861">http://arxiv.org/abs/1704.04861</a>.</p>
</div>
<div id="ref-hoyer2004non">
<p>Hoyer, Patrik O. 2004. “Non-Negative Matrix Factorization with Sparseness Constraints.” <em>Journal of Machine Learning Research</em> 5 (Nov): 1457–69.</p>
</div>
<div id="ref-2016-hu">
<p>Hu, Hengyuan, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. “Network Trimming: A Data-Driven Neuron Pruning Approach Towards Efficient Deep Architectures.” <a href="http://arxiv.org/abs/1607.03250">http://arxiv.org/abs/1607.03250</a>.</p>
</div>
<div id="ref-2020-hu">
<p>Hu, Yuwei, Zihao Ye, Minjie Wang, Jiali Yu, Da Zheng, Mu Li, Zheng Zhang, Zhiru Zhang, and Yida Wang. 2020. “FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems.” In <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>. SC ’20. Atlanta, Georgia: IEEE Press.</p>
</div>
<div id="ref-2016-huang">
<p>Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In <em>Computer Vision – ECCV 2016</em>, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Cham: Springer International Publishing.</p>
</div>
<div id="ref-2018-huang">
<p>Huang, Zehao, and Naiyan Wang. 2018. “Data-Driven Sparse Structure Selection for Deep Neural Networks.” <a href="http://arxiv.org/abs/1707.01213">http://arxiv.org/abs/1707.01213</a>.</p>
</div>
<div id="ref-2019-huang">
<p>Huang, Ziyue, Wang Yilei, Ke Yi, and others. 2019. “Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation.” In <em>Advances in Neural Information Processing Systems</em>, 6371–81.</p>
</div>
<div id="ref-hubara-bin">
<p>Hubara, Itay, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. “Binarized Neural Networks.” In <em>Proceedings of the 30th International Conference on Neural Information Processing Systems</em>, 4114–22. NIPS’16. Red Hook, NY, USA: Curran Associates Inc.</p>
</div>
<div id="ref-iandola2016squeezenet">
<p>Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.” <a href="http://arxiv.org/abs/1602.07360">http://arxiv.org/abs/1602.07360</a>.</p>
</div>
<div id="ref-2015-ioffe">
<p>Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” <a href="http://arxiv.org/abs/1502.03167">http://arxiv.org/abs/1502.03167</a>.</p>
</div>
<div id="ref-ivanov2020data">
<p>Ivanov, Andrei, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. “Data Movement Is All You Need: A Case Study on Optimizing Transformers.” <a href="http://arxiv.org/abs/2007.00072">http://arxiv.org/abs/2007.00072</a>.</p>
</div>
<div id="ref-2019-ivkin">
<p>Ivkin, Nikita, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, and others. 2019. “Communication-Efficient Distributed SGD with Sketching.” In <em>Advances in Neural Information Processing Systems</em>, 13144–54. <a href="http://arxiv.org/abs/1903.04488">http://arxiv.org/abs/1903.04488</a>.</p>
</div>
<div id="ref-jacobs1991adaptive">
<p>Jacobs, Robert A, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. “Adaptive Mixtures of Local Experts.” <em>Neural Computation</em> 3 (1): 79–87.</p>
</div>
<div id="ref-jan2019iwslt">
<p>Jan, Niehues, Roldano Cattoni, Stuker Sebastian, Matteo Negri, Marco Turchi, Salesky Elizabeth, Sanabria Ramon, Barrault Loic, Specia Lucia, and Marcello Federico. 2019. “The IWSLT 2019 Evaluation Campaign.” In <em>16th International Workshop on Spoken Language Translation 2019</em>.</p>
</div>
<div id="ref-1989-janowsky">
<p>Janowsky, Steven A. 1989. “Pruning Versus Clipping in Neural Networks.” <em>Physical Review A</em> 39 (12): 6600.</p>
</div>
<div id="ref-2020-jayakumar">
<p>Jayakumar, Siddhant, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. 2020. “Top-KAST: Top-K Always Sparse Training.” <em>Advances in Neural Information Processing Systems</em> 33.</p>
</div>
<div id="ref-2018-jiang">
<p>Jiang, Peng, and Gagan Agrawal. 2018. “A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication.” In <em>Advances in Neural Information Processing Systems</em>, 2525–36.</p>
</div>
<div id="ref-2019-jin">
<p>Jin, Sian, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, and Franck Cappello. 2019. “DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression.” In <em>Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing</em>, 159–70. HPDC ’19. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3307681.3326608">https://doi.org/10.1145/3307681.3326608</a>.</p>
</div>
<div id="ref-2016-jin">
<p>Jin, Xiaojie, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. 2016. “Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods.” <a href="http://arxiv.org/abs/1607.05423">http://arxiv.org/abs/1607.05423</a>.</p>
</div>
<div id="ref-jones2006cognitive">
<p>Jones, Sari, Lars Nyberg, Johan Sandblom, Anna Stigsdotter Neely, Martin Ingvar, Karl Magnus Petersson, and Lars Bäckman. 2006. “Cognitive and Neural Plasticity in Aging: General and Task-Specific Limitations.” <em>Neuroscience & Biobehavioral Reviews</em> 30 (6): 864–71.</p>
</div>
<div id="ref-jordan1994hierarchical">
<p>Jordan, Michael I, and Robert A Jacobs. 1994. “Hierarchical Mixtures of Experts and the EM Algorithm.” <em>Neural Computation</em> 6 (2): 181–214.</p>
</div>
<div id="ref-2020-jorge">
<p>Jorge, Pau de, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania. 2020. “Progressive Skeletonization: Trimming More Fat from a Network at Initialization.” <a href="http://arxiv.org/abs/2006.09081">http://arxiv.org/abs/2006.09081</a>.</p>
</div>
<div id="ref-kalchbrenner2018efficient">
<p>Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” <a href="http://arxiv.org/abs/1802.08435">http://arxiv.org/abs/1802.08435</a>.</p>
</div>
<div id="ref-1991-kameyama">
<p>Kameyama, K., and Y. Kosugi. 1991. “Automatic Fusion and Splitting of Artificial Neural Elements in Optimizing the Network Size.” In <em>Conference Proceedings 1991 IEEE International Conference on Systems, Man, and Cybernetics</em>, 1633–38 vol. 3. <a href="https://doi.org/10.1109/ICSMC.1991.169926">https://doi.org/10.1109/ICSMC.1991.169926</a>.</p>
</div>
<div id="ref-2020-kang">
<p>Kang, Minsoo, and Bohyung Han. 2020. “Operation-Aware Soft Channel Pruning Using Differentiable Masks.” <a href="http://arxiv.org/abs/2007.03938">http://arxiv.org/abs/2007.03938</a>.</p>
</div>
<div id="ref-1993-kanjilal">
<p>Kanjilal, P. P., P. K. Dey, and D. N. Banerjee. 1993. “Reduced-Size Neural Networks Through Singular Value Decomposition and Subset Selection.” <em>Electronics Letters</em> 29 (17): 1516–8. <a href="https://doi.org/10.1049/el:19931010">https://doi.org/10.1049/el:19931010</a>.</p>
</div>
<div id="ref-kaplan2020scaling">
<p>Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” <a href="http://arxiv.org/abs/2001.08361">http://arxiv.org/abs/2001.08361</a>.</p>
</div>
<div id="ref-2019-karimireddy">
<p>Karimireddy, Sai Praneeth, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. 2019. “Error Feedback Fixes SignSGD and Other Gradient Compression Schemes.” In <em>Proceedings of the Thirty-Sixth International Conference on Machine Learning</em>, 3252–61. <a href="http://arxiv.org/abs/1901.09847">http://arxiv.org/abs/1901.09847</a>.</p>
</div>
<div id="ref-1990-karnin">
<p>Karnin, E. D. 1990. “A Simple Procedure for Pruning Back-Propagation Trained Neural Networks.” <em>IEEE Transactions on Neural Networks</em> 1 (2): 239–42. <a href="https://doi.org/10.1109/72.80236">https://doi.org/10.1109/72.80236</a>.</p>
</div>
<div id="ref-Kerr14063">
<p>Kerr, Jason N. D., David Greenberg, and Fritjof Helmchen. 2005. “Imaging Input and Output of Neocortical Networks in Vivo.” <em>Proceedings of the National Academy of Sciences</em> 102 (39): 14063–8. <a href="https://doi.org/10.1073/pnas.0506029102">https://doi.org/10.1073/pnas.0506029102</a>.</p>
</div>
<div id="ref-2017-kim">
<p>Kim, D., J. Ahn, and S. Yoo. 2018. “ZeNA: Zero-Aware Neural Network Accelerator.” <em>IEEE Design & Test</em> 35 (1): 39–46. <a href="https://doi.org/10.1109/MDAT.2017.2741463">https://doi.org/10.1109/MDAT.2017.2741463</a>.</p>
</div>
<div id="ref-2015-kingma">
<p>Kingma, Diederik P, Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In <em>Advances in Neural Information Processing Systems</em>, edited by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, 28:2575–83. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf">https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf</a>.</p>
</div>
<div id="ref-2013-kingma">
<p>Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” <a href="http://arxiv.org/abs/1312.6114">http://arxiv.org/abs/1312.6114</a>.</p>
</div>
<div id="ref-2019-kodryan">
<p>Kodryan, Maxim, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. 2019. “Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks.” In <em>Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</em>, 40–48.</p>
</div>
<div id="ref-2018-konecny">
<p>Konečný, Jakub, and Peter Richtárik. 2018. “Randomized Distributed Mean Estimation: Accuracy Vs. Communication.” <em>Frontiers in Applied Mathematics and Statistics</em> 4: 62. <a href="http://arxiv.org/abs/1611.07555">http://arxiv.org/abs/1611.07555</a>.</p>
</div>
<div id="ref-10.5555/2986916.2987033">
<p>Krogh, Anders, and John A. Hertz. 1991. “A Simple Weight Decay Can Improve Generalization.” In <em>Proceedings of the 4th International Conference on Neural Information Processing Systems</em>, 950–57. NIPS’91. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2017-krueger">
<p>Krueger, David, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. 2017. “Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2018-kung">
<p>Kung, H. T., Bradley McDanel, and Sai Qian Zhang. 2018. “Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization.” <a href="http://arxiv.org/abs/1811.04770">http://arxiv.org/abs/1811.04770</a>.</p>
</div>
<div id="ref-2019-kunstner">
<p>Kunstner, Frederik, Philipp Hennig, and Lukas Balles. 2019. “Limitations of the Empirical Fisher Approximation for Natural Gradient Descent.” In <em>Advances in Neural Information Processing Systems</em>, 4156–67.</p>
</div>
<div id="ref-2020-kurtz">
<p>Kurtz, Mark, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. 2020. “Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks.” In <em>International Conference on Machine Learning</em>, 5533–43. PMLR.</p>
</div>
<div id="ref-2020-kusupati">
<p>Kusupati, Aditya, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. 2020. “Soft Threshold Weight Reparameterization for Learnable Sparsity.” <a href="http://arxiv.org/abs/2002.03231">http://arxiv.org/abs/2002.03231</a>.</p>
</div>
<div id="ref-2019-kuzmin">
<p>Kuzmin, Andrey, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, and Max Welling. 2019. “Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1912.09802">http://arxiv.org/abs/1912.09802</a>.</p>
</div>
<div id="ref-kwiatkowski2019natural">
<p>Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, et al. 2019. “Natural Questions: A Benchmark for Question Answering Research.” <em>Transactions of the Association for Computational Linguistics</em> 7: 453–66.</p>
</div>
<div id="ref-2019-lample">
<p>Lample, Guillaume, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. “Large Memory Layers with Product Keys.” <a href="http://arxiv.org/abs/1907.05242">http://arxiv.org/abs/1907.05242</a>.</p>
</div>
<div id="ref-2017-larsson">
<p>Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. 2017. “FractalNet: Ultra-Deep Neural Networks Without Residuals.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2006-lauret">
<p>Lauret, Philippe, Eric Fock, and Thierry Alex Mara. 2006. “A Node Pruning Algorithm Based on a Fourier Amplitude Sensitivity Test Method.” <em>IEEE Transactions on Neural Networks</em> 17 (2): 273–93.</p>
</div>
<div id="ref-7780804">
<p>Lavin, A., and S. Gray. 2016. “Fast Algorithms for Convolutional Neural Networks.” In <em>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 4013–21. <a href="https://doi.org/10.1109/CVPR.2016.435">https://doi.org/10.1109/CVPR.2016.435</a>.</p>
</div>
<div id="ref-2015-lebedev">
<p>Lebedev, Vadim, and Victor Lempitsky. 2015. “Fast Convnets Using Group-Wise Brain Damage.” <a href="http://arxiv.org/abs/1506.02515">http://arxiv.org/abs/1506.02515</a>.</p>
</div>
<div id="ref-1990-lecun">
<p>Le Cun, Yann, John S. Denker, and Sara A. Solla. 1990. “Optimal Brain Damage.” In <em>Advances in Neural Information Processing Systems 2</em>, 598–605. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.</p>
</div>
<div id="ref-2019-lee-init">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. 2020. “A Signal Propagation Perspective for Pruning Neural Networks at Initialization.” <a href="http://arxiv.org/abs/1906.06307">http://arxiv.org/abs/1906.06307</a>.</p>
</div>
<div id="ref-2019-lee">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity.” <a href="http://arxiv.org/abs/1810.02340">http://arxiv.org/abs/1810.02340</a>.</p>
</div>
<div id="ref-2020-lee">
<p>Lee, Namhoon, Thalaiyasingam Ajanthan, Philip H. S. Torr, and Martin Jaggi. 2020. “Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training.” <a href="http://arxiv.org/abs/2003.11316">http://arxiv.org/abs/2003.11316</a>.</p>
</div>
<div id="ref-lepikhin2020gshard">
<p>Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” <a href="http://arxiv.org/abs/2006.16668">http://arxiv.org/abs/2006.16668</a>.</p>
</div>
<div id="ref-2017-li">
<p>Li, Hao, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. “Pruning Filters for Efficient Convnets.” <a href="http://arxiv.org/abs/1608.08710">http://arxiv.org/abs/1608.08710</a>.</p>
</div>
<div id="ref-2019-li">
<p>Li, J., S. Jiang, S. Gong, J. Wu, J. Yan, G. Yan, and X. Li. 2019. “SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules.” <em>IEEE Transactions on Computers</em> 68 (11): 1663–77. <a href="https://doi.org/10.1109/TC.2019.2924215">https://doi.org/10.1109/TC.2019.2924215</a>.</p>
</div>
<div id="ref-2020-li-sac">
<p>Li, Xiaoya, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020. “SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection.” <a href="http://arxiv.org/abs/2003.09833">http://arxiv.org/abs/2003.09833</a>.</p>
</div>
<div id="ref-li2020explaining">
<p>Li, Yuanzhi, Colin Wei, and Tengyu Ma. 2020. “Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks.” <a href="http://arxiv.org/abs/1907.04595">http://arxiv.org/abs/1907.04595</a>.</p>
</div>
<div id="ref-2020-li-bits">
<p>Li, Yunqiang, Silvia Laura Pintea, and Jan van Gemert. 2021. “Less Bits Is More: How Pruning Deep Binary Networks Increases Weight Capacity.” <a href="https://openreview.net/forum?id=Hy8JM_Fvt5N">https://openreview.net/forum?id=Hy8JM_Fvt5N</a>.</p>
</div>
<div id="ref-2020-li">
<p>Li, Zhuohan, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020. “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers.” <a href="http://arxiv.org/abs/2002.11794">http://arxiv.org/abs/2002.11794</a>.</p>
</div>
<div id="ref-2019-lieberwein">
<p>Liebenwein, Lucas, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. 2020. “Provable Filter Pruning for Efficient Neural Networks.” <a href="http://arxiv.org/abs/1911.07412">http://arxiv.org/abs/1911.07412</a>.</p>
</div>
<div id="ref-lillicrap2019continuous">
<p>Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2019. “Continuous Control with Deep Reinforcement Learning.” <a href="http://arxiv.org/abs/1509.02971">http://arxiv.org/abs/1509.02971</a>.</p>
</div>
<div id="ref-lillicrap2020backpropagation">
<p>Lillicrap, Timothy P, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. 2020. “Backpropagation and the Brain.” <em>Nature Reviews Neuroscience</em>, 1–12.</p>
</div>
<div id="ref-2019-lim">
<p>Lim, Hyeontaek, David Andersen, and Michael Kaminsky. 2019. “3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning.” In <em>Proceedings of the Conference on Systems and Machine Learning</em>. <a href="http://arxiv.org/abs/1802.07389">http://arxiv.org/abs/1802.07389</a>.</p>
</div>
<div id="ref-2017-lin">
<p>Lin, Ji, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. “Runtime Neural Pruning.” In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 30:2181–91. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/a51fb975227d6640e4fe47854476d133-Paper.pdf</a>.</p>
</div>
<div id="ref-2020-lin">
<p>Lin, Tao, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020. “Dynamic Model Pruning with Feedback.” <a href="http://arxiv.org/abs/2006.07253">http://arxiv.org/abs/2006.07253</a>.</p>
</div>
<div id="ref-2018-lin">
<p>Lin, Yujun, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2018. “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.” In <em>Proceedings of the Sixth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1712.01887">http://arxiv.org/abs/1712.01887</a>.</p>
</div>
<div id="ref-2020-lin-tformer">
<p>Lin, Zi, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. 2020. “Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior.” In <em>Findings of the Association for Computational Linguistics: EMNLP 2020</em>, 719–30. <a href="http://arxiv.org/abs/2010.01791">http://arxiv.org/abs/2010.01791</a>.</p>
</div>
<div id="ref-lison2019open">
<p>Lison, Pierre, Jörg Tiedemann, Milen Kouylekov, and others. 2019. “OpenSubtitles2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora.” In <em>LREC 2018, Eleventh International Conference on Language Resources and Evaluation</em>. European Language Resources Association (ELRA).</p>
</div>
<div id="ref-2015-liu">
<p>Liu, Baoyuan, Min Wang, H. Foroosh, M. Tappen, and M. Penksy. 2015. “Sparse Convolutional Neural Networks.” In <em>2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 806–14. <a href="https://doi.org/10.1109/CVPR.2015.7298681">https://doi.org/10.1109/CVPR.2015.7298681</a>.</p>
</div>
<div id="ref-liu2018dynamic">
<p>Liu, Lanlan, and Jia Deng. 2018. “Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-Offs by Selective Execution.” <a href="http://arxiv.org/abs/1701.00299">http://arxiv.org/abs/1701.00299</a>.</p>
</div>
<div id="ref-2019-liu-dynamic">
<p>Liu, Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. 2019. “Dynamic Sparse Graph for Efficient Deep Learning.” <a href="http://arxiv.org/abs/1810.00859">http://arxiv.org/abs/1810.00859</a>.</p>
</div>
<div id="ref-2020-liu">
<p>Liu, Tianlin, and Friedemann Zenke. 2020. “Finding Trainable Sparse Networks Through Neural Tangent Transfer.” <a href="http://arxiv.org/abs/2006.08228">http://arxiv.org/abs/2006.08228</a>.</p>
</div>
<div id="ref-2018-liu-winograd">
<p>Liu, Xingyu, Jeff Pool, Song Han, and William J. Dally. 2018. “Efficient Sparse-Winograd Convolutional Neural Networks.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-2019-liu">
<p>Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.</p>
</div>
<div id="ref-2017-liu">
<p>Liu, Zhuang, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. “Learning Efficient Convolutional Networks Through Network Slimming.” <a href="http://arxiv.org/abs/1708.06519">http://arxiv.org/abs/1708.06519</a>.</p>
</div>
<div id="ref-2018-liu">
<p>Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019. “Rethinking the Value of Network Pruning.” <a href="http://arxiv.org/abs/1810.05270">http://arxiv.org/abs/1810.05270</a>.</p>
</div>
<div id="ref-2015-liu-celeba">
<p>Liu, Ziwei, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. “Deep Learning Face Attributes in the Wild.” In <em>Proceedings of the IEEE International Conference on Computer Vision</em>, 3730–38. <a href="http://arxiv.org/abs/1411.7766">http://arxiv.org/abs/1411.7766</a>.</p>
</div>
<div id="ref-2018-lobacheva">
<p>Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2018. “Bayesian Sparsification of Gated Recurrent Neural Networks.” <a href="http://arxiv.org/abs/1812.05692">http://arxiv.org/abs/1812.05692</a>.</p>
</div>
<div id="ref-2019-loshchilov">
<p>Loshchilov, Ilya, and Frank Hutter. 2019. “Decoupled Weight Decay Regularization.” In <em>Proceedings of the Seventh International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1711.05101">http://arxiv.org/abs/1711.05101</a>.</p>
</div>
<div id="ref-2017-louizos-bayes">
<p>Louizos, Christos, Karen Ullrich, and Max Welling. 2017. “Bayesian Compression for Deep Learning.” <a href="http://arxiv.org/abs/1705.08665">http://arxiv.org/abs/1705.08665</a>.</p>
</div>
<div id="ref-2018-louizos">
<p>Louizos, Christos, Max Welling, and Diederik P. Kingma. 2018. “Learning Sparse Neural Networks Through <span class="math inline"><em>L</em><sub>0</sub></span> Regularization.” <a href="http://arxiv.org/abs/1712.01312">http://arxiv.org/abs/1712.01312</a>.</p>
</div>
<div id="ref-2019-luo">
<p>Luo, Jian-Hao, and Jianxin Wu. 2019. “AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference.” <a href="http://arxiv.org/abs/1805.08941">http://arxiv.org/abs/1805.08941</a>.</p>
</div>
<div id="ref-2017-luo">
<p>Luo, Jian-Hao, Jianxin Wu, and Weiyao Lin. 2017. “ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression.” <a href="http://arxiv.org/abs/1707.06342">http://arxiv.org/abs/1707.06342</a>.</p>
</div>
<div id="ref-ly2017tutorial">
<p>Ly, Alexander, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. 2017. “A Tutorial on Fisher Information.” <a href="http://arxiv.org/abs/1705.01064">http://arxiv.org/abs/1705.01064</a>.</p>
</div>
<div id="ref-2019-lym">
<p>Lym, Sangkug, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. “PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration.” <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>, November. <a href="https://doi.org/10.1145/3295500.3356156">https://doi.org/10.1145/3295500.3356156</a>.</p>
</div>
<div id="ref-2020-madaan">
<p>Madaan, Divyam, Jinwoo Shin, and Sung Ju Hwang. 2020. “Adversarial Neural Pruning with Latent Vulnerability Suppression.” <a href="http://arxiv.org/abs/1908.04355">http://arxiv.org/abs/1908.04355</a>.</p>
</div>
<div id="ref-2017-maddison">
<p>Maddison, Chris J., Andriy Mnih, and Yee Whye Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.” <em>International Conference on Learning Representations (ICLR)</em>.</p>
</div>
<div id="ref-makhzani2015winnertakeall">
<p>Makhzani, Alireza, and Brendan Frey. 2015. “Winner-Take-All Autoencoders.” <a href="http://arxiv.org/abs/1409.2752">http://arxiv.org/abs/1409.2752</a>.</p>
</div>
<div id="ref-2020-malach">
<p>Malach, Eran, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. 2020. “Proving the Lottery Ticket Hypothesis: Pruning Is All You Need.” <a href="http://arxiv.org/abs/2002.00585">http://arxiv.org/abs/2002.00585</a>.</p>
</div>
<div id="ref-2018-malaviya">
<p>Malaviya, Chaitanya, Pedro Ferreira, and André FT Martins. 2018. “Sparse and Constrained Attention for Neural Machine Translation.” In <em>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>. <a href="http://arxiv.org/abs/1805.08241">http://arxiv.org/abs/1805.08241</a>.</p>
</div>
<div id="ref-2018-mallya">
<p>Mallya, Arun, and Svetlana Lazebnik. 2018. “PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning.” <a href="http://arxiv.org/abs/1711.05769">http://arxiv.org/abs/1711.05769</a>.</p>
</div>
<div id="ref-2018-manessi">
<p>Manessi, Franco, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schettini. 2018. “Automated Pruning for Deep Neural Network Compression.” <em>2018 24th International Conference on Pattern Recognition (ICPR)</em>, August. <a href="https://doi.org/10.1109/icpr.2018.8546129">https://doi.org/10.1109/icpr.2018.8546129</a>.</p>
</div>
<div id="ref-2017-mao">
<p>Mao, Huizi, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. 2017. “Exploring the Regularity of Sparse Structure in Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1705.08922">http://arxiv.org/abs/1705.08922</a>.</p>
</div>
<div id="ref-2015-mariet">
<p>Mariet, Zelda, and Suvrit Sra. 2017. “Diversity Networks: Neural Network Compression Using Determinantal Point Processes.” <a href="http://arxiv.org/abs/1511.05077">http://arxiv.org/abs/1511.05077</a>.</p>
</div>
<div id="ref-2015-martens">
<p>Martens, James, and Roger Grosse. 2015. “Optimizing Neural Networks with Kronecker-Factored Approximate Curvature.” <a href="http://arxiv.org/abs/1503.05671">http://arxiv.org/abs/1503.05671</a>.</p>
</div>
<div id="ref-2016-martins">
<p>Martins, Andre, and Ramon Astudillo. 2016. “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification.” In <em>International Conference on Machine Learning</em>, 1614–23. <a href="http://arxiv.org/abs/1602.02068">http://arxiv.org/abs/1602.02068</a>.</p>
</div>
<div id="ref-2019-mattson">
<p>Mattson, Peter, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, et al. 2020. “MLPerf Training Benchmark.” <a href="http://arxiv.org/abs/1910.01500">http://arxiv.org/abs/1910.01500</a>.</p>
</div>
<div id="ref-mccandlish2018empirical">
<p>McCandlish, Sam, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. “An Empirical Model of Large-Batch Training.” <a href="http://arxiv.org/abs/1812.06162">http://arxiv.org/abs/1812.06162</a>.</p>
</div>
<div id="ref-2019-mccarley">
<p>McCarley, J. S., Rishav Chakravarti, and Avirup Sil. 2020. “Structured Pruning of a BERT-Based Question Answering Model.” <a href="http://arxiv.org/abs/1910.06360">http://arxiv.org/abs/1910.06360</a>.</p>
</div>
<div id="ref-2019-mehta">
<p>Mehta, Rahul. 2019. “Sparse Transfer Learning via Winning Lottery Tickets.” <a href="http://arxiv.org/abs/1905.07785">http://arxiv.org/abs/1905.07785</a>.</p>
</div>
<div id="ref-2020-meng">
<p>Meng, Fanxu, Hao Cheng, Ke Li, Huixiang Luo, Xiaowei Guo, Guangming Lu, and Xing Sun. 2020. “Pruning Filter in Filter.” <a href="http://arxiv.org/abs/2009.14410">http://arxiv.org/abs/2009.14410</a>.</p>
</div>
<div id="ref-mhaskar2016deep">
<p>Mhaskar, Hrushikesh, and Tomaso Poggio. 2016. “Deep Vs. Shallow Networks: An Approximation Theory Perspective.” <a href="http://arxiv.org/abs/1608.03287">http://arxiv.org/abs/1608.03287</a>.</p>
</div>
<div id="ref-2019-michel">
<p>Michel, Paul, Omer Levy, and Graham Neubig. 2019. “Are Sixteen Heads Really Better Than One?” <a href="http://arxiv.org/abs/1905.10650">http://arxiv.org/abs/1905.10650</a>.</p>
</div>
<div id="ref-millidge2020predictive">
<p>Millidge, Beren, Alexander Tschantz, and Christopher L. Buckley. 2020. “Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs.” <a href="http://arxiv.org/abs/2006.04182">http://arxiv.org/abs/2006.04182</a>.</p>
</div>
<div id="ref-2017-mishra">
<p>Mishra, Asit K., Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. 2017. “WRPN: Wide Reduced-Precision Networks.” <a href="http://arxiv.org/abs/1709.01134">http://arxiv.org/abs/1709.01134</a>.</p>
</div>
<div id="ref-2018-mittal">
<p>Mittal, Deepak, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1801.10447">http://arxiv.org/abs/1801.10447</a>.</p>
</div>
<div id="ref-2018-miyato">
<p>Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. “Spectral Normalization for Generative Adversarial Networks.” In <em>Proceedings of the Sixth International Conference on Learning Representations</em>. <a href="http://arxiv.org/abs/1802.05957">http://arxiv.org/abs/1802.05957</a>.</p>
</div>
<div id="ref-2018-mocanu">
<p>Mocanu, Decebal Constantin, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. “Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science.” <em>Nature Communications</em> 9 (1): 1–12.</p>
</div>
<div id="ref-2016-molchanov-ard">
<p>Molchanov, Dmitry, Arseniy Ashuha, and Dmitry Vetrov. 2016. “Dropout-Based Automatic Relevance Determination.” In <em>Bayesian Deep Learning Workshop, NIPS</em>.</p>
</div>
<div id="ref-2017-molchanov">
<p>Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” <a href="http://arxiv.org/abs/1701.05369">http://arxiv.org/abs/1701.05369</a>.</p>
</div>
<div id="ref-2019-molchanov">
<p>Molchanov, Pavlo, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. “Importance Estimation for Neural Network Pruning.” <a href="http://arxiv.org/abs/1906.10771">http://arxiv.org/abs/1906.10771</a>.</p>
</div>
<div id="ref-2016-molchanov">
<p>Molchanov, Pavlo, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. “Pruning Convolutional Neural Networks for Resource Efficient Inference.” <a href="http://arxiv.org/abs/1611.06440">http://arxiv.org/abs/1611.06440</a>.</p>
</div>
<div id="ref-1991-moody">
<p>Moody, John E. 1991. “Note on Generalization, Regularization and Architecture Selection in Nonlinear Learning Systems.” In <em>Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop</em>, 1–10. IEEE.</p>
</div>
<div id="ref-2020-morcos">
<p>Morcos, Ari S., Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. “One Ticket to Win Them All: Generalizing Lottery Ticket Initializations Across Datasets and Optimizers.” <a href="http://arxiv.org/abs/1906.02773">http://arxiv.org/abs/1906.02773</a>.</p>
</div>
<div id="ref-2019-mostafa">
<p>Mostafa, Hesham, and Xin Wang. 2019. “Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization.” <a href="http://arxiv.org/abs/1902.05967">http://arxiv.org/abs/1902.05967</a>.</p>
</div>
<div id="ref-1988-mozer">
<p>Mozer, Michael C, and Paul Smolensky. 1988. “Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment.” <em>Advances in Neural Information Processing Systems</em> 1: 107–15.</p>
</div>
<div id="ref-2011-mrazova">
<p>Mrázová, I., and Z. Reitermanová. 2011. “A New Sensitivity-Based Pruning Technique for Feed-Forward Neural Networks That Improves Generalization.” In <em>The 2011 International Joint Conference on Neural Networks</em>, 2143–50. <a href="https://doi.org/10.1109/IJCNN.2011.6033493">https://doi.org/10.1109/IJCNN.2011.6033493</a>.</p>
</div>
<div id="ref-2006-mukherjee">
<p>Mukherjee, Sayan, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. “Learning Theory: Stability Is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization.” <em>Advances in Computational Mathematics</em> 25 (1-3): 161–93.</p>
</div>
<div id="ref-2020-mussay">
<p>Mussay, Ben, Daniel Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. 2020. “Data-Independent Structured Pruning of Neural Networks via Coresets.” <a href="http://arxiv.org/abs/2008.08316">http://arxiv.org/abs/2008.08316</a>.</p>
</div>
<div id="ref-2017-narang">
<p>Narang, Sharan, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017. “Exploring Sparsity in Recurrent Neural Networks.” <a href="http://arxiv.org/abs/1704.05119">http://arxiv.org/abs/1704.05119</a>.</p>
</div>
<div id="ref-2008-narasimha">
<p>Narasimha, Pramod L., Walter H. Delashmit, Michael T. Manry, Jiang Li, and Francisco Maldonado. 2008. “An Integrated Growing-Pruning Method for Feedforward Network Training.” <em>Neurocomputing</em> 71 (13): 2831–47. <a href="https://doi.org/10.1016/j.neucom.2007.08.026">https://doi.org/10.1016/j.neucom.2007.08.026</a>.</p>
</div>
<div id="ref-2017-neklyudov">
<p>Neklyudov, Kirill, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Structured Bayesian Pruning via Log-Normal Multiplicative Noise.” <a href="http://arxiv.org/abs/1705.07283">http://arxiv.org/abs/1705.07283</a>.</p>
</div>
<div id="ref-2020-neyshabur">
<p>Neyshabur, Behnam. 2020. “Towards Learning Convolutions from Scratch.” <a href="http://arxiv.org/abs/2007.13657">http://arxiv.org/abs/2007.13657</a>.</p>
</div>
<div id="ref-neyshabur2018understanding">
<p>Neyshabur, Behnam, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. 2018. “Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks.” <a href="http://arxiv.org/abs/1805.12076">http://arxiv.org/abs/1805.12076</a>.</p>
</div>
<div id="ref-2010-ngiam">
<p>Ngiam, J., Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng. 2010. “Tiled Convolutional Neural Networks.” In <em>Advances in Neural Information Processing Systems 23</em>, 1279–87.</p>
</div>
<div id="ref-2017-niculae">
<p>Niculae, Vlad, and Mathieu Blondel. 2017. “A Regularized Framework for Sparse and Structured Neural Attention.” In <em>Advances in Neural Information Processing Systems</em>, 3338–48. <a href="http://arxiv.org/abs/1705.07704">http://arxiv.org/abs/1705.07704</a>.</p>
</div>
<div id="ref-2009-nilsson">
<p>Nilsson, Nils J. 2009. <em>The Quest for Artificial Intelligence: A History of Ideas and Achievements</em>. Cambridge University Press.</p>
</div>
<div id="ref-2020-niu">
<p>Niu, Yue, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. 2020. “Reuse Kernels or Activations? A Flexible Dataflow for Low-Latency Spectral CNN Acceleration.” In <em>Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</em>, 266–76. FPGA ’20. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3373087.3375302">https://doi.org/10.1145/3373087.3375302</a>.</p>
</div>
<div id="ref-2019-niu">
<p>Niu, Yue, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal Kannan, Yanzhi Wang, and Viktor Prasanna. 2019. “SPEC2: SPECtral Sparse CNN Accelerator on FPGAs.” <a href="http://arxiv.org/abs/1910.11103">http://arxiv.org/abs/1910.11103</a>.</p>
</div>
<div id="ref-noh2015learning">
<p>Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. 2015. “Learning Deconvolution Network for Semantic Segmentation.” <a href="http://arxiv.org/abs/1505.04366">http://arxiv.org/abs/1505.04366</a>.</p>
</div>
<div id="ref-1992-nowlan">
<p>Nowlan, Steven J, and Geoffrey E Hinton. 1992. “Simplifying Neural Networks by Soft Weight-Sharing.” <em>Neural Computation</em> 4 (4): 473–93.</p>
</div>
<div id="ref-a100">
<p>NVIDIA. 2020. “NVIDIA A100 Tensor Core GPU Architecture.”</p>
</div>
<div id="ref-1996-olshausen">
<p>Olshausen, Bruno A, and David J Field. 1996. “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images.” <em>Nature</em> 381 (6583): 607–9.</p>
</div>
<div id="ref-2020-orseau">
<p>Orseau, Laurent, Marcus Hutter, and Omar Rivasplata. 2020. “Logarithmic Pruning Is All You Need.” <a href="http://arxiv.org/abs/2006.12156">http://arxiv.org/abs/2006.12156</a>.</p>
</div>
<div id="ref-2019-osawa">
<p>Osawa, Kazuki, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2019. “Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks.” <em>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, June. <a href="https://doi.org/10.1109/cvpr.2019.01264">https://doi.org/10.1109/cvpr.2019.01264</a>.</p>
</div>
<div id="ref-2016-pan">
<p>Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks.” <a href="http://arxiv.org/abs/1606.07326">http://arxiv.org/abs/1606.07326</a>.</p>
</div>
<div id="ref-2017-parashar">
<p>Parashar, Angshuman, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. “SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks.” <a href="http://arxiv.org/abs/1708.04485">http://arxiv.org/abs/1708.04485</a>.</p>
</div>
<div id="ref-2016-park">
<p>Park, Jongsoo, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. “Faster CNNs with Direct Sparse Convolutions and Guided Pruning.” <a href="http://arxiv.org/abs/1608.01409">http://arxiv.org/abs/1608.01409</a>.</p>
</div>
<div id="ref-2018-parmar">
<p>Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. “Image Transformer.” In <em>International Conference on Machine Learning</em>, 4055–64. <a href="http://arxiv.org/abs/1802.05751">http://arxiv.org/abs/1802.05751</a>.</p>
</div>
<div id="ref-NIPS1995_3473decc">
<p>Pedersen, Morten, Lars Hansen, and Jan Larsen. 1996. “Pruning with Generalization Based Weight Saliencies: λOBD, λOBS.” In <em>Advances in Neural Information Processing Systems</em>, edited by D. Touretzky, M. C. Mozer, and M. Hasselmo, 8:521–27. MIT Press. <a href="https://proceedings.neurips.cc/paper/1995/file/3473decccb0509fb264818a7512a8b9b-Paper.pdf">https://proceedings.neurips.cc/paper/1995/file/3473decccb0509fb264818a7512a8b9b-Paper.pdf</a>.</p>
</div>
<div id="ref-2020-pensia">
<p>Pensia, Ankit, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. 2020. “Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization Is Sufficient.” <a href="http://arxiv.org/abs/2006.07990">http://arxiv.org/abs/2006.07990</a>.</p>
</div>
<div id="ref-plummer2020shapeshifter">
<p>Plummer, Bryan A., Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. 2020. “Shapeshifter Networks: Cross-Layer Parameter Sharing for Scalable and Effective Deep Learning.” <a href="http://arxiv.org/abs/2006.10598">http://arxiv.org/abs/2006.10598</a>.</p>
</div>
<div id="ref-2015-polyak">
<p>Polyak, A., and L. Wolf. 2015. “Channel-Level Acceleration of Deep Face Representations.” <em>IEEE Access</em> 3: 2163–75. <a href="https://doi.org/10.1109/ACCESS.2015.2494536">https://doi.org/10.1109/ACCESS.2015.2494536</a>.</p>
</div>
<div id="ref-10.1145/356616.356618">
<p>Pooch, Udo W., and Al Nieder. 1973. “A Survey of Indexing Techniques for Sparse Matrices.” <em>ACM Comput. Surv.</em> 5 (2): 109–33. <a href="https://doi.org/10.1145/356616.356618">https://doi.org/10.1145/356616.356618</a>.</p>
</div>
<div id="ref-2017-prabhu">
<p>Prabhu, Ameya, Girish Varma, and Anoop Namboodiri. 2018. “Deep Expander Networks: Efficient Deep Networks from Graph Theory.” <a href="http://arxiv.org/abs/1711.08757">http://arxiv.org/abs/1711.08757</a>.</p>
</div>
<div id="ref-2020-prasanna">
<p>Prasanna, Sai, Anna Rogers, and Anna Rumshisky. 2020. “When BERT Plays the Lottery, All Tickets Are Winning.” <a href="http://arxiv.org/abs/2005.00561">http://arxiv.org/abs/2005.00561</a>.</p>
</div>
<div id="ref-1997-prechelt">
<p>Prechelt, Lutz. 1997. “Connection Pruning with Static and Adaptive Pruning Schedules.” <em>Neurocomputing</em> 16 (1): 49–61. <a href="https://doi.org/10.1016/S0925-2312(96)00054-9">https://doi.org/10.1016/S0925-2312(96)00054-9</a>.</p>
</div>
<div id="ref-2020-qin">
<p>Qin, E., A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna. 2020. “SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training.” In <em>2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)</em>, 58–70. <a href="https://doi.org/10.1109/HPCA47549.2020.00015">https://doi.org/10.1109/HPCA47549.2020.00015</a>.</p>
</div>
<div id="ref-2020-raihan">
<p>Raihan, Md Aamir, and Tor M. Aamodt. 2020. “Sparse Weight Activation Training.” <a href="http://arxiv.org/abs/2001.01969">http://arxiv.org/abs/2001.01969</a>.</p>
</div>
<div id="ref-10.1145/3386263.3407651">
<p>Rakin, Adnan Siraj, Zhezhi He, Li Yang, Yanzhi Wang, Liqiang Wang, and Deliang Fan. 2020. “Robust Sparse Regularization: Defending Adversarial Attacks via Regularized Sparse Network.” In <em>Proceedings of the 2020 on Great Lakes Symposium on VLSI</em>, 125–30. GLSVLSI ’20. New York, NY, USA: Association for Computing Machinery. <a href="https://doi.org/10.1145/3386263.3407651">https://doi.org/10.1145/3386263.3407651</a>.</p>
</div>
<div id="ref-ramanujan2020whats">
<p>Ramanujan, Vivek, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. “What’s Hidden in a Randomly Weighted Neural Network?” <a href="http://arxiv.org/abs/1911.13299">http://arxiv.org/abs/1911.13299</a>.</p>
</div>
<div id="ref-rasmussen2001occam">
<p>Rasmussen, Carl Edward, and Zoubin Ghahramani. 2001. “Occam’s Razor.” In <em>Advances in Neural Information Processing Systems</em>, 294–300.</p>
</div>
<div id="ref-2016-reagen">
<p>Reagen, B., P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Wei, and D. Brooks. 2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In <em>2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>, 267–78. <a href="https://doi.org/10.1109/ISCA.2016.32">https://doi.org/10.1109/ISCA.2016.32</a>.</p>
</div>
<div id="ref-1993-reed">
<p>Reed, R. 1993. “Pruning Algorithms-a Survey.” <em>IEEE Transactions on Neural Networks</em> 4 (5): 740–47. <a href="https://doi.org/10.1109/72.248452">https://doi.org/10.1109/72.248452</a>.</p>
</div>
<div id="ref-2020-renda">
<p>Renda, Alex, Jonathan Frankle, and Michael Carbin. 2020. “Comparing Rewinding and Fine-Tuning in Neural Network Pruning.” <a href="http://arxiv.org/abs/2003.02389">http://arxiv.org/abs/2003.02389</a>.</p>
</div>
<div id="ref-2019-renggli">
<p>Renggli, Cèdric, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, and Torsten Hoefler. 2019. “SparCML: High-Performance Sparse Communication for Machine Learning.” In <em>Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis</em>, 1–15. <a href="http://arxiv.org/abs/1802.08021">http://arxiv.org/abs/1802.08021</a>.</p>
</div>
<div id="ref-reuther2020survey">
<p>Reuther, Albert, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2020. “Survey of Machine Learning Accelerators.” <a href="http://arxiv.org/abs/2009.00993">http://arxiv.org/abs/2009.00993</a>.</p>
</div>
<div id="ref-2014-rezende">
<p>Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Variational Inference in Deep Latent Gaussian Models.” In <em>International Conference on Machine Learning</em>. Vol. 2.</p>
</div>
<div id="ref-2018-rhu">
<p>Rhu, Minsoo, Mike O’Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. “Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks.” In <em>2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)</em>, 78–91. IEEE.</p>
</div>
<div id="ref-2021-rogers">
<p>Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2021. “A Primer in BERTology: What We Know About How BERT Works.” <em>Transactions of the Association for Computational Linguistics</em> 8: 842–66. <a href="http://arxiv.org/abs/2002.12327">http://arxiv.org/abs/2002.12327</a>.</p>
</div>
<div id="ref-rosenbaum2017routing">
<p>Rosenbaum, Clemens, Tim Klinger, and Matthew Riemer. 2017. “Routing Networks: Adaptive Selection of Non-Linear Functions for Multi-Task Learning.” <a href="http://arxiv.org/abs/1711.01239">http://arxiv.org/abs/1711.01239</a>.</p>