Commit 1f77467
wizardforcel committed Feb 8, 2024 (2024-02-08 18:18:17)
1 parent 51f110c

Showing 1 changed file with 15 additions and 0 deletions.
15 changes: 15 additions & 0 deletions totrans/fund-dl_06.yaml
@@ -749,16 +749,19 @@
id: totrans-85
prefs: []
type: TYPE_NORMAL
zh: One of the major breakthroughs in modern deep network optimization was the advent of learning rate adaptation. The basic concept behind learning rate adaptation is that the optimal learning rate is appropriately modified over the course of learning to achieve good convergence properties. In the following sections we will discuss AdaGrad, RMSProp, and Adam, three of the most popular adaptive learning rate algorithms.
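(Aside, not part of the committed file: a minimal sketch of how these three adaptive optimizers can be instantiated in PyTorch; the model and learning rate values below are placeholder assumptions.)

import torch

model = torch.nn.Linear(10, 1)  # placeholder model standing in for any nn.Module

# Each optimizer adapts the effective step size per parameter using gradient history.
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001)
adam = torch.optim.Adam(model.parameters(), lr=0.001)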
- en: '![](Images/fdl2_0610.png)'
id: totrans-86
prefs: []
type: TYPE_IMG
zh: '![](Images/fdl2_0610.png)'
- en: Figure 6-10\. The method of steepest descent often zigzags; conjugate descent
attempts to remedy this issue
id: totrans-87
prefs:
- PREF_H6
type: TYPE_NORMAL
zh: In its original form, BFGS has a significant memory footprint, but recent work has produced a more memory-efficient version known as *L-BFGS*.[8]
- en: An alternative optimization algorithm known as the *Broyden–Fletcher–Goldfarb–Shanno*
(BFGS) algorithm attempts to compute the inverse of the Hessian matrix iteratively
and use the inverse Hessian to more effectively optimize the parameter vector.^([7](ch06.xhtml#idm45934167533568))
@@ -767,18 +770,21 @@
id: totrans-88
prefs: []
type: TYPE_NORMAL
zh: An alternative optimization algorithm, known as the *Broyden–Fletcher–Goldfarb–Shanno* (BFGS) algorithm, attempts to compute the inverse of the Hessian matrix iteratively and use the inverse Hessian to optimize the parameter vector more effectively.[7]
- en: In general, while these methods hold some promise, second-order methods are
still an area of active research and are unpopular among practitioners. PyTorch
does, however, support L-BFGS as well as other second-order methods, such as Averaged
Stochastic Gradient Descent, for your own experimentation.
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: In general, while these methods hold some promise, second-order methods are still an area of active research and remain unpopular among practitioners. PyTorch does, however, support L-BFGS as well as other second-order methods, such as Averaged Stochastic Gradient Descent, for your own experimentation.
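(Aside, not part of the committed file: a minimal sketch of experimenting with L-BFGS in PyTorch, which expects a closure that re-evaluates the loss on each step; the model, data, and learning rate are placeholder assumptions. torch.optim.ASGD exposes Averaged Stochastic Gradient Descent in the same way.)

import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0)

def closure():
    # L-BFGS calls this repeatedly to re-evaluate the loss and gradients.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)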
- en: Learning Rate Adaptation
id: totrans-90
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: The first algorithm we'll discuss is AdaGrad, which attempts to adapt the global learning rate over time using an accumulation of historical gradients, first proposed by Duchi et al. in 2011. Specifically, we keep track of a learning rate for each parameter. This learning rate is scaled inversely with the square root of the sum of the squares of all of that parameter's historical gradients.
- en: As we have discussed previously, another major challenge for training deep networks
is appropriately selecting the learning rate. Choosing the correct learning rate
has long been one of the most troublesome aspects of training deep networks because
@@ -788,6 +794,7 @@
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: The first is *conjugate gradient descent*, which arose from attempts to improve on the naive method of steepest descent. In steepest descent, we compute the direction of the gradient and then perform a line search to find the minimum along that direction. We jump to that minimum, then recompute the gradient to determine the direction of the next line search. It turns out this method ends up zigzagging quite a bit, as shown in [Figure 6-10](#fig0610), because each time we move in the direction of steepest descent, we undo a little bit of progress in another direction. The remedy is to move in a *conjugate direction* relative to the previously chosen direction, rather than in the direction of steepest descent. The conjugate direction is chosen by using an indirect approximation of the Hessian to linearly combine the gradient with our previous direction. With slight modification, this method generalizes to the nonconvex error surfaces we find in deep networks.[6]
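(Aside, not part of the committed file: a small NumPy sketch contrasting steepest descent with exact line search against conjugate gradient on a quadratic f(x) = 0.5 xᵀAx - bᵀx; the matrix and iteration counts are arbitrary assumptions chosen to show the zigzagging.)

import numpy as np

A = np.array([[3.0, 0.0], [0.0, 0.1]])  # ill-conditioned quadratic, so steepest descent zigzags
b = np.zeros(2)
x_sd = np.array([1.0, 1.0])             # steepest descent iterate
x_cg = np.array([1.0, 1.0])             # conjugate gradient iterate

# Steepest descent: follow the negative gradient with an exact line search each step.
for _ in range(10):
    g = A @ x_sd - b                    # gradient of the quadratic
    alpha = (g @ g) / (g @ (A @ g))     # exact line-search step size
    x_sd = x_sd - alpha * g             # each step partially undoes the previous one

# Conjugate gradient: linearly combine the gradient with the previous direction.
r = b - A @ x_cg                        # residual (negative gradient)
d = r.copy()                            # first direction is the steepest-descent direction
for _ in range(2):                      # converges in at most n steps on an n-dimensional quadratic
    alpha = (r @ r) / (d @ (A @ d))
    x_cg = x_cg + alpha * d
    r_new = r - alpha * (A @ d)
    beta = (r_new @ r_new) / (r @ r)    # weight for mixing in the previous direction
    d = r_new + beta * d                # conjugate direction with respect to A
    r = r_new

print(x_sd, x_cg)                       # x_cg lands at the minimum (the origin) up to rounding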
- en: One of the major breakthroughs in modern deep network optimization was the advent
of learning rate adaption. The basic concept behind learning rate adaptation is
that the optimal learning rate is appropriately modified over the span of learning
@@ -797,11 +804,13 @@
id: totrans-92
prefs: []
type: TYPE_NORMAL
zh: Learning Rate Adaptation
- en: AdaGrad—Accumulating Historical Gradients
id: totrans-93
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: AdaGrad - Accumulating Historical Gradients
- en: The first algorithm we’ll discuss is AdaGrad, which attempts to adapt the global
learning rate over time using an accumulation of the historical gradients, first
proposed by Duchi et al. in 2011.^([9](ch06.xhtml#idm45934166863168)) Specifically,
@@ -811,6 +820,7 @@
id: totrans-94
prefs: []
type: TYPE_NORMAL
zh: Figure 6-10. The method of steepest descent often zigzags; conjugate descent attempts to remedy this issue
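(Aside, not part of the committed file: AdaGrad is available as torch.optim.Adagrad and is used like any other PyTorch optimizer; the model, data, and learning rate below are placeholder assumptions.)

import torch

model = torch.nn.Linear(10, 1)                               # placeholder model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10), torch.randn(32, 1)               # placeholder batch

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()        # computes the gradients
optimizer.step()       # accumulates squared gradients and applies the per-parameter scaled update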
- en: 'We can express this mathematically. We initialize a gradient accumulation vector 
<math alttext="bold r 0 equals bold 0"><mrow><msub><mi>𝐫</mi> <mn>0</mn></msub>
<mo>=</mo> <mn mathvariant="bold">0</mn></mrow></math> . At every step, we accumulate
@@ -819,6 +829,10 @@
id: totrans-95
prefs: []
type: TYPE_NORMAL
zh: 'As we have discussed previously, another major challenge for training deep networks is appropriately selecting the learning rate. Choosing the correct learning rate has long been one of the most troublesome aspects of training deep networks because it has a major impact on network performance. A learning rate that is too small does not learn quickly enough, but a learning rate that is too large may struggle to converge close to a local minimum or in ill-conditioned regions. We can express this mathematically. We initialize a gradient accumulation vector <math
  alttext="bold r 0 equals bold 0"><mrow><msub><mi>𝐫</mi> <mn>0</mn></msub> <mo>=</mo>
  <mn mathvariant="bold">0</mn></mrow> </math>. At every step, we accumulate the square of every gradient parameter as follows (where the <math
  alttext="circled-dot"><mo>⊙</mo></math> operation is element-wise tensor multiplication):'
- en: <math alttext="bold r Subscript i Baseline equals bold r Subscript i minus 1
Baseline plus bold g Subscript i Baseline circled-dot bold g Subscript i"><mrow><msub><mi>𝐫</mi>
<mi>i</mi></msub> <mo>=</mo> <msub><mi>𝐫</mi> <mrow><mi>i</mi><mo>-</mo><mn>1</mn></mrow></msub>
@@ -836,6 +850,7 @@
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: Then we compute the update as usual, except that our global learning rate <math alttext="epsilon"><mi>ϵ</mi></math> is divided by the square root of the gradient accumulation vector:
- en: <math alttext="theta Subscript i Baseline equals theta Subscript i minus 1 Baseline
minus StartFraction epsilon Over delta circled-plus StartRoot bold r Subscript
i Baseline EndRoot EndFraction circled-dot bold g"><mrow><msub><mi>θ</mi> <mi>i</mi></msub>
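(Aside, not part of the committed file: a minimal NumPy sketch of these two update equations, with eps standing for the global learning rate ϵ and delta for the small stability constant; the toy loss and values are placeholder assumptions.)

import numpy as np

def adagrad_update(theta, grad, r, eps=0.01, delta=1e-7):
    # r_i = r_{i-1} + g ⊙ g  (element-wise accumulation of squared gradients)
    r = r + grad * grad
    # theta_i = theta_{i-1} - eps / (delta + sqrt(r_i)) ⊙ g  (per-parameter scaled step)
    theta = theta - (eps / (delta + np.sqrt(r))) * grad
    return theta, r

# Placeholder usage on a toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 3.0])
r = np.zeros_like(theta)               # gradient accumulation vector r_0 = 0
for _ in range(100):
    grad = theta
    theta, r = adagrad_update(theta, grad, r)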