From 40580a763323719a8fd8d1a00c41eb7f72518dfb Mon Sep 17 00:00:00 2001
From: wizardforcel <562826179@qq.com>
Date: Thu, 8 Feb 2024 18:51:20 +0800
Subject: [PATCH] 2024-02-08 18:51:18

---
 totrans/fund-dl_13.yaml | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
diff --git a/totrans/fund-dl_13.yaml b/totrans/fund-dl_13.yaml
index be23817..f24fd06 100644
--- a/totrans/fund-dl_13.yaml
+++ b/totrans/fund-dl_13.yaml
@@ -2273,11 +2273,13 @@
   id: totrans-195
   prefs: []
   type: TYPE_NORMAL
+  zh: 我们还可以使用之前讨论过的采样技术来构造一个随机策略，使其有时偏离Q函数的建议，从而调节智能体的探索程度。
 - en: DQN and the Markov Assumption
   id: totrans-196
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: DQN和马尔可夫假设
 - en: DQN is still a Markov decision process that relies on the *Markov assumption*,
     which assumes that the next state <math alttext="s Subscript i plus 1"><msub><mi>s</mi>
     <mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub></math> depends only on the
@@ -2292,11 +2294,16 @@
   id: totrans-197
   prefs: []
   type: TYPE_NORMAL
+  zh: DQN仍然是一个依赖于*马尔可夫假设*的马尔可夫决策过程，该假设假定下一个状态<math alttext="s Subscript i plus 1"><msub><mi>s</mi>
+    <mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub></math>仅取决于当前状态<math alttext="s
+    Subscript i"><msub><mi>s</mi> <mi>i</mi></msub></math>和动作<math alttext="a Subscript
+    i"><msub><mi>a</mi> <mi>i</mi></msub></math>，而不取决于任何先前的状态或动作。这一假设在许多游戏状态无法用单个帧概括的环境中并不成立。例如，在Pong中，球的速度（游戏取胜的重要因素）无法从任何单个游戏帧中得到。马尔可夫假设使决策过程的建模更简单、更可靠，但往往以牺牲建模能力为代价。
 - en: DQN’s Solution to the Markov Assumption
   id: totrans-198
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: DQN对马尔可夫假设的解决方案
 - en: DQN solves this problem by utilizing *state history*. Instead of processing
     one game frame as the game’s state, DQN considers the past four game frames as
     the game’s current state. This allows DQN to utilize time-dependent information.
@@ -2305,16 +2312,19 @@
   id: totrans-199
   prefs: []
   type: TYPE_NORMAL
+  zh: DQN通过利用*状态历史*来解决这个问题。DQN不是将单个游戏帧作为游戏状态，而是将过去四个游戏帧视为游戏的当前状态。这使得DQN能够利用与时间相关的信息。这多少是一种工程上的权宜之计，我们将在本章末尾讨论处理状态序列的更好方法。
 - en: Playing Breakout with DQN
   id: totrans-200
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: 使用DQN玩Breakout
 - en: 'Let’s pull all of what we learned together and actually go about implementing
     DQN to play Breakout. We start out by defining our `DQNAgent`:'
   id: totrans-201
   prefs: []
   type: TYPE_NORMAL
+  zh: 让我们把学到的所有内容整合起来，动手实现DQN来玩Breakout。我们首先定义我们的`DQNAgent`：
 - en: '[PRE7]'
   id: totrans-202
   prefs: []
@@ -2325,11 +2335,13 @@
   id: totrans-203
   prefs: []
   type: TYPE_NORMAL
+  zh: 这个类中有很多内容，让我们在以下部分中逐一解释。
 - en: Building Our Architecture
   id: totrans-204
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: 构建我们的架构
 - en: 'We build our two Q-networks: the prediction network and the target Q-network.
     Notice how they have the same architecture definition, since they are the same
     network, with the target Q just having delayed parameter updates. Since we are
@@ -2339,11 +2351,13 @@
   id: totrans-205
   prefs: []
   type: TYPE_NORMAL
+  zh: 我们构建两个Q网络：预测网络和目标Q网络。请注意它们具有相同的架构定义，因为它们是相同的网络，只是目标Q具有延迟的参数更新。由于我们正在学习从纯像素输入中玩Breakout，我们的游戏状态是一个像素数组。我们将这个图像通过三个卷积层，然后两个全连接层，以产生我们每个潜在动作的Q值。
 - en: Stacking Frames
   id: totrans-206
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: 堆叠帧
 - en: You may notice that our state input is actually of size `[None, self.history_length,
     self.screen_height, self.screen_width]`. Remember, in order to model and capture
     time-dependent state variables like speed, DQN uses not just one image, but a
@@ -2354,16 +2368,20 @@
   id: totrans-207
   prefs: []
   type: TYPE_NORMAL
+  zh: 您可能注意到我们的状态输入实际上是大小为`[None, self.history_length, self.screen_height, self.screen_width]`。记住，为了建模和捕捉像速度这样的时间相关状态变量，DQN不仅使用一个图像，而是一组连续的图像，也称为*历史*。这些连续的图像中的每一个被视为一个单独的通道。我们使用辅助函数`process_state_into_stacked_frames(self,
+    frame, past_frames, past_state=None)`构建这些堆叠帧。
 - en: Setting Up Training Operations
   id: totrans-208
   prefs:
   - PREF_H2
   type: TYPE_NORMAL
+  zh: 设置训练操作
 - en: 'Our loss function is derived from our objective expression from earlier in
     this chapter:'
   id: totrans-209
   prefs: []
   type: TYPE_NORMAL
+  zh: 我们的损失函数源自本章前面的目标表达式：
 - en: <math><mrow><msub><mo form="prefix" movablelimits="true">min</mo> <mi>θ</mi></msub>
     <msub><mo>∑</mo> <mrow><mi>e</mi><mo>∈</mo><mi>E</mi></mrow></msub> <msubsup><mo>∑</mo>
     <mrow><mi>t</mi><mo>=</mo><mn>0</mn></mrow> <mi>T</mi></msubsup> <mover accent="true"><mi>Q</mi>
@@ -2377,6 +2395,16 @@
   id: totrans-210
   prefs: []
   type: TYPE_NORMAL
+  zh: <math><mrow><msub><mo form="prefix" movablelimits="true">min</mo> <mi>θ</mi></msub>
+    <msub><mo>∑</mo> <mrow><mi>e</mi><mo>∈</mo><mi>E</mi></mrow></msub> <msubsup><mo>∑</mo>
+    <mrow><mi>t</mi><mo>=</mo><mn>0</mn></mrow> <mi>T</mi></msubsup> <mover accent="true"><mi>Q</mi>
+    <mo>^</mo></mover> <mrow><mo>(</mo> <msub><mi>s</mi> <mi>t</mi></msub> <mo>,</mo>
+    <msub><mi>a</mi> <mi>t</mi></msub> <mo>|</mo> <mi>θ</mi> <mo>)</mo></mrow> <mo>-</mo>
+    <mfenced close=")" open="(" separators=""><msub><mi>r</mi> <mi>t</mi></msub> <mo>+</mo>
+    <mi>γ</mi> <msub><mo form="prefix" movablelimits="true">max</mo> <msup><mi>a</mi>
+    <mo>'</mo></msup></msub> <mover accent="true"><mi>Q</mi> <mo>^</mo></mover> <mrow><mo>(</mo><msub><mi>s</mi>
+    <mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub> <mo>,</mo><msup><mi>a</mi>
+    <mo>'</mo></msup> <mo>|</mo><mi>θ</mi><mo>)</mo></mrow></mfenced></mrow></math>
 - en: We want our prediction network to equal our target network, plus the return
     at the current time step. We can express this in pure PyTorch code as the difference
     between the output of our prediction network and the output of our target network.