Commit

2024-02-08 18:51:18
wizardforcel committed Feb 8, 2024
1 parent ee69255 commit 40580a7
Showing 1 changed file with 28 additions and 0 deletions.
28 changes: 28 additions & 0 deletions totrans/fund-dl_13.yaml
@@ -2273,11 +2273,13 @@
id: totrans-195
prefs: []
type: TYPE_NORMAL
zh: We can also use the sampling techniques discussed earlier to devise a stochastic policy that occasionally deviates from the Q-function's recommendations, so that we can vary how much exploration our agent does.
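One common way to realize such a stochastic policy is epsilon-greedy action selection. The sketch below is an illustrative example rather than the book's code; `q_values` and `epsilon` are assumed names:

```python
import random

import torch


def select_action(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick an action from a 1-D tensor of Q-values.

    With probability `epsilon` we explore with a uniformly random action;
    otherwise we exploit by taking the action with the highest Q-value.
    """
    num_actions = q_values.shape[0]
    if random.random() < epsilon:
        return random.randrange(num_actions)  # explore
    return int(torch.argmax(q_values).item())  # exploit
```

Annealing `epsilon` from a high value toward a small one over training is a typical way to shift the agent from exploration toward exploitation.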
- en: DQN and the Markov Assumption
id: totrans-196
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: DQN and the Markov Assumption
- en: DQN is still a Markov decision process that relies on the *Markov assumption*,
which assumes that the next state <math alttext="s Subscript i plus 1"><msub><mi>s</mi>
<mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub></math> depends only on the
@@ -2292,11 +2294,16 @@
id: totrans-197
prefs: []
type: TYPE_NORMAL
zh: DQN is still a Markov decision process that relies on the *Markov assumption*, which assumes that the next state <math alttext="s Subscript i plus 1"><msub><mi>s</mi>
<mrow><mi>i</mi><mo>+</mo><mn>1</mn></mrow></msub></math> depends only on the current state <math alttext="s Subscript i"><msub><mi>s</mi> <mi>i</mi></msub></math> and action <math alttext="a Subscript
i"><msub><mi>a</mi> <mi>i</mi></msub></math>, and not on any prior states or actions. This assumption does not hold for many environments in which the game state cannot be summarized in a single frame. For example, in Pong, the velocity of the ball (a crucial factor in playing the game successfully) is not captured in any single game frame. The Markov assumption makes modeling the decision process simpler and more reliable, but usually at a loss of modeling power.
- en: DQN’s Solution to the Markov Assumption
id: totrans-198
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: DQN's Solution to the Markov Assumption
- en: DQN solves this problem by utilizing *state history*. Instead of processing
one game frame as the game’s state, DQN considers the past four game frames as
the game’s current state. This allows DQN to utilize time-dependent information.
@@ -2305,16 +2312,19 @@
id: totrans-199
prefs: []
type: TYPE_NORMAL
zh: DQN solves this problem by utilizing *state history*. Instead of treating a single game frame as the game's state, DQN considers the past four game frames as the game's current state. This allows DQN to make use of time-dependent information. This is a bit of an engineering hack, and we will discuss better methods for handling sequences of states at the end of this chapter.
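A minimal sketch of this idea, assuming grayscale 84×84 frames and a fixed-length buffer; the `FrameHistory` class and its methods are illustrative names, not the book's code:

```python
from collections import deque

import numpy as np


class FrameHistory:
    """Keep the last `history_length` frames and expose them together as one state."""

    def __init__(self, history_length: int = 4, height: int = 84, width: int = 84):
        blank = np.zeros((height, width), dtype=np.float32)
        self.frames = deque([blank] * history_length, maxlen=history_length)

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Append the newest frame (dropping the oldest) and return the stacked state."""
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)  # shape: [history_length, height, width]
```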
- en: Playing Breakout with DQN
id: totrans-200
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: Playing Breakout with DQN
- en: 'Let’s pull all of what we learned together and actually go about implementing
DQN to play Breakout. We start out by defining our `DQNAgent`:'
id: totrans-201
prefs: []
type: TYPE_NORMAL
zh: Let's pull together everything we have learned and actually go about implementing DQN to play Breakout. We start by defining our `DQNAgent`:
- en: '[PRE7]'
id: totrans-202
prefs: []
@@ -2325,11 +2335,13 @@
id: totrans-203
prefs: []
type: TYPE_NORMAL
zh: There is a lot going on in this class, so let's walk through it piece by piece in the following sections.
- en: Building Our Architecture
id: totrans-204
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: Building Our Architecture
- en: 'We build our two Q-networks: the prediction network and the target Q-network.
Notice how they have the same architecture definition, since they are the same
network, with the target Q just having delayed parameter updates. Since we are
@@ -2339,11 +2351,13 @@
id: totrans-205
prefs: []
type: TYPE_NORMAL
zh: We build our two Q-networks: the prediction network and the target Q-network. Notice how they have the same architecture definition, since they are the same network, with the target Q just having delayed parameter updates. Since we are learning to play Breakout from pure pixel input, our game state is an array of pixels. We pass this image through three convolutional layers and then two fully connected layers to produce a Q-value for each potential action.
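A hedged PyTorch sketch of such a network follows; the filter sizes and strides are the classic choices from the original DQN work and may not match the book's `[PRE7]` listing exactly:

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Map a stack of game frames to one Q-value per possible action."""

    def __init__(self, history_length: int = 4, num_actions: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(history_length, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 assumes 84x84 input frames
            nn.Linear(512, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x))
```

The target Q-network can then simply be a second instance of the same class whose parameters are periodically copied over from the prediction network.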
- en: Stacking Frames
id: totrans-206
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: Stacking Frames
- en: You may notice that our state input is actually of size `[None, self.history_length,
self.screen_height, self.screen_width]`. Remember, in order to model and capture
time-dependent state variables like speed, DQN uses not just one image, but a
@@ -2354,16 +2368,20 @@
id: totrans-207
prefs: []
type: TYPE_NORMAL
zh: You may notice that our state input is actually of size `[None, self.history_length, self.screen_height, self.screen_width]`. Remember, in order to model and capture time-dependent state variables such as speed, DQN uses not just one image but a group of consecutive images, also known as a *history*. Each of these consecutive images is treated as a separate channel. We construct these stacked frames with the helper function `process_state_into_stacked_frames(self,
frame, past_frames, past_state=None)`.
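As a rough illustration of what such a helper might do (the signature above is the book's; this body is only an assumption), the newest frame can be appended to a rolling window of past frames so that each frame ends up in its own channel:

```python
import numpy as np


def stack_frames(frame: np.ndarray, past_frames: np.ndarray) -> np.ndarray:
    """Shift the window of past frames by one and place the newest frame in the last channel.

    `past_frames` has shape [history_length, height, width]; the returned array has
    the same shape and can be fed to the Q-network as a single stacked state.
    """
    stacked = np.roll(past_frames, shift=-1, axis=0)  # shift window: the oldest frame wraps to the last slot
    stacked[-1] = frame                               # overwrite that slot with the newest frame
    return stacked
```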
- en: Setting Up Training Operations
id: totrans-208
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: Setting Up Training Operations
- en: 'Our loss function is derived from our objective expression from earlier in
this chapter:'
id: totrans-209
prefs: []
type: TYPE_NORMAL
zh: Our loss function is derived from our objective expression from earlier in this chapter:
- en: <math><mrow><msub><mo form="prefix" movablelimits="true">min</mo> <mi>θ</mi></msub>
<msub><mo>∑</mo> <mrow><mi>e</mi><mo>∈</mo><mi>E</mi></mrow></msub> <msubsup><mo>∑</mo>
<mrow><mi>t</mi><mo>=</mo><mn>0</mn></mrow> <mi>T</mi></msubsup> <mover accent="true"><mi>Q</mi>
@@ -2377,6 +2395,16 @@
id: totrans-210
prefs: []
type: TYPE_NORMAL
zh: <math><mrow><msub><mo form="prefix" movablelimits="true">min</mo> <mi>θ</mi></msub>
<msub><mo>∑</mo> <mrow><mi>e</mi><mo>∈</mo><mi>E</mi></mrow></msub> <msubsup><mo>∑</mo>
<mrow><mi>t</mi><mo>=</mo><mn>0</mn></mrow> <mi>T</mi></msubsup> <mover accent="true"><mi>Q</mi>
<mo>^</mo></mover> <mrow><mo>(</mo> <msub><mi>s</mi> <mi>t</mi></msub> <mo>,</mo>
<msub><mi>a</mi> <mi>t</mi></msub> <mo>|</mo> <mi>θ</mi> <mo>)</mo></mrow> <mo>-</mo>
<mfenced close=")" open="(" separators=""><msub><mi>r</mi> <mi>t</mi></msub> <mo>+</mo>
<mi>γ</mi> <msub><mo form="prefix" movablelimits="true">max</mo> <msup><mi>a</mi>
<mo>'</mo></msup></msub> <mover accent="true"><mi>Q</mi> <mo>^</mo></mover> <mrow><mo>(</mo><msub><mi>s</mi>
<mrow><mi>t</mi><mo>+</mo><mn>1</mn></mrow></msub> <mo>,</mo><msup><mi>a</mi>
<mo>'</mo></msup> <mo>|</mo><mi>θ</mi><mo>)</mo></mrow></mfenced></mrow></math>
- en: We want our prediction network to equal our target network, plus the return
at the current time step. We can express this in pure PyTorch code as the difference
between the output of our prediction network and the output of our target network.
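A hedged PyTorch sketch of that computation, assuming `prediction_net` and `target_net` are two copies of the Q-network and the batch tensors come from a replay buffer (all names here are illustrative, and squaring the difference into an MSE is a common choice rather than the book's literal expression):

```python
import torch
import torch.nn.functional as F


def dqn_loss(prediction_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Temporal-difference loss between predicted Q-values and bootstrapped targets."""
    # Q(s_t, a_t | theta) from the prediction network, for the actions actually taken
    q_pred = prediction_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r_t + gamma * max_a' Q_target(s_{t+1}, a'), with no gradient through the target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next  # `dones` is a 0/1 float mask

    # Mean squared difference between prediction and target over the batch
    return F.mse_loss(q_pred, q_target)
```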
