From baceeecae0e4d3111d02a0bdb08a2e18302c7701 Mon Sep 17 00:00:00 2001 From: Zhiqing Xiao Date: Sun, 1 Oct 2023 18:46:47 +0800 Subject: [PATCH] Update codes --- README.md | 2 +- en2023/code.md | 2 +- en2023/code_zh.md | 2 +- en2023/notation.html | 4 ++-- en2023/notation_zh.html | 4 ++-- zh2023/.DS_Store | Bin 6148 -> 6148 bytes zh2023/code.md | 2 +- zh2023/notation.html | 4 ++-- 8 files changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 9d8987c..cc66866 100644 --- a/README.md +++ b/README.md @@ -79,7 +79,7 @@ Note: - 兼容性好:所有代码在三大操作系统(Windows、macOS、Linux)上均可运行,书中给出了环境的安装和配置方法。深度强化学习代码还提供了 TensorFlow 和 PyTorch 对照代码。读者可任选其一。 - 硬件要求低:所有代码均可在没有 GPU 的个人计算机上运行。 -# 强化学习:原理与Python实现 (2018) +# 强化学习:原理与Python实现 (2019) **全球第一本配套 TensorFlow 2 代码的强化学习教程书** diff --git a/en2023/code.md b/en2023/code.md index 318cc38..19050fc 100644 --- a/en2023/code.md +++ b/en2023/code.md @@ -3,7 +3,7 @@ | \# | Caption | | --- | --- | | [Code 1-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Check the observation space and action space of the environment | -| [Code 1-2](https://zhiqingxiao.github.io/rl-book/en2023/codeMountainCar-v0_ClosedForm.html) | Closed-form agent for task `MountainCar-v0` | +| [Code 1-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Closed-form agent for task `MountainCar-v0` | | [Code 1-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Play an episode | | [Code 1-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Test the performance by playing 100 episodes | | [Code 1-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | Check the observation space and action space of the task `MountainCarContinuous-v0` | diff --git a/en2023/code_zh.md b/en2023/code_zh.md index 33e43c6..8df08d7 100644 --- a/en2023/code_zh.md +++ b/en2023/code_zh.md @@ -3,7 +3,7 @@ | \# | 代码内容 | | --- | --- | | [代码1-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 查看`MountainCar-v0`的观测空间和动作空间 | -| [代码1-2](https://zhiqingxiao.github.io/rl-book/en2023/codeMountainCar-v0_ClosedForm.html) | 根据指定确定性策略决定动作的智能体,用于`MountainCar-v0` | +| [代码1-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 根据指定确定性策略决定动作的智能体,用于`MountainCar-v0` | | [代码1-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 智能体和环境交互一个回合的代码 | | [代码1-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 运行100回合求平均以测试性能 | | [代码1-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | 查看`MountainCarContinuous-v0`的观测空间和动作空间 | diff --git a/en2023/notation.html b/en2023/notation.html index 540565f..cc52304 100644 --- a/en2023/notation.html +++ b/en2023/notation.html @@ -738,9 +738,9 @@ mjx-container[jax="SVG"] path[data-c], mjx-container[jax="SVG"] use[data-c] { stroke-width: 0; } -en2023notation.20230518 +Notations
Notation

General rules

  • Upper-case letters denote random events or random variables, while lower-case letters denote deterministic events or deterministic variables.
  • The serif typeface (such as X) denotes numerical values. The sans-serif typeface (such as X) denotes events in general, which may or may not be numerical.
  • Bold letters denote vectors (such as w) or matrices (such as F); matrices are always upper-case, even when they are deterministic.
  • Calligraphic letters (such as X) denote sets.
  • Fraktur letters (such as 𝔣) denote mappings.

Table

The notations below are used throughout the book. Occasionally, other notations defined locally take precedence.

| English Letters | Description |
| --- | --- |
| A, a | advantage |
| A, a | action |
| A | action space |
| B, b | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| B, b | belief in partially observable tasks |
| 𝔅π, 𝔟π | Bellman expectation operator of the policy π (upper case only used in distributional RL) |
| 𝔅∗, 𝔟∗ | Bellman optimal operator (upper case only used in distributional RL) |
| B | a batch of transitions sampled in experience replay; belief space in partially observable tasks |
| B+ | belief space with terminal belief in partially observable tasks |
| c | count; coefficients in linear programming |
| d, d | metrics |
| df | f-divergence |
| dKL | KL divergence |
| dJS | JS divergence |
| dTV | total variation |
| Dt | indicator of episode end |
| D | set of experience |
| e | eligibility trace |
| E | expectation |
| 𝔣 | a mapping |
| F | Fisher information matrix |
| G, g | return |
| g | gradient vector |
| h | action preference |
| H | entropy |
| k | index of iteration |
|  | loss |
| p | probability; dynamics |
| P | transition matrix |
| o | observation probability in partially observable tasks; infinitesimal in asymptotic notations |
| O, Õ | infinity in asymptotic notations |
| O, o | observation |
| Pr | probability |
| Q, q | action value |
| Qπ, qπ | action value of the policy π (upper case only used in distributional RL) |
| Q∗, q∗ | optimal action values (upper case only used in distributional RL) |
| q | vector representation of action values |
| R, r | reward |
| R | reward space |
| S, s | state |
| S | state space |
| S+ | state space with terminal state |
| T | number of steps in an episode |
| 𝔲 | belief update operator in partially observable tasks |
| U, u | TD target; (lower case only) upper bound |
| V, v | state value |
| Vπ, vπ | state value of the policy π (upper case only used in distributional RL) |
| V∗, v∗ | optimal state values (upper case only used in distributional RL) |
| v | vector representation of state values |
| Var | variance |
| w | parameters of the value function estimate |
| X, x | an event |
| X | event space |
| z | parameters for the eligibility trace |

| Greek Letters | Description |
| --- | --- |
| α | learning rate |
| β | reinforcement strength in eligibility traces; distortion function in distributional RL |
| γ | discount factor |
| Δ, δ | TD error |
| ε | parameters for exploration |
| λ | decay strength of the eligibility trace |
| Π, π | policy |
| π∗ | optimal policy |
| πE | expert policy in imitation learning |
| θ | parameters for policy function estimates |
| ϑ | threshold for value iteration |
| ρ | visitation frequency; importance sampling ratio in off-policy learning |
| ρ | vector representation of visitation frequency |
| τ, τ | sojourn time of an SMDP |
| T, 𝞽 | trajectory |
| Ω, ω | accumulated probability in distributional RL; (lower case only) conditional probability in partially observable tasks |
| Ψ | Generalized Advantage Estimate (GAE) |

| Other Notations | Description |
| --- | --- |
| =d | share the same distribution |
| =a.e. | equal almost everywhere |
| <, ≤, ≥, > | compare numbers; element-wise comparison |
| ≺, ⪯, ⪰, ≻ | partial order of policies |
| ≪ | absolutely continuous |
| ∅ | empty set |
| ∇ | gradient |
| ∼ | obeys a distribution |
| \|·\| | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
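As a quick illustration of how these conventions combine, the state and action values of a policy π can be written as below; these are standard identities given here for orientation, assumed to be consistent with the table rather than quoted from the book. Upper-case serif letters (St, At, Gt) are random variables and lower-case letters (s, a) are their realized values.

```latex
% Illustrative use of the notation above (standard definitions, assumed to
% match the book's conventions): upper-case S_t, A_t, G_t are random
% variables; lower-case s, a are deterministic values.
v_\pi(s)   = \mathrm{E}_\pi\left[ G_t \mid S_t = s \right]
q_\pi(s,a) = \mathrm{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
```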
\ No newline at end of file diff --git a/en2023/notation_zh.html b/en2023/notation_zh.html index af5579a..0e697f7 100644 --- a/en2023/notation_zh.html +++ b/en2023/notation_zh.html @@ -738,9 +738,9 @@ mjx-container[jax="SVG"] path[data-c], mjx-container[jax="SVG"] use[data-c] { stroke-width: 0; } -en2023notation_zh.20230518 +《强化学习:原理与Python实现》数学记号
Mathematical notation for 《强化学习:原理与Python实现》

General rules

  • Upper-case letters denote random events or random variables; lower-case letters denote deterministic events or deterministic variables.
  • Serif letters (e.g. in Times New Roman, such as X) denote numerical values; sans-serif letters (e.g. in Open Sans, such as X) are not necessarily numerical.
  • Bold letters denote vectors (such as w) or matrices (such as F); matrices are always upper-case, even when they are deterministic.
  • Calligraphic letters (such as X) denote sets.
  • Fraktur letters (such as 𝔣) denote mappings.

Notation table

The table below lists the commonly used notations. Where a section defines a notation locally, that local definition takes precedence.

| English Letters | Description |
| --- | --- |
| A, a | advantage |
| A, a | action |
| A | action space |
| B, b | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| B, b | belief in partially observable tasks |
| 𝔅π, 𝔟π | Bellman expectation operator of the policy π (upper case only used in distributional RL) |
| 𝔅∗, 𝔟∗ | Bellman optimal operator (upper case only used in distributional RL) |
| B | a batch of transitions sampled in experience replay; belief space in partially observable tasks |
| B+ | belief space with terminal belief in partially observable tasks |
| c | count; coefficients in linear programming |
| d, d | metrics |
| df | f-divergence |
| dKL | KL divergence |
| dJS | JS divergence |
| dTV | total variation |
| Dt | indicator of episode end |
| D | set of experience |
| e | eligibility trace |
| E | expectation |
| 𝔣 | a mapping |
| F | Fisher information matrix |
| G, g | return |
| g | gradient vector |
| h | action preference |
| H | entropy |
| k | index of iteration |
|  | loss |
| p | probability; dynamics |
| P | transition matrix |
| o | observation probability in partially observable tasks; infinitesimal in asymptotic notations |
| O, Õ | infinity in asymptotic notations |
| O, o | observation |
| Pr | probability |
| Q, q | action value |
| Qπ, qπ | action value of the policy π (upper case only used in distributional RL) |
| Q∗, q∗ | optimal action values (upper case only used in distributional RL) |
| q | vector representation of action values |
| R, r | reward |
| R | reward space |
| S, s | state |
| S | state space |
| S+ | state space with terminal state |
| T | number of steps in an episode |
| 𝔲 | belief update operator in partially observable tasks |
| U, u | TD target; (lower case only) upper bound |
| V, v | state value |
| Vπ, vπ | state value of the policy π (upper case only used in distributional RL) |
| V∗, v∗ | optimal state values (upper case only used in distributional RL) |
| v | vector representation of state values |
| Var | variance |
| w | parameters of the value function estimate |
| X, x | an event |
| X | event space |
| z | parameters for the eligibility trace |

| Greek Letters | Description |
| --- | --- |
| α | learning rate |
| β | reinforcement strength in eligibility traces; distortion function in distributional RL |
| γ | discount factor |
| Δ, δ | TD error |
| ε | parameters for exploration |
| λ | decay strength of the eligibility trace |
| Π, π | policy |
| π∗ | optimal policy |
| πE | expert policy in imitation learning |
| θ | parameters for policy function estimates |
| ϑ | threshold for value iteration |
| ρ | visitation frequency; importance sampling ratio in off-policy learning |
| ρ | vector representation of visitation frequency |
| τ, τ | sojourn time of an SMDP |
| T, 𝞽 | trajectory |
| Ω, ω | accumulated probability in distributional RL; (lower case only) conditional probability in partially observable tasks |
| Ψ | Generalized Advantage Estimate (GAE) |

| Other Notations | Description |
| --- | --- |
| =d | share the same distribution |
| =a.e. | equal almost everywhere |
| <, ≤, ≥, > | compare numbers; element-wise comparison |
| ≺, ⪯, ⪰, ≻ | partial order of policies |
| ≪ | absolutely continuous |
| ∅ | empty set |
| ∇ | gradient |
| ∼ | obeys a distribution |
| \|·\| | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
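To make the operator rows above concrete, the sketch below spells out one common textbook form of the Bellman expectation operator 𝔟π acting on a state-value function; the book's exact formulation may differ, so treat this as an assumption rather than a quotation.

```latex
% One common form of the Bellman expectation operator b_pi acting on a
% state-value function v (a sketch; assumed, not quoted from the book).
(\mathfrak{b}_\pi v)(s)
  = \sum_{a \in \mathcal{A}} \pi(a \mid s)
    \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v(s') \bigr]
```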

 

\ No newline at end of file
diff --git a/zh2023/.DS_Store b/zh2023/.DS_Store
index 60371a2fce30f889f497cb34ad65dfa32fe47c2d..05036988d52daaebee51ae2becb95605da1ec676 100644
GIT binary patch
(binary delta omitted)
diff --git a/zh2023/code.md b/zh2023/code.md
index 33e43c6..8df08d7 100644
--- a/zh2023/code.md
+++ b/zh2023/code.md
@@ -3,7 +3,7 @@
 | \# | 代码内容 |
 | --- | --- |
 | [代码1-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 查看`MountainCar-v0`的观测空间和动作空间 |
-| [代码1-2](https://zhiqingxiao.github.io/rl-book/en2023/codeMountainCar-v0_ClosedForm.html) | 根据指定确定性策略决定动作的智能体,用于`MountainCar-v0` |
+| [代码1-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 根据指定确定性策略决定动作的智能体,用于`MountainCar-v0` |
 | [代码1-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 智能体和环境交互一个回合的代码 |
 | [代码1-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 运行100回合求平均以测试性能 |
 | [代码1-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | 查看`MountainCarContinuous-v0`的观测空间和动作空间 |
diff --git a/zh2023/notation.html b/zh2023/notation.html
index 9246024..a8db2cf 100644
--- a/zh2023/notation.html
+++ b/zh2023/notation.html
@@ -738,9 +738,9 @@ mjx-container[jax="SVG"] path[data-c], mjx-container[jax="SVG"] use[data-c] {
   stroke-width: 0;
 }
-zh2023notation.20230518
+《强化学习:原理与Python实战》数学记号
Mathematical notation for 《强化学习:原理与Python实战》

General rules

  • Upper-case letters denote random events or random variables; lower-case letters denote deterministic events or deterministic variables.
  • Serif letters (such as X) denote numerical values; sans-serif letters (such as X) are not necessarily numerical. Operators that compute statistics with respect to probability (including E, Pr, Var, and H) are set upright rather than in italics.
  • Bold letters denote vectors (such as w) or matrices (such as F); matrices are always upper-case, even when they are deterministic.
  • Calligraphic letters (such as X) denote sets.
  • Fraktur letters (such as 𝔣) denote mappings.

Notation table

The table below lists the commonly used notations. Where a section defines a notation locally, that local definition takes precedence.

| English Letters | Description |
| --- | --- |
| A, a | advantage |
| A, a | action |
| A | action space |
| B, b | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| B, b | belief in partially observable tasks |
| 𝔅π, 𝔟π | Bellman expectation operator of the policy π (upper case only used in distributional RL) |
| 𝔅∗, 𝔟∗ | Bellman optimal operator (upper case only used in distributional RL) |
| B | a batch of transitions sampled in experience replay; belief space in partially observable tasks |
| B+ | belief space with terminal belief in partially observable tasks |
| c | count; coefficients in linear programming |
| d, d | metrics |
| df | f-divergence |
| dKL | KL divergence |
| dJS | JS divergence |
| dTV | total variation |
| D | indicator of episode end |
| D | set of experience |
| e | eligibility trace |
| e | the constant e (2.72) |
| E | expectation |
| 𝔣 | a mapping |
| F | Fisher information matrix |
| G, g | return |
| g | gradient vector |
| h | action preference |
| H | entropy |
| k | index of iteration |
|  | loss |
| p | probability; dynamics |
| P | transition matrix |
| o | observation probability in partially observable tasks; infinitesimal in asymptotic notations |
| O, Õ | infinity in asymptotic notations |
| O, o | observation |
| Pr | probability |
| Q, q | action value |
| Qπ, qπ | action value of the policy π (upper case only used in distributional RL) |
| Q∗, q∗ | optimal action values (upper case only used in distributional RL) |
| q | vector representation of action values |
| R, r | reward |
| R | reward space |
| S, s | state |
| S | state space |
| S+ | state space with terminal state |
| T | number of steps in an episode |
| T, t | trajectory |
| 𝔲 | belief update operator in partially observable tasks |
| U, u | TD target; (lower case only) upper confidence bound |
| V, v | state value |
| Vπ, vπ | state value of the policy π (upper case only used in distributional RL) |
| V∗, v∗ | optimal state values (upper case only used in distributional RL) |
| v | vector representation of state values |
| Var | variance |
| w | parameters of the value function estimate |
| X, x | an event |
| X | event space |
| z | parameters for the eligibility trace |

| Greek Letters | Description |
| --- | --- |
| α | learning rate |
| β | reinforcement strength in eligibility traces; distortion function in distributional RL |
| γ | discount factor |
| Δ, δ | TD error |
| ε | parameters for exploration |
| λ | decay strength of the eligibility trace |
| Π, π | policy |
| π∗ | optimal policy |
| π | the constant π (3.14) |
| θ | parameters for policy function estimates |
| ϑ | threshold for value iteration |
| ρ | visitation frequency; importance sampling ratio in off-policy learning |
| ρ | vector representation of visitation frequency |
| τ, τ | sojourn time of an SMDP |
| Ω, ω | accumulated probability in distributional RL; (lower case only) conditional probability in partially observable tasks |
| Ψ | Generalized Advantage Estimate (GAE) |

| Other Notations | Description |
| --- | --- |
| =d | share the same distribution |
| =a.e. | equal almost everywhere |
| <, ≤, ≥, > | compare numbers; element-wise comparison |
| ≺, ⪯, ⪰, ≻ | partial order of policies |
| ≪ | absolutely continuous |
| ∅ | empty set |
| ∇ | gradient |
| ∼ | obeys a distribution |
| \|·\| | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
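As an example of how several of the symbols above fit together, the one-step TD target U and TD error δ with a parameterized state-value estimate v(·; w) can be written as follows; these are the standard definitions, assumed to be consistent with the table rather than taken verbatim from the book.

```latex
% One-step TD target U_t and TD error delta_t with value parameters w
% (standard definitions; assumed consistent with the book's notation).
U_t      = R_{t+1} + \gamma \, v(S_{t+1}; \mathbf{w})
\delta_t = U_t - v(S_t; \mathbf{w})
```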

 

\ No newline at end of file