
Commit

Update codes
ZhiqingXiao committed Jun 23, 2024
1 parent 6999288 commit a1e8538
Showing 207 changed files with 734 additions and 461 deletions.
44 changes: 20 additions & 24 deletions README.md

Large diffs are not rendered by default.

101 changes: 50 additions & 51 deletions en2023/README.md → en2024/README.md

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions en2023/abbreviation.md → en2024/abbreviation.md
@@ -47,6 +47,7 @@
| HRL | Hierarchical Reinforcement Learning |
| IL | Imitation Learning |
| IQN | Implicit Quantile Networks |
+| IRL | Inverse Reinforcement Learning |
| JSD | Jensen-Shannon Divergence |
| KLD | Kullback–Leibler Divergence |
| MAB | Multi-Armed Bandit |
@@ -63,6 +64,7 @@
| OffPAC | Off-Policy Actor–Critic |
| OPDAC | Off-Policy Deterministic Actor–Critic |
| OU | Ornstein–Uhlenbeck |
+| PbRL | Preference-based Reinforcement Learning |
| PBVI | Point-Based Value Iteration |
| PDF | Probability Distribution Function |
| PER | Prioritized Experience Replay |
@@ -80,6 +82,7 @@
| ReLU | Rectified Linear Unit |
| RL | Reinforcement Learning |
| RLHF | Reinforcement Learning with Human Feedback |
+| RM | Reward Model |
| SAC | Soft Actor–Critic |
| SARSA | State-Action-Reward-State-Action |
| SGD | Stochastic Gradient Descent |
3 changes: 3 additions & 0 deletions en2023/abbreviation_zh.md → en2024/abbreviation_zh.md
@@ -47,6 +47,7 @@
| HRL | 分层强化学习 | Hierarchical Reinforcement Learning |
| IL | 模仿学习 | Imitation Learning |
| IQN | 含蓄分位网络 | Implicit Quantile Networks |
+| IRL | 逆强化学习 | Inverse Reinforcement Learning |
| JSD | Jensen-Shannon散度 | Jensen-Shannon Divergence |
| KLD | Kullback–Leibler散度 | Kullback–Leibler Divergence |
| MAB | 多臂赌博机 | Multi-Armed Bandit |
@@ -63,6 +64,7 @@
| OffPAC | 异策的执行者/评论者算法 | Off-Policy Actor–Critic |
| OPDAC | 异策确定性执行者/评论者算法 | Off-Policy Deterministic Actor–Critic |
| OU | Ornstein–Uhlenbeck过程 | Ornstein–Uhlenbeck |
+| PbRL | 偏好强化学习 | Preference-based Reinforcement Learning |
| PBVI | 点的价值迭代算法 | Point-Based Value Iteration |
| PDF | 概率分布函数 | Probability Distribution Function |
| PER | 优先经验回放 | Prioritized Experience Replay |
@@ -80,6 +82,7 @@
| ReLU | 修正线性单元 | Rectified Linear Unit |
| RL | 强化学习 | Reinforcement Learning |
| RLHF | 人类反馈强化学习 | Reinforcement Learning with Human Feedback |
+| RM | 奖励模型 | Reward Model |
| SAC | 柔性执行者/评论者算法 | Soft Actor–Critic |
| SARSA | 状态/动作/奖励/状态/动作 | State-Action-Reward-State-Action |
| SGD | 随机梯度下降 | Stochastic Gradient Descent |
2 changes: 2 additions & 0 deletions en2023/bibliography.md → en2024/bibliography.md
@@ -9,6 +9,7 @@
* Bellemare, M. G., Dabney, W., Munos, R. (2017). A distributional perspective on reinforcement learning. https://proceedings.mlr.press/v70/bellemare17a.html
* Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.
* Blum, J. R. (1954). Approximation methods which converge with probability one. https://doi.org/10.1214/aoms/1177728794
+* Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., Amodei, D. (2017). Deep reinforcement learning from human preferences. https://arxiv.org/abs/1706.03741
* Dabney, W., Ostrovski, G., Silver, D., Munos, R. (2018). Implicit quantile networks for distributional reinforcement learning. https://arxiv.org/abs/1806.06923
* Dabney, W., Rowland, M., Bellemare, M. G., Munos, R. (2018). Distributional reinforcement learning with quantile regression. https://ojs.aaai.org/index.php/AAAI/article/view/11791
* DeJong, G., Spong, M. W. (1994). Swinging up the Acrobot: an example of intelligent control. https://doi.org/10.1109/ACC.1994.752458
@@ -35,6 +36,7 @@
* Moore, A. W. (1990). Efficient Memory-based Learning for Robot Control. Ph.D. dissertation. Cambridge, UK: University of Cambridge.
* Nemirovski, A. S., Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley.
* Neumann, J. v., Morgenstern, O. (1953). Theory of Games and Economic Behavior. Princeton University Press.
+* Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., ..., Christiano, P. (2022). Training language models to follow instructions with human feedback. https://arxiv.org/abs/2203.02155
* Pavlov, I. P. (1928). Lectures on Conditioned Reflexes, Volume 1 (English translation). International Publishers.
* Robbins, H., Monro, S. (1951). A stochastic approximation algorithm. https://doi.org/10.1214/aoms/1177729586
* Rummery, G. A., Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University.
4 changes: 2 additions & 2 deletions en2023/choice.html → en2024/choice.html
@@ -23,7 +23,7 @@ <h1>Answers of Multiple Choices</h1>
<div class="item">Chapter 12: <span class="answer">ABCCCC</span></div>
<div class="item">Chapter 13: <span class="answer">AABCBB</span></div>
<div class="item">Chapter 14: <span class="answer">BCBABC</span></div>
-<div class="item">Chapter 15: <span class="answer">ACCB</span></div>
-<div class="item">Chapter 16: <span class="answer">CCACC</span></div>
+<div class="item">Chapter 15: <span class="answer">CCACC</span></div>
+<div class="item">Chapter 16: <span class="answer">ACCBAC</span></div>
</body>
</html>
