
Commit

Update codes
ZhiqingXiao committed Jun 23, 2024
1 parent 6999288 commit a1e8538
Showing 207 changed files with 734 additions and 461 deletions.
44 changes: 20 additions & 24 deletions README.md

Large diffs are not rendered by default.

101 changes: 50 additions & 51 deletions en2023/README.md → en2024/README.md

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions en2023/abbreviation.md → en2024/abbreviation.md
@@ -47,6 +47,7 @@
| HRL | Hierarchical Reinforcement Learning |
| IL | Imitation Learning |
| IQN | Implicit Quantile Networks |
+| IRL | Inverse Reinforcement Learning |
| JSD | Jensen-Shannon Divergence |
| KLD | Kullback–Leibler Divergence |
| MAB | Multi-Armed Bandit |
@@ -63,6 +64,7 @@
| OffPAC | Off-Policy Actor–Critic |
| OPDAC | Off-Policy Deterministic Actor–Critic |
| OU | Ornstein–Uhlenbeck |
+| PbRL | Preference-based Reinforcement Learning |
| PBVI | Point-Based Value Iteration |
| PDF | Probability Distribution Function |
| PER | Prioritized Experience Replay |
@@ -80,6 +82,7 @@
| ReLU | Rectified Linear Unit |
| RL | Reinforcement Learning |
| RLHF | Reinforcement Learning with Human Feedback |
+| RM | Reward Model |
| SAC | Soft Actor–Critic |
| SARSA | State-Action-Reward-State-Action |
| SGD | Stochastic Gradient Descent |
3 changes: 3 additions & 0 deletions en2023/abbreviation_zh.md → en2024/abbreviation_zh.md
@@ -47,6 +47,7 @@
| HRL | 分层强化学习 | Hierarchical Reinforcement Learning |
| IL | 模仿学习 | Imitation Learning |
| IQN | 含蓄分位网络 | Implicit Quantile Networks |
+| IRL | 逆强化学习 | Inverse Reinforcement Learning |
| JSD | Jensen-Shannon散度 | Jensen-Shannon Divergence |
| KLD | Kullback–Leibler散度 | Kullback–Leibler Divergence |
| MAB | 多臂赌博机 | Multi-Armed Bandit |
@@ -63,6 +64,7 @@
| OffPAC | 异策的执行者/评论者算法 | Off-Policy Actor–Critic |
| OPDAC | 异策确定性执行者/评论者算法 | Off-Policy Deterministic Actor–Critic |
| OU | Ornstein–Uhlenbeck过程 | Ornstein–Uhlenbeck |
+| PbRL | 偏好强化学习 | Preference-based Reinforcement Learning |
| PBVI | 点的价值迭代算法 | Point-Based Value Iteration |
| PDF | 概率分布函数 | Probability Distribution Function |
| PER | 优先经验回放 | Prioritized Experience Replay |
@@ -80,6 +82,7 @@
| ReLU | 修正线性单元 | Rectified Linear Unit |
| RL | 强化学习 | Reinforcement Learning |
| RLHF | 人类反馈强化学习 | Reinforcement Learning with Human Feedback |
+| RM | 奖励模型 | Reward Model |
| SAC | 柔性执行者/评论者算法 | Soft Actor–Critic |
| SARSA | 状态/动作/奖励/状态/动作 | State-Action-Reward-State-Action |
| SGD | 随机梯度下降 | Stochastic Gradient Descent |
2 changes: 2 additions & 0 deletions en2023/bibliography.md → en2024/bibliography.md
@@ -9,6 +9,7 @@
* Bellemare, M. G., Dabney, W., Munos, R. (2017). A distributional perspective on reinforcement learning. https://proceedings.mlr.press/v70/bellemare17a.html
* Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.
* Blum, J. R. (1954). Approximation methods which converge with probability one. https://doi.org/10.1214/aoms/1177728794
+* Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., Amodei, D. (2017). Deep reinforcement learning from human preferences. https://arxiv.org/abs/1706.03741
* Dabney, W., Ostrovski, G., Silver, D., Munos, R. (2018). Implicit quantile networks for distributional reinforcement learning. https://arxiv.org/abs/1806.06923
* Dabney, W., Rowland, M., Bellemare, M. G., Munos, R. (2018). Distributional reinforcement learning with quantile regression. https://ojs.aaai.org/index.php/AAAI/article/view/11791
* DeJong, G., Spong, M. W. (1994). Swinging up the Acrobot: an example of intelligent control. https://doi.org/10.1109/ACC.1994.752458
@@ -35,6 +36,7 @@
* Moore, A. W. (1990). Efficient Memory-based Learning for Robot Control. Ph.D. dissertation. Cambridge, UK: University of Cambridge.
* Nemirovski, A. S., Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley.
* Neumann, J. v., Morgenstern, O. (1953). Theory of Games and Economic Behavior. Princeton University Press.
+* Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., ..., Christiano, P. (2022). Training language models to follow instructions with human feedback. https://arxiv.org/abs/2203.02155
* Pavlov, I. P. (1928). Lectures on Conditioned Reflexes, Volume 1 (English translation). International Publishers.
* Robbins, H., Monro, S. (1951). A stochastic approximation algorithm. https://doi.org/10.1214/aoms/1177729586
* Rummery, G. A., Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University.
4 changes: 2 additions & 2 deletions en2023/choice.html → en2024/choice.html
@@ -23,7 +23,7 @@ <h1>Answers of Multiple Choices</h1>
<div class="item">Chapter 12: <span class="answer">ABCCCC</span></div>
<div class="item">Chapter 13: <span class="answer">AABCBB</span></div>
<div class="item">Chapter 14: <span class="answer">BCBABC</span></div>
-<div class="item">Chapter 15: <span class="answer">ACCB</span></div>
-<div class="item">Chapter 16: <span class="answer">CCACC</span></div>
+<div class="item">Chapter 15: <span class="answer">CCACC</span></div>
+<div class="item">Chapter 16: <span class="answer">ACCBAC</span></div>
</body>
</html>
