diff --git a/README.md b/README.md
index b802296..940a392 100644
--- a/README.md
+++ b/README.md
@@ -3,9 +3,9 @@

**The First Reinforcement Learning Tutorial Book with One-to-One Mapping between TensorFlow 2 and PyTorch 1&2 Implementations**

-| [English Edition](https://github.com/ZhiqingXiao/rl-book/tree/master/en2023) | [中文版](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2023) | [中文2019版](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2019) |
+| [English Edition](https://github.com/ZhiqingXiao/rl-book/tree/master/en2024) | [中文版](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2023) | [中文2019版](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2019) |
| :---: | :---: | :---: |
-| [![Book](https://zhiqingxiao.github.io/rl-book/en2023/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/en2023) | [![Book](https://zhiqingxiao.github.io/rl-book/zh2023/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2023) | [![Book](https://zhiqingxiao.github.io/rl-book/zh2019/resource/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2019) |
+| [![Book](https://zhiqingxiao.github.io/rl-book/en2024/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/en2024) | [![Book](https://zhiqingxiao.github.io/rl-book/zh2023/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2023) | [![Book](https://zhiqingxiao.github.io/rl-book/zh2019/resource/cover.jpg)](https://github.com/ZhiqingXiao/rl-book/tree/master/zh2019) |

Please email me if you are interested in publishing this book in other languages.

@@ -19,7 +19,7 @@ This is a tutorial book on reinforcement learning, with explanation of theory an

### Supporting contents for English version

-Check [here](https://github.com/ZhiqingXiao/rl-book/tree/master/en2023) for codes, exercise answers, etc.
+Check [here](https://github.com/ZhiqingXiao/rl-book/tree/master/en2024) for codes, exercise answers, etc.
### Table of Codes @@ -27,27 +27,23 @@ All codes have been saved as a .ipynb file and a .html file in the same director | Chapter | Environment & Closed-Form Policy | Agent | | :--- | :--- | :--- | -| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | -| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | -| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | -| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) | -| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | -| 7 | [CartPole-0](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | -| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) 
[torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | -| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | -| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | -| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | -| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | -| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | -| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | 
[UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | -| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | -| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | -| 15 note | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | -| 16 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) - - -Note: -1. It does not work with Gym >=0.25 and PyBullet 3.2.4. It is because Gym 0.25 changed `metadata["render.modes"]` to `metadata["render_modes"]`, but PyBullet releases have not updated accordingly yet. +| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | +| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | +| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | +| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) | +| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | +| 7 | 
[CartPole-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) |
+| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) |
+| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) |
+| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) |
+| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA 
[tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) | +| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) | +| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | +| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | +| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | +| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | +| 15 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) +| 16 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | # 强化学习:原理与Python实战 diff --git a/en2023/README.md b/en2024/README.md similarity index 54% rename from en2023/README.md rename to en2024/README.md index e1b5d39..a5ec59e 100644 --- a/en2023/README.md +++ b/en2024/README.md @@ -36,41 +36,41 @@ All chapters are accompanied with Python codes. 12. Distributional RL 13. Minimize Regret 14. Tree Search -15. IL: Imitation Learning -16. More Agent-Environment Interface +15. More Agent-Environment Interface +16. 
Learning from Feedback and Imitation Learning ### Resources -- Reference answers of multiple choices: [link](https://zhiqingxiao.github.io/rl-book/en2023/choice.html) -- Guide to set up developing environment: [Windows](https://github.com/ZhiqingXiao/rl-book/blob/master/en2023/setup/setupwin.md) [macOS](https://github.com/ZhiqingXiao/rl-book/blob/master/en2023/setup/setupmac.md) -- Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2023/notation.md) -- Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2023/abbreviation.md) -- Gym Internal: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2023/gym.md) -- Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2023/bibliography.md) +- Reference answers of multiple choices: [link](https://zhiqingxiao.github.io/rl-book/en2024/choice.html) +- Guide to set up developing environment: [Windows](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupwin.md) [macOS](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/setup/setupmac.md) +- Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/notation.md) +- Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/abbreviation.md) +- Gym Internal: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/gym.md) +- Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/bibliography.md) ### Table of Codes -List view: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2023/code.md) +List view: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/code.md) | Chapter | Environment & Closed-Form Policy | Agent | | :--- | :--- | :--- | -| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | -| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | -| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | -| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) | -| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) 
[torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | -| 7 | [CartPole-0](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | -| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | -| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | -| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html), SACwA 
[tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | -| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | -| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | -| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | -| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | -| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | -| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | -| 15 note | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | -| 16 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) +| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | +| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | +| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | +| 5 | 
[Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) |
+| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) |
+| 7 | [CartPole-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) |
+| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html), OffPAC 
[tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) | +| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) | +| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) | +| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) | +| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) | +| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | +| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | +| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | +| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | +| 15 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) +| 16 note | 
[HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) |

Note:

@@ -79,10 +79,10 @@ Note:

### BibTeX

-    @book{xiao2023,
+    @book{xiao2024,
      title = {Reinforcement Learning: Theory and {Python} Implementation},
      author = {Zhiqing Xiao},
-     year = 2023,
+     year = 2024,
      publisher = {Springer Nature},
    }

@@ -109,24 +109,23 @@

| Chapter | Environment & Closed-Form Policy | Agent |
| :--- | :--- | :--- |
-| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) |
-| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) |
-| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) |
-| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) |
-| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) |
-| 7 | [CartPole-0](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline 
[tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | -| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | -| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | -| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | -| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | -| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | -| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) 
[torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | -| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv_demo.html) | -| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | -| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | -| 15 注 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | -| 16 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) - +| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | +| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | +| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | +| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) | +| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html), DoubleDQN 
[tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) |
+| 7 | [CartPole-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) |
+| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) |
+| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) |
+| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) 
[torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) |
+| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) |
+| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) |
+| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) |
+| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv_demo.html) |
+| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) |
+| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) |
+| 15 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html)
+| 16 note | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) |

Note:
1. 
This case does not work with Gym >=0.25 together with PyBullet 3.2.4. The reason: Gym 0.25 renamed `metadata["render.modes"]` to `metadata["render_modes"]`, but PyBullet releases have not been updated accordingly. (A possible workaround is sketched after the FAQ below.)

@@ -134,12 +133,12 @@

### Bilingual resources (Chinese and English)

-- Reference answers of multiple choices: [link](https://zhiqingxiao.github.io/rl-book/en2023/choice.html)
+- Reference answers of multiple choices: [link](https://zhiqingxiao.github.io/rl-book/en2024/choice.html)
- Guide to set up developing environment: [Windows](https://github.com/ZhiqingXiao/rl-book/blob/master/zh2023/setup/setupwin.md) [macOS](https://github.com/ZhiqingXiao/rl-book/blob/master/zh2023/setup/setupmac.md)
-- Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2023/notation_zh.md)
-- Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2023/abbreviation_zh.md)
+- Table of notations: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/en2024/notation_zh.md)
+- Table of abbreviations: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/abbreviation_zh.md)
- Gym Internal: [link](https://github.com/ZhiqingXiao/rl-book/blob/master/zh2023/gym.md)
-- Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2023/bibliography.md)
+- Bibliography: [link](https://github.com/zhiqingxiao/rl-book/blob/master/en2024/bibliography.md)

**QQ group**

@@ -147,7 +146,7 @@

**FAQ**

-- Q: Installing TensorFlow or PyTorch on Windows fails. A: Please install Visual Studio 2022 on Windows 10/11 (completely uninstall any older version of Visual Studio first). Please read the book's [guide to setting up the development environment](https://zhiqingxiao.github.io/rl-book/en2023/setupwin_zh.html). For more details and other installation issues, please search on Google. For PyTorch installation, see: https://mp.weixin.qq.com/s/uRx1XOPrfFOdMlRU6I-eyA
+- Q: Installing TensorFlow or PyTorch on Windows fails. A: Please install Visual Studio 2022 on Windows 10/11 (completely uninstall any older version of Visual Studio first). Please read the book's [guide to setting up the development environment](https://zhiqingxiao.github.io/rl-book/en2024/setupwin_zh.html). For more details and other installation issues, please search on Google. For PyTorch installation, see: https://mp.weixin.qq.com/s/uRx1XOPrfFOdMlRU6I-eyA

- Q: Running the code in Visual Studio, Visual Studio Code, or PyCharm fails, for example the function `display()` cannot be found. A: The code in this repo targets the Jupyter Notebook environment and can only run inside Jupyter Notebook. We recommend installing the latest Anaconda and running the downloaded notebooks directly. (The function `display()` is available only inside Jupyter Notebook.) There is no need to install Visual Studio Code or PyCharm. For more details or other errors, please search on Google.
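The note above names the Gym/PyBullet incompatibility but no workaround. Below is a minimal, untested sketch (our illustration only, not code from the book or from PyBullet): it assumes PyBullet 3.2.4 still declares the legacy `render.modes` key on its environment classes (here reached through the assumed module path `pybullet_envs.gym_locomotion_envs`) and aliases that entry to the `render_modes` key that Gym >=0.25 reads. If the assumption does not hold for your installed versions, pinning `gym<0.25` alongside `pybullet==3.2.4` avoids the issue entirely.

```python
# Hypothetical compatibility shim -- a sketch, not the book's own code.
# Assumption: PyBullet 3.2.4 defines metadata["render.modes"], while
# Gym >= 0.25 looks up metadata["render_modes"] instead.
import gym
import pybullet_envs.gym_locomotion_envs as bullet_envs  # importing also registers the Bullet env ids

cls = bullet_envs.HumanoidBulletEnv
if "render.modes" in cls.metadata and "render_modes" not in cls.metadata:
    # Copy the entry rather than renaming it, so older Gym versions keep working.
    cls.metadata = {**cls.metadata, "render_modes": cls.metadata["render.modes"]}

env = gym.make("HumanoidBulletEnv-v0")  # the render-mode metadata check should now pass
```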
| Kullback–Leibler散度 | Kullback–Leibler Divergence | | MAB | 多臂赌博机 | Multi-Arm Bandit | @@ -63,6 +64,7 @@ | OffPAC | 异策的执行者/评论者算法 | Off-Policy Actor–Critic | | OPDAC | 异策确定性执行者/评论者算法 | Off-Policy Deterministic Actor–Critic | | OU | Ornstein Uhlenbeck过程 | Ornstein Uhlenbeck | +| PbRL | 偏好强化学习 | Preference-based Reinforcement Learning | | PBVI | 点的价值迭代算法 | Point-Based Value Iteration | | PDF | 概率分布函数 | Probability Distribution Function | | PER | 优先经验回放 | Prioritized Experience Replay | @@ -80,6 +82,7 @@ | ReLU | 修正线性单元 | Rectified Linear Unit | | RL | 强化学习 | Reinforcement Learning | | RLHF | 人类反馈强化学习 | Reinforcement Learning with Human Feedback | +| RM | 奖励模型 | Reward Model | | SAC | 柔性执行者/评论者算法 | Soft Actor–Critic | | SARSA | 状态/动作/奖励/状态/动作 | State-Action-Reward-State-Action | | SGD | 随机梯度下降 | Stochastic Gradient Descent | diff --git a/en2023/bibliography.md b/en2024/bibliography.md similarity index 96% rename from en2023/bibliography.md rename to en2024/bibliography.md index 02526e3..2cc40b0 100644 --- a/en2023/bibliography.md +++ b/en2024/bibliography.md @@ -9,6 +9,7 @@ * Bellemare, M. G., Dabney, W., Munos, R. (2017). A distributional perspective on reinforcement learning. https://proceedings.mlr.press/v70/bellemare17a.html * Bellman, R. E. (1957). Dynamic Programming. Princeton University Press. * Blum, J. R. (1954). Approximation methods which converge with probability one. https://doi.org/10.1214/aoms/1177728794 +* Christina, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. https://arxiv.org/abs/1706.03741 * Dabney, W., Ostrovski, G., Silver, D., Munos, R. (2018). Implicit quantile networks for distributional reinforcement learning. https://arxiv.org/abs/1806.06923 * Dabney, W., Rowland, M., Bellemare, M. G., Munos, R. (2018). Distributional reinforcement learning with quantile regression. https://ojs.aaai.org/index.php/AAAI/article/view/11791 * DeJong, G., Spong, M. W. (1994). Swinging up the Acrobot: an example of intelligent control. https://doi.org/10.1109/ACC.1994.752458 @@ -35,6 +36,7 @@ * Moore, A. W. (1990). Efficient Memory-based Learning for Robot Control. Ph.D. dissertation. Cambridge, UK: University of Cambridge. * Nemirovski, A. S., Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. Wiley. * Neumann, J. v., Morgenstern, O. (1953). Theory of Games and Economic Behavior. Princeton University Press. +* Ouyang, L., Wu, J., Jing, X., Almeida, D., Wainwright, C. L., ..., Christiano, P., (2022). Training language models to follow instructions with human feedback. https://arxiv.org/abs/2203.02155 * Pavlov, I. P. (1928). Lectures on Conditioned Reflexes, Volume 1 (English translation). International Publishers. * Robbins, H., Monro, S. (1951). A stochastic approximation algorithm. https://doi.org/10.1214/aoms/1177729586 * Rummery, G. A., Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University. diff --git a/en2023/choice.html b/en2024/choice.html similarity index 91% rename from en2023/choice.html rename to en2024/choice.html index 94dccea..03103a8 100644 --- a/en2023/choice.html +++ b/en2024/choice.html @@ -23,7 +23,7 @@
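Following up on the Gym/PyBullet incompatibility noted at the top of this section, here is a minimal workaround sketch. It is an illustration, not code from the book: it assumes PyBullet 3.2.x still exposes the environment class at `pybullet_envs.gym_locomotion_envs.HumanoidBulletEnv` and still declares the legacy `render.modes` key; pinning `gym<0.25` remains the simpler alternative.

```python
# Workaround sketch (assumption: class path and metadata layout as in
# PyBullet 3.2.x; not an official fix). Gym >= 0.25 looks up the key
# "render_modes", while PyBullet still declares "render.modes", so we
# alias the old key to the new one before creating the environment.
import gym
import pybullet_envs  # noqa: F401  (the import registers the *BulletEnv-v0 tasks)
from pybullet_envs.gym_locomotion_envs import HumanoidBulletEnv

meta = dict(HumanoidBulletEnv.metadata)  # copy: the dict is shared with base classes
if "render.modes" in meta and "render_modes" not in meta:
    meta["render_modes"] = meta["render.modes"]
HumanoidBulletEnv.metadata = meta

env = gym.make("HumanoidBulletEnv-v0")
```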
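Related to the `display()` FAQ above, a small portability shim (again a sketch, assuming the IPython package is installed): inside Jupyter Notebook `display()` is injected as a builtin, but the same symbol can be imported explicitly so the notebooks' code also runs as a plain script.

```python
# Portability shim (illustration only): make display() available outside
# Jupyter Notebook, where it is not injected as a builtin.
try:
    display  # defined automatically inside Jupyter Notebook
except NameError:
    from IPython.display import display  # fallback for plain Python scripts

display("hello")  # now works in both environments
```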

Answers of Multiple Choices

Chapter 12: ABCCCC
Chapter 13: AABCBB
Chapter 14: BCBABC
-Chapter 15: ACCB
-Chapter 16: CCACC
+Chapter 15: CCACC
+Chapter 16: ACCBAC
\ No newline at end of file diff --git a/en2023/code.md b/en2024/code.md similarity index 54% rename from en2023/code.md rename to en2024/code.md index 19050fc..8f75878 100644 --- a/en2023/code.md +++ b/en2024/code.md @@ -2,139 +2,139 @@ | \# | Caption | | --- | --- | -| [Code 1-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Check the observation space and action space of the environment | -| [Code 1-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Closed-form agent for task `MountainCar-v0` | -| [Code 1-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Play an episode | -| [Code 1-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | Test the performance by playing 100 episodes | -| [Code 1-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | Check the observation space and action space of the task `MountainCarContinuous-v0` | -| [Code 1-6](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | Closed-form agent for task `MountainCarContinous-v0` | -| [Code 2-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HungryFull_demo.html) | Use the example Bellman expectation equation | -| [Code 2-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HungryFull_demo.html) | Example to solve Bellman optimal equation | -| [Code 2-3](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | Import the environment `CliffWalking-v0` and check its information | -| [Code 2-4](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | Find states values and action values using Bellman expectation equations | -| [Code 2-5](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | Find optimal values using LP method | -| [Code 2-6](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | Find an optimal deterministic policy from optimal action values | -| [Code 3-1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Check the metadata of `FrozenLake-v1` | -| [Code 3-2](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Play an episode using the policy | -| [Code 3-3](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Calculate the episode rewards of the random policy | -| [Code 3-4](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Implementation of Policy Evaluation | -| [Code 3-5](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Evaluate the random policy | -| [Code 3-6](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Policy improvement | -| [Code 3-7](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Improve the random policy | -| [Code 3-8](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Policy iteration | -| [Code 3-9](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Use policy iteration to find the optimal policy and test it | -| [Code 3-10](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | VI | -| [Code 3-11](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | Find the optimal policy using the value iteration algorithm | -| [Code 
4-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | Play an episode | -| [Code 4-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | On-Policy MC evaluation | -| [Code 4-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | Visualize a 3-dimension np.array, which can be indexed by a state | -| [Code 4-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy MC update with exploring start | -| [Code 4-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | MC update with soft policy | -| [Code 4-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | Policy evaluation based on importance sampling | -| [Code 4-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | Importance sampling policy optimization with soft policy | -| [Code 5-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | Initialize and visualize the task | -| [Code 5-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | SARSA agent | -| [Code 5-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | Train the agent | -| [Code 5-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html) | Expected SARSA agent | -| [Code 5-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html) | Q Learning agent | -| [Code 5-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html) | Double Q Learning agent | -| [Code 5-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) | SARSA $(\lambda)$ agent | -| [Code 6-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | Import the environment of `MountainCar-v0` | -| [Code 6-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | The agent that always pushes right | -| [Code 6-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | Tile coding | -| [Code 6-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | SARSA agent with function approximation | -| [Code 6-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html)| SARSA $(\lambda)$ agent with function approximation | -| [Code 6-6](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) | Experience replayer | -| [Code 6-7](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) | DQN agent with target network (with TensorFlow) | -| [Code 6-8](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html) | DQN agent with target network (with PyTorch) | -| [Code 6-9](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) | Double DQN agent (with TensorFlow) | -| [Code 6-10](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html) | Double DQN agent (with PyTorch) | -| [Code 6-11](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) | Dueling network (with TensorFlow) | -| [Code 6-12](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | Dueling network (with PyTorch) | -| [Code 6-13](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) | Dueling DQN agent (with 
TensorFlow) | -| [Code 6-14](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | Dueling DQN agent (with PyTorch) | -| [Code 7-1](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) | On-policy VPG agent (with TensorFlow) | -| [Code 7-2](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html) | On-policy VPG agent (with PyTorch) | -| [Code 7-3](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) | On-policy VPG agent with baseline (with TensorFlow) | -| [Code 7-4](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html) | On-policy VPG agent with baseline (with PyTorch) | -| [Code 7-5](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) | Off-policy PG agent (with TensorFlow) | -| [Code 7-6](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html) | Off-policy PG agent (with PyTorch) | -| [Code 7-7](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) | Off-policy PG agent with baseline (with TensorFlow) | -| [Code 7-8](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | Off-policy PG agent with baseline (with PyTorch) | -| [Code 8-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) | Action-value AC agent (with TensorFlow) | -| [Code 8-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html) | Action-value AC agent (with PyTorch) | -| [Code 8-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) | Advantage AC agent (with TensorFlow) | -| [Code 8-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html) | Advantage AC agent (with PyTorch) | -| [Code 8-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) | Eligibility-trace AC agent (with TensorFlow) | -| [Code 8-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html) | Eligibility-trace AC agent (with PyTorch) | -| [Code 8-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) | Replayer for PPO | -| [Code 8-8](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) | PPO agent (with TensorFlow) | -| [Code 8-9](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html) | PPO agent (with PyTorch) | -| [Code 8-10](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) | Calculate CG (with TensorFlow) | -| [Code 8-11](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html) | Calculate CG (with PyTorch) | -| [Code 8-12](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) | NPG agent (with TensorFlow) | -| [Code 8-13](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html) | NPG agent (with PyTorch) | -| [Code 8-14](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) | TRPO agent (with TensorFlow) | -| [Code 8-15](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html) | TRPO agent (with PyTorch) | -| [Code 8-16](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) | OffPAC agent (with TensorFlow) | -| [Code 8-17](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | OffPAC agent (with PyTorch) | -| [Code 
9-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) | OU process | -| [Code 9-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) | DDPG agent (with TensorFlow) | -| [Code 9-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html) | DDPG agent (with PyTorch) | -| [Code 9-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) | TD3 agent (with TensorFlow) | -| [Code 9-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | TD3 agent (with PyTorch) | -| [Code 10-1](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | Closed-form solution of `LunarLander-v2` | -| [Code 10-2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | Closed-form solution of `LunarLanderContinuous-v2` | -| [Code 10-3](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) | SQL agent (with TensorFlow) | -| [Code 10-4](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html) | SQL agent (with PyTorch) | -| [Code 10-5](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) | SAC agent (with TensorFlow) | -| [Code 10-6](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html) | SAC agent (with PyTorch) | -| [Code 10-7](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) | SAC with automatic entropy adjustment (with TensorFlow) | -| [Code 10-8](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | SAC with automatic entropy adjustment (with PyTorch) | -| [Code 10-9](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) | SAC with automatic entropy adjustment for continuous action space (with TensorFlow) | -| [Code 10-10](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | SAC with automatic entropy adjustment for continuous action space (with PyTorch) | -| [Code 11-1](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | Closed-form solution of `BipedalWalker-v3` | -| [Code 11-2](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html) | ES agent | -| [Code 11-3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html) | Train and test ES agent | -| [Code 11-4](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | ARS agent | -| [Code 12-1](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | Closed-form solution of `PongNoFrameskip-v4` | -| [Code 12-2](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Wrapped environment class | -| [Code 12-3](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Categorical DQN agent (with TensorFlow) | -| [Code 12-4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html) | Categorical DQN agent (with PyTorch) | -| [Code 12-5](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) | QR-DQN agent (with TensorFlow) | -| [Code 12-6](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html) | QR-DQN agent (with PyTorch) | -| [Code 12-7](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) | Quantile network (with 
TensorFlow) |
-| [Code 12-8](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | Quantile network (with PyTorch) |
-| [Code 12-9](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) | IQN agent (with TensorFlow) |
-| [Code 12-10](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | IQN agent (with PyTorch) |
-| [Code 13-1](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | The environment class `BernoulliMABEnv` |
-| [Code 13-2](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Register the environment class `BernoulliMABEnv` into Gym |
-| [Code 13-3](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | $\epsilon$-greedy policy agent |
-| [Code 13-4](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Evaluate average regret |
-| [Code 13-5](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | UCB1 agent |
-| [Code 13-6](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Bayesian UCB agent |
-| [Code 13-7](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Thompson sampling agent |
+| [Code 1-1](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Check the observation space and action space of the environment |
+| [Code 1-2](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Closed-form agent for task `MountainCar-v0` |
+| [Code 1-3](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Play an episode |
+| [Code 1-4](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Test the performance by playing 100 episodes |
+| [Code 1-5](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCarContinuous-v0_ClosedForm.html) | Check the observation space and action space of the task `MountainCarContinuous-v0` |
+| [Code 1-6](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCarContinuous-v0_ClosedForm.html) | Closed-form agent for task `MountainCarContinuous-v0` |
+| [Code 2-1](https://zhiqingxiao.github.io/rl-book/en2024/code/HungryFull_demo.html) | Use the example Bellman expectation equation |
+| [Code 2-2](https://zhiqingxiao.github.io/rl-book/en2024/code/HungryFull_demo.html) | Example to solve Bellman optimal equation |
+| [Code 2-3](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Import the environment `CliffWalking-v0` and check its information |
+| [Code 2-4](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Find state values and action values using Bellman expectation equations |
+| [Code 2-5](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Find optimal values using LP method |
+| [Code 2-6](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Find an optimal deterministic policy from optimal action values |
+| [Code 3-1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Check the metadata of `FrozenLake-v1` |
+| [Code 3-2](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Play an episode using the policy |
+| [Code 3-3](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Calculate the episode rewards of the random
policy |
+| [Code 3-4](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Implementation of Policy Evaluation |
+| [Code 3-5](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Evaluate the random policy |
+| [Code 3-6](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Policy improvement |
+| [Code 3-7](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Improve the random policy |
+| [Code 3-8](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Policy iteration |
+| [Code 3-9](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Use policy iteration to find the optimal policy and test it |
+| [Code 3-10](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | VI |
+| [Code 3-11](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Find the optimal policy using the value iteration algorithm |
+| [Code 4-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Play an episode |
+| [Code 4-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy MC evaluation |
+| [Code 4-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Visualize a 3-dimensional np.array, which can be indexed by a state |
+| [Code 4-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy MC update with exploring start |
+| [Code 4-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | MC update with soft policy |
+| [Code 4-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Policy evaluation based on importance sampling |
+| [Code 4-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Importance sampling policy optimization with soft policy |
+| [Code 5-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | Initialize and visualize the task |
+| [Code 5-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | SARSA agent |
+| [Code 5-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | Train the agent |
+| [Code 5-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html) | Expected SARSA agent |
+| [Code 5-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html) | Q Learning agent |
+| [Code 5-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html) | Double Q Learning agent |
+| [Code 5-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) | SARSA $(\lambda)$ agent |
+| [Code 6-1](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | Import the environment of `MountainCar-v0` |
+| [Code 6-2](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | The agent that always pushes right |
+| [Code 6-3](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | Tile coding |
+| [Code 6-4](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | SARSA agent with function approximation |
+| [Code 6-5](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html) | SARSA $(\lambda)$ agent with function approximation |
+| [Code
6-6](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) | Experience replayer | +| [Code 6-7](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) | DQN agent with target network (with TensorFlow) | +| [Code 6-8](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html) | DQN agent with target network (with PyTorch) | +| [Code 6-9](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) | Double DQN agent (with TensorFlow) | +| [Code 6-10](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html) | Double DQN agent (with PyTorch) | +| [Code 6-11](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) | Dueling network (with TensorFlow) | +| [Code 6-12](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | Dueling network (with PyTorch) | +| [Code 6-13](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) | Dueling DQN agent (with TensorFlow) | +| [Code 6-14](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | Dueling DQN agent (with PyTorch) | +| [Code 7-1](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) | On-policy VPG agent (with TensorFlow) | +| [Code 7-2](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html) | On-policy VPG agent (with PyTorch) | +| [Code 7-3](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) | On-policy VPG agent with baseline (with TensorFlow) | +| [Code 7-4](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html) | On-policy VPG agent with baseline (with PyTorch) | +| [Code 7-5](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) | Off-policy PG agent (with TensorFlow) | +| [Code 7-6](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html) | Off-policy PG agent (with PyTorch) | +| [Code 7-7](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) | Off-policy PG agent with baseline (with TensorFlow) | +| [Code 7-8](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | Off-policy PG agent with baseline (with PyTorch) | +| [Code 8-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) | Action-value AC agent (with TensorFlow) | +| [Code 8-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html) | Action-value AC agent (with PyTorch) | +| [Code 8-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) | Advantage AC agent (with TensorFlow) | +| [Code 8-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html) | Advantage AC agent (with PyTorch) | +| [Code 8-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) | Eligibility-trace AC agent (with TensorFlow) | +| [Code 8-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html) | Eligibility-trace AC agent (with PyTorch) | +| [Code 8-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) | Replayer for PPO | +| [Code 8-8](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) | PPO agent (with TensorFlow) | +| [Code 
8-9](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html) | PPO agent (with PyTorch) | +| [Code 8-10](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) | Calculate CG (with TensorFlow) | +| [Code 8-11](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html) | Calculate CG (with PyTorch) | +| [Code 8-12](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) | NPG agent (with TensorFlow) | +| [Code 8-13](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html) | NPG agent (with PyTorch) | +| [Code 8-14](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) | TRPO agent (with TensorFlow) | +| [Code 8-15](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html) | TRPO agent (with PyTorch) | +| [Code 8-16](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) | OffPAC agent (with TensorFlow) | +| [Code 8-17](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) | OffPAC agent (with PyTorch) | +| [Code 9-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) | OU process | +| [Code 9-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) | DDPG agent (with TensorFlow) | +| [Code 9-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html) | DDPG agent (with PyTorch) | +| [Code 9-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) | TD3 agent (with TensorFlow) | +| [Code 9-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) | TD3 agent (with PyTorch) | +| [Code 10-1](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | Closed-form solution of `LunarLander-v2` | +| [Code 10-2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | Closed-form solution of `LunarLanderContinuous-v2` | +| [Code 10-3](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) | SQL agent (with TensorFlow) | +| [Code 10-4](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html) | SQL agent (with PyTorch) | +| [Code 10-5](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) | SAC agent (with TensorFlow) | +| [Code 10-6](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html) | SAC agent (with PyTorch) | +| [Code 10-7](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) | SAC with automatic entropy adjustment (with TensorFlow) | +| [Code 10-8](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) | SAC with automatic entropy adjustment (with PyTorch) | +| [Code 10-9](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) | SAC with automatic entropy adjustment for continuous action space (with TensorFlow) | +| [Code 10-10](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) | SAC with automatic entropy adjustment for continuous action space (with PyTorch) | +| [Code 11-1](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | Closed-form solution of `BipedalWalker-v3` | +| [Code 11-2](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html) | ES agent | +| [Code 11-3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html) | Train and 
test ES agent |
+| [Code 11-4](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) | ARS agent |
+| [Code 12-1](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | Closed-form solution of `PongNoFrameskip-v4` |
+| [Code 12-2](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Wrapped environment class |
+| [Code 12-3](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Categorical DQN agent (with TensorFlow) |
+| [Code 12-4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html) | Categorical DQN agent (with PyTorch) |
+| [Code 12-5](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) | QR-DQN agent (with TensorFlow) |
+| [Code 12-6](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html) | QR-DQN agent (with PyTorch) |
+| [Code 12-7](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) | Quantile network (with TensorFlow) |
+| [Code 12-8](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | Quantile network (with PyTorch) |
+| [Code 12-9](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) | IQN agent (with TensorFlow) |
+| [Code 12-10](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | IQN agent (with PyTorch) |
+| [Code 13-1](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | The environment class `BernoulliMABEnv` |
+| [Code 13-2](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Register the environment class `BernoulliMABEnv` into Gym |
+| [Code 13-3](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | $\epsilon$-greedy policy agent |
+| [Code 13-4](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Evaluate average regret |
+| [Code 13-5](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | UCB1 agent |
+| [Code 13-6](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Bayesian UCB agent |
+| [Code 13-7](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Thompson sampling agent |
| [Code 14-1](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The constructor of the class `BoardGameEnv` |
| [Code 14-2](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The member functions `is_valid()`, `has_valid()`, and `get_valid()` in the class `BoardGameEnv` |
| [Code 14-3](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/kinarow.py) | The member function `get_winner()` in the class `KInARowEnv` |
| [Code 14-4](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The member functions `next_step()` and `get_next_state()` in the class `BoardGameEnv` |
| [Code 14-5](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The member functions `reset()`, `step()`, and `render()` in the class `BoardGameEnv` |
-| [Code 14-6](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | Exhaustive search agent |
-| [Code 14-7](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | Self-play |
-| [Code
14-8](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | Replay buffer of AlphaZero agent | -| [Code 14-9](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | Network of AlphaZero agent (with TensorFlow) | -| [Code 14-10](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | Network of AlphaZero agent (with PyTorch) | -| [Code 14-11](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | AlphaZero agent (with TensorFlow) | -| [Code 14-12](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero agent (with PyTorch) | -| [Code 15-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Adjust the camera | -| [Code 15-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Visualize the interaction with the environment | -| [Code 15-3](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | Experience replayer for state–action pairs | -| [Code 15-4](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | BC agent (with TensorFlow) | -| [Code 15-5](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html) | BC agent (with PyTorch) | -| [Code 15-6](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | GAIL-PPO agent (with TensorFlow) | -| [Code 15-7](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | GAIL-PPO agent (with PyTorch) | -| [Code 16-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | The environment class `TigerEnv` for the task “Tiger” | -| [Code 16-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | Register the environment class `TigerEnv` | -| [Code 16-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | Optimal policy when discounted factor $\gamma=1$ | -| [Code 16-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | Belief VI | -| [Code 16-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | PBVI | +| [Code 14-6](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | Exhaustive search agent | +| [Code 14-7](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | Self-play | +| [Code 14-8](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | Replay buffer of AlphaZero agent | +| [Code 14-9](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | Network of AlphaZero agent (with TensorFlow) | +| [Code 14-10](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | Network of AlphaZero agent (with PyTorch) | +| [Code 14-11](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | AlphaZero agent (with TensorFlow) | +| [Code 14-12](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero agent (with PyTorch) | +| [Code 15-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | The environment class `TigerEnv` for the task “Tiger” | +| [Code 15-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | Register the environment class `TigerEnv` | +| [Code 
15-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | Optimal policy when discount factor $\gamma=1$ |
+| [Code 15-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) | Belief VI |
+| [Code 15-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) | PBVI |
+| [Code 16-1](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Adjust the camera |
+| [Code 16-2](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Visualize the interaction with the environment |
+| [Code 16-3](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) | Experience replayer for state–action pairs |
+| [Code 16-4](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) | BC agent (with TensorFlow) |
+| [Code 16-5](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html) | BC agent (with PyTorch) |
+| [Code 16-6](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | GAIL-PPO agent (with TensorFlow) |
+| [Code 16-7](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | GAIL-PPO agent (with PyTorch) |
diff --git a/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html b/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html
similarity index 100%
rename from en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html
rename to en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html
diff --git a/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.ipynb b/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.ipynb
similarity index 100%
rename from en2023/code/Acrobot-v1_AdvantageActorCritic_tf.ipynb
rename to en2024/code/Acrobot-v1_AdvantageActorCritic_tf.ipynb
diff --git a/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html b/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html
similarity index 100%
rename from en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html
rename to en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html
diff --git a/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.ipynb b/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.ipynb
similarity index 100%
rename from en2023/code/Acrobot-v1_AdvantageActorCritic_torch.ipynb
rename to en2024/code/Acrobot-v1_AdvantageActorCritic_torch.ipynb
diff --git a/en2023/code/Acrobot-v1_ClosedForm.html b/en2024/code/Acrobot-v1_ClosedForm.html
similarity index 100%
rename from en2023/code/Acrobot-v1_ClosedForm.html
rename to en2024/code/Acrobot-v1_ClosedForm.html
diff --git a/en2023/code/Acrobot-v1_ClosedForm.ipynb b/en2024/code/Acrobot-v1_ClosedForm.ipynb
similarity index 100%
rename from en2023/code/Acrobot-v1_ClosedForm.ipynb
rename to en2024/code/Acrobot-v1_ClosedForm.ipynb
diff --git a/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html b/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html
similarity index 100%
rename from en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html
rename to en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html
diff --git a/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.ipynb b/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.ipynb
similarity index 100%
rename from en2023/code/Acrobot-v1_EligibilityTraceAC_tf.ipynb
rename to en2024/code/Acrobot-v1_EligibilityTraceAC_tf.ipynb
diff --git a/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html b/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html
similarity index 100%
rename from
en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html rename to en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html diff --git a/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.ipynb b/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_EligibilityTraceAC_torch.ipynb rename to en2024/code/Acrobot-v1_EligibilityTraceAC_torch.ipynb diff --git a/en2023/code/Acrobot-v1_NPG_tf.html b/en2024/code/Acrobot-v1_NPG_tf.html similarity index 100% rename from en2023/code/Acrobot-v1_NPG_tf.html rename to en2024/code/Acrobot-v1_NPG_tf.html diff --git a/en2023/code/Acrobot-v1_NPG_tf.ipynb b/en2024/code/Acrobot-v1_NPG_tf.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_NPG_tf.ipynb rename to en2024/code/Acrobot-v1_NPG_tf.ipynb diff --git a/en2023/code/Acrobot-v1_NPG_torch.html b/en2024/code/Acrobot-v1_NPG_torch.html similarity index 100% rename from en2023/code/Acrobot-v1_NPG_torch.html rename to en2024/code/Acrobot-v1_NPG_torch.html diff --git a/en2023/code/Acrobot-v1_NPG_torch.ipynb b/en2024/code/Acrobot-v1_NPG_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_NPG_torch.ipynb rename to en2024/code/Acrobot-v1_NPG_torch.ipynb diff --git a/en2023/code/Acrobot-v1_OffPAC_tf.html b/en2024/code/Acrobot-v1_OffPAC_tf.html similarity index 100% rename from en2023/code/Acrobot-v1_OffPAC_tf.html rename to en2024/code/Acrobot-v1_OffPAC_tf.html diff --git a/en2023/code/Acrobot-v1_OffPAC_tf.ipynb b/en2024/code/Acrobot-v1_OffPAC_tf.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_OffPAC_tf.ipynb rename to en2024/code/Acrobot-v1_OffPAC_tf.ipynb diff --git a/en2023/code/Acrobot-v1_OffPAC_torch.html b/en2024/code/Acrobot-v1_OffPAC_torch.html similarity index 100% rename from en2023/code/Acrobot-v1_OffPAC_torch.html rename to en2024/code/Acrobot-v1_OffPAC_torch.html diff --git a/en2023/code/Acrobot-v1_OffPAC_torch.ipynb b/en2024/code/Acrobot-v1_OffPAC_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_OffPAC_torch.ipynb rename to en2024/code/Acrobot-v1_OffPAC_torch.ipynb diff --git a/en2023/code/Acrobot-v1_PPO_tf.html b/en2024/code/Acrobot-v1_PPO_tf.html similarity index 100% rename from en2023/code/Acrobot-v1_PPO_tf.html rename to en2024/code/Acrobot-v1_PPO_tf.html diff --git a/en2023/code/Acrobot-v1_PPO_tf.ipynb b/en2024/code/Acrobot-v1_PPO_tf.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_PPO_tf.ipynb rename to en2024/code/Acrobot-v1_PPO_tf.ipynb diff --git a/en2023/code/Acrobot-v1_PPO_torch.html b/en2024/code/Acrobot-v1_PPO_torch.html similarity index 100% rename from en2023/code/Acrobot-v1_PPO_torch.html rename to en2024/code/Acrobot-v1_PPO_torch.html diff --git a/en2023/code/Acrobot-v1_PPO_torch.ipynb b/en2024/code/Acrobot-v1_PPO_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_PPO_torch.ipynb rename to en2024/code/Acrobot-v1_PPO_torch.ipynb diff --git a/en2023/code/Acrobot-v1_QActorCritic_tf.html b/en2024/code/Acrobot-v1_QActorCritic_tf.html similarity index 100% rename from en2023/code/Acrobot-v1_QActorCritic_tf.html rename to en2024/code/Acrobot-v1_QActorCritic_tf.html diff --git a/en2023/code/Acrobot-v1_QActorCritic_tf.ipynb b/en2024/code/Acrobot-v1_QActorCritic_tf.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_QActorCritic_tf.ipynb rename to en2024/code/Acrobot-v1_QActorCritic_tf.ipynb diff --git a/en2023/code/Acrobot-v1_QActorCritic_torch.html b/en2024/code/Acrobot-v1_QActorCritic_torch.html similarity index 100% rename from 
en2023/code/Acrobot-v1_QActorCritic_torch.html rename to en2024/code/Acrobot-v1_QActorCritic_torch.html diff --git a/en2023/code/Acrobot-v1_QActorCritic_torch.ipynb b/en2024/code/Acrobot-v1_QActorCritic_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_QActorCritic_torch.ipynb rename to en2024/code/Acrobot-v1_QActorCritic_torch.ipynb diff --git a/en2023/code/Acrobot-v1_TRPO_tf.html b/en2024/code/Acrobot-v1_TRPO_tf.html similarity index 100% rename from en2023/code/Acrobot-v1_TRPO_tf.html rename to en2024/code/Acrobot-v1_TRPO_tf.html diff --git a/en2023/code/Acrobot-v1_TRPO_tf.ipynb b/en2024/code/Acrobot-v1_TRPO_tf.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_TRPO_tf.ipynb rename to en2024/code/Acrobot-v1_TRPO_tf.ipynb diff --git a/en2023/code/Acrobot-v1_TRPO_torch.html b/en2024/code/Acrobot-v1_TRPO_torch.html similarity index 100% rename from en2023/code/Acrobot-v1_TRPO_torch.html rename to en2024/code/Acrobot-v1_TRPO_torch.html diff --git a/en2023/code/Acrobot-v1_TRPO_torch.ipynb b/en2024/code/Acrobot-v1_TRPO_torch.ipynb similarity index 100% rename from en2023/code/Acrobot-v1_TRPO_torch.ipynb rename to en2024/code/Acrobot-v1_TRPO_torch.ipynb diff --git a/en2023/code/AntBulletEnv-v0_ClosedForm_demo.html b/en2024/code/AntBulletEnv-v0_ClosedForm_demo.html similarity index 100% rename from en2023/code/AntBulletEnv-v0_ClosedForm_demo.html rename to en2024/code/AntBulletEnv-v0_ClosedForm_demo.html diff --git a/en2023/code/AntBulletEnv-v0_ClosedForm_demo.ipynb b/en2024/code/AntBulletEnv-v0_ClosedForm_demo.ipynb similarity index 100% rename from en2023/code/AntBulletEnv-v0_ClosedForm_demo.ipynb rename to en2024/code/AntBulletEnv-v0_ClosedForm_demo.ipynb diff --git a/en2023/code/BernoulliMABEnv-v0_demo.html b/en2024/code/BernoulliMABEnv-v0_demo.html similarity index 100% rename from en2023/code/BernoulliMABEnv-v0_demo.html rename to en2024/code/BernoulliMABEnv-v0_demo.html diff --git a/en2023/code/BernoulliMABEnv-v0_demo.ipynb b/en2024/code/BernoulliMABEnv-v0_demo.ipynb similarity index 100% rename from en2023/code/BernoulliMABEnv-v0_demo.ipynb rename to en2024/code/BernoulliMABEnv-v0_demo.ipynb diff --git a/en2023/code/BipedalWalker-v3_ARS.html b/en2024/code/BipedalWalker-v3_ARS.html similarity index 100% rename from en2023/code/BipedalWalker-v3_ARS.html rename to en2024/code/BipedalWalker-v3_ARS.html diff --git a/en2023/code/BipedalWalker-v3_ARS.ipynb b/en2024/code/BipedalWalker-v3_ARS.ipynb similarity index 100% rename from en2023/code/BipedalWalker-v3_ARS.ipynb rename to en2024/code/BipedalWalker-v3_ARS.ipynb diff --git a/en2023/code/BipedalWalker-v3_ClosedForm.html b/en2024/code/BipedalWalker-v3_ClosedForm.html similarity index 100% rename from en2023/code/BipedalWalker-v3_ClosedForm.html rename to en2024/code/BipedalWalker-v3_ClosedForm.html diff --git a/en2023/code/BipedalWalker-v3_ClosedForm.ipynb b/en2024/code/BipedalWalker-v3_ClosedForm.ipynb similarity index 100% rename from en2023/code/BipedalWalker-v3_ClosedForm.ipynb rename to en2024/code/BipedalWalker-v3_ClosedForm.ipynb diff --git a/en2023/code/BipedalWalker-v3_ES.html b/en2024/code/BipedalWalker-v3_ES.html similarity index 100% rename from en2023/code/BipedalWalker-v3_ES.html rename to en2024/code/BipedalWalker-v3_ES.html diff --git a/en2023/code/BipedalWalker-v3_ES.ipynb b/en2024/code/BipedalWalker-v3_ES.ipynb similarity index 100% rename from en2023/code/BipedalWalker-v3_ES.ipynb rename to en2024/code/BipedalWalker-v3_ES.ipynb diff --git a/en2023/code/Blackjack-v1_ClosedForm.html 
b/en2024/code/Blackjack-v1_ClosedForm.html similarity index 100% rename from en2023/code/Blackjack-v1_ClosedForm.html rename to en2024/code/Blackjack-v1_ClosedForm.html diff --git a/en2023/code/Blackjack-v1_ClosedForm.ipynb b/en2024/code/Blackjack-v1_ClosedForm.ipynb similarity index 100% rename from en2023/code/Blackjack-v1_ClosedForm.ipynb rename to en2024/code/Blackjack-v1_ClosedForm.ipynb diff --git a/en2023/code/Blackjack-v1_MonteCarlo_demo.html b/en2024/code/Blackjack-v1_MonteCarlo_demo.html similarity index 100% rename from en2023/code/Blackjack-v1_MonteCarlo_demo.html rename to en2024/code/Blackjack-v1_MonteCarlo_demo.html diff --git a/en2023/code/Blackjack-v1_MonteCarlo_demo.ipynb b/en2024/code/Blackjack-v1_MonteCarlo_demo.ipynb similarity index 100% rename from en2023/code/Blackjack-v1_MonteCarlo_demo.ipynb rename to en2024/code/Blackjack-v1_MonteCarlo_demo.ipynb diff --git a/en2023/code/BreakoutNoFrameskip-v4_ClosedForm.html b/en2024/code/BreakoutNoFrameskip-v4_ClosedForm.html similarity index 100% rename from en2023/code/BreakoutNoFrameskip-v4_ClosedForm.html rename to en2024/code/BreakoutNoFrameskip-v4_ClosedForm.html diff --git a/en2023/code/BreakoutNoFrameskip-v4_ClosedForm.ipynb b/en2024/code/BreakoutNoFrameskip-v4_ClosedForm.ipynb similarity index 100% rename from en2023/code/BreakoutNoFrameskip-v4_ClosedForm.ipynb rename to en2024/code/BreakoutNoFrameskip-v4_ClosedForm.ipynb diff --git a/en2023/code/CartPole-v0_ClosedForm.html b/en2024/code/CartPole-v0_ClosedForm.html similarity index 100% rename from en2023/code/CartPole-v0_ClosedForm.html rename to en2024/code/CartPole-v0_ClosedForm.html diff --git a/en2023/code/CartPole-v0_ClosedForm.ipynb b/en2024/code/CartPole-v0_ClosedForm.ipynb similarity index 100% rename from en2023/code/CartPole-v0_ClosedForm.ipynb rename to en2024/code/CartPole-v0_ClosedForm.ipynb diff --git a/en2023/code/CartPole-v0_OffPolicyVPG_tf.html b/en2024/code/CartPole-v0_OffPolicyVPG_tf.html similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPG_tf.html rename to en2024/code/CartPole-v0_OffPolicyVPG_tf.html diff --git a/en2023/code/CartPole-v0_OffPolicyVPG_tf.ipynb b/en2024/code/CartPole-v0_OffPolicyVPG_tf.ipynb similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPG_tf.ipynb rename to en2024/code/CartPole-v0_OffPolicyVPG_tf.ipynb diff --git a/en2023/code/CartPole-v0_OffPolicyVPG_torch.html b/en2024/code/CartPole-v0_OffPolicyVPG_torch.html similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPG_torch.html rename to en2024/code/CartPole-v0_OffPolicyVPG_torch.html diff --git a/en2023/code/CartPole-v0_OffPolicyVPG_torch.ipynb b/en2024/code/CartPole-v0_OffPolicyVPG_torch.ipynb similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPG_torch.ipynb rename to en2024/code/CartPole-v0_OffPolicyVPG_torch.ipynb diff --git a/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html b/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html rename to en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html diff --git a/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.ipynb b/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.ipynb similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.ipynb rename to en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.ipynb diff --git a/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html b/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html similarity index 100% 
rename from en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html rename to en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html diff --git a/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.ipynb b/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.ipynb similarity index 100% rename from en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.ipynb rename to en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.ipynb diff --git a/en2023/code/CartPole-v0_VPG_tf.html b/en2024/code/CartPole-v0_VPG_tf.html similarity index 100% rename from en2023/code/CartPole-v0_VPG_tf.html rename to en2024/code/CartPole-v0_VPG_tf.html diff --git a/en2023/code/CartPole-v0_VPG_tf.ipynb b/en2024/code/CartPole-v0_VPG_tf.ipynb similarity index 100% rename from en2023/code/CartPole-v0_VPG_tf.ipynb rename to en2024/code/CartPole-v0_VPG_tf.ipynb diff --git a/en2023/code/CartPole-v0_VPG_torch.html b/en2024/code/CartPole-v0_VPG_torch.html similarity index 100% rename from en2023/code/CartPole-v0_VPG_torch.html rename to en2024/code/CartPole-v0_VPG_torch.html diff --git a/en2023/code/CartPole-v0_VPG_torch.ipynb b/en2024/code/CartPole-v0_VPG_torch.ipynb similarity index 100% rename from en2023/code/CartPole-v0_VPG_torch.ipynb rename to en2024/code/CartPole-v0_VPG_torch.ipynb diff --git a/en2023/code/CartPole-v0_VPGwBaseline_tf.html b/en2024/code/CartPole-v0_VPGwBaseline_tf.html similarity index 100% rename from en2023/code/CartPole-v0_VPGwBaseline_tf.html rename to en2024/code/CartPole-v0_VPGwBaseline_tf.html diff --git a/en2023/code/CartPole-v0_VPGwBaseline_tf.ipynb b/en2024/code/CartPole-v0_VPGwBaseline_tf.ipynb similarity index 100% rename from en2023/code/CartPole-v0_VPGwBaseline_tf.ipynb rename to en2024/code/CartPole-v0_VPGwBaseline_tf.ipynb diff --git a/en2023/code/CartPole-v0_VPGwBaseline_torch.html b/en2024/code/CartPole-v0_VPGwBaseline_torch.html similarity index 100% rename from en2023/code/CartPole-v0_VPGwBaseline_torch.html rename to en2024/code/CartPole-v0_VPGwBaseline_torch.html diff --git a/en2023/code/CartPole-v0_VPGwBaseline_torch.ipynb b/en2024/code/CartPole-v0_VPGwBaseline_torch.ipynb similarity index 100% rename from en2023/code/CartPole-v0_VPGwBaseline_torch.ipynb rename to en2024/code/CartPole-v0_VPGwBaseline_torch.ipynb diff --git a/en2023/code/CartPole-v1_ClosedForm.html b/en2024/code/CartPole-v1_ClosedForm.html similarity index 100% rename from en2023/code/CartPole-v1_ClosedForm.html rename to en2024/code/CartPole-v1_ClosedForm.html diff --git a/en2023/code/CartPole-v1_ClosedForm.ipynb b/en2024/code/CartPole-v1_ClosedForm.ipynb similarity index 100% rename from en2023/code/CartPole-v1_ClosedForm.ipynb rename to en2024/code/CartPole-v1_ClosedForm.ipynb diff --git a/en2023/code/CliffWalking-v0_Bellman_demo.html b/en2024/code/CliffWalking-v0_Bellman_demo.html similarity index 100% rename from en2023/code/CliffWalking-v0_Bellman_demo.html rename to en2024/code/CliffWalking-v0_Bellman_demo.html diff --git a/en2023/code/CliffWalking-v0_Bellman_demo.ipynb b/en2024/code/CliffWalking-v0_Bellman_demo.ipynb similarity index 100% rename from en2023/code/CliffWalking-v0_Bellman_demo.ipynb rename to en2024/code/CliffWalking-v0_Bellman_demo.ipynb diff --git a/en2023/code/CliffWalking-v0_ClosedForm.html b/en2024/code/CliffWalking-v0_ClosedForm.html similarity index 100% rename from en2023/code/CliffWalking-v0_ClosedForm.html rename to en2024/code/CliffWalking-v0_ClosedForm.html diff --git a/en2023/code/CliffWalking-v0_ClosedForm.ipynb b/en2024/code/CliffWalking-v0_ClosedForm.ipynb similarity index 100% 
rename from en2023/code/CliffWalking-v0_ClosedForm.ipynb rename to en2024/code/CliffWalking-v0_ClosedForm.ipynb diff --git a/en2023/code/FeedAndFull_demo.html b/en2024/code/FeedAndFull_demo.html similarity index 100% rename from en2023/code/FeedAndFull_demo.html rename to en2024/code/FeedAndFull_demo.html diff --git a/en2023/code/FeedAndFull_demo.ipynb b/en2024/code/FeedAndFull_demo.ipynb similarity index 100% rename from en2023/code/FeedAndFull_demo.ipynb rename to en2024/code/FeedAndFull_demo.ipynb diff --git a/en2023/code/FrozenLake-v1_ClosedForm.html b/en2024/code/FrozenLake-v1_ClosedForm.html similarity index 100% rename from en2023/code/FrozenLake-v1_ClosedForm.html rename to en2024/code/FrozenLake-v1_ClosedForm.html diff --git a/en2023/code/FrozenLake-v1_ClosedForm.ipynb b/en2024/code/FrozenLake-v1_ClosedForm.ipynb similarity index 100% rename from en2023/code/FrozenLake-v1_ClosedForm.ipynb rename to en2024/code/FrozenLake-v1_ClosedForm.ipynb diff --git a/en2023/code/FrozenLake-v1_DP_demo.html b/en2024/code/FrozenLake-v1_DP_demo.html similarity index 100% rename from en2023/code/FrozenLake-v1_DP_demo.html rename to en2024/code/FrozenLake-v1_DP_demo.html diff --git a/en2023/code/FrozenLake-v1_DP_demo.ipynb b/en2024/code/FrozenLake-v1_DP_demo.ipynb similarity index 100% rename from en2023/code/FrozenLake-v1_DP_demo.ipynb rename to en2024/code/FrozenLake-v1_DP_demo.ipynb diff --git a/en2023/code/FrozenLake8x8-v1_ClosedForm.html b/en2024/code/FrozenLake8x8-v1_ClosedForm.html similarity index 100% rename from en2023/code/FrozenLake8x8-v1_ClosedForm.html rename to en2024/code/FrozenLake8x8-v1_ClosedForm.html diff --git a/en2023/code/FrozenLake8x8-v1_ClosedForm.ipynb b/en2024/code/FrozenLake8x8-v1_ClosedForm.ipynb similarity index 100% rename from en2023/code/FrozenLake8x8-v1_ClosedForm.ipynb rename to en2024/code/FrozenLake8x8-v1_ClosedForm.ipynb diff --git a/en2023/code/GaussianMABEnv_demo.html b/en2024/code/GaussianMABEnv_demo.html similarity index 100% rename from en2023/code/GaussianMABEnv_demo.html rename to en2024/code/GaussianMABEnv_demo.html diff --git a/en2023/code/GaussianMABEnv_demo.ipynb b/en2024/code/GaussianMABEnv_demo.ipynb similarity index 100% rename from en2023/code/GaussianMABEnv_demo.ipynb rename to en2024/code/GaussianMABEnv_demo.ipynb diff --git a/en2023/code/GuessingGame-v0_ClosedForm.html b/en2024/code/GuessingGame-v0_ClosedForm.html similarity index 100% rename from en2023/code/GuessingGame-v0_ClosedForm.html rename to en2024/code/GuessingGame-v0_ClosedForm.html diff --git a/en2023/code/GuessingGame-v0_ClosedForm.ipynb b/en2024/code/GuessingGame-v0_ClosedForm.ipynb similarity index 100% rename from en2023/code/GuessingGame-v0_ClosedForm.ipynb rename to en2024/code/GuessingGame-v0_ClosedForm.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_BC_tf.html b/en2024/code/HumanoidBulletEnv-v0_BC_tf.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_BC_tf.html rename to en2024/code/HumanoidBulletEnv-v0_BC_tf.html diff --git a/en2023/code/HumanoidBulletEnv-v0_BC_tf.ipynb b/en2024/code/HumanoidBulletEnv-v0_BC_tf.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_BC_tf.ipynb rename to en2024/code/HumanoidBulletEnv-v0_BC_tf.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_BC_torch.html b/en2024/code/HumanoidBulletEnv-v0_BC_torch.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_BC_torch.html rename to en2024/code/HumanoidBulletEnv-v0_BC_torch.html diff --git a/en2023/code/HumanoidBulletEnv-v0_BC_torch.ipynb 
b/en2024/code/HumanoidBulletEnv-v0_BC_torch.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_BC_torch.ipynb rename to en2024/code/HumanoidBulletEnv-v0_BC_torch.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html b/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html rename to en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html diff --git a/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.ipynb b/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.ipynb rename to en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html b/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html rename to en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.ipynb b/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.ipynb rename to en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html b/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html rename to en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.ipynb b/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.ipynb rename to en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.ipynb diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILTRPO_torch.html b/en2024/code/HumanoidBulletEnv-v0_GAILTRPO_torch.html similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILTRPO_torch.html rename to en2024/code/HumanoidBulletEnv-v0_GAILTRPO_torch.html diff --git a/en2023/code/HumanoidBulletEnv-v0_GAILTRPO_torch.ipynb b/en2024/code/HumanoidBulletEnv-v0_GAILTRPO_torch.ipynb similarity index 100% rename from en2023/code/HumanoidBulletEnv-v0_GAILTRPO_torch.ipynb rename to en2024/code/HumanoidBulletEnv-v0_GAILTRPO_torch.ipynb diff --git a/en2023/code/HungryFull_demo.html b/en2024/code/HungryFull_demo.html similarity index 100% rename from en2023/code/HungryFull_demo.html rename to en2024/code/HungryFull_demo.html diff --git a/en2023/code/HungryFull_demo.ipynb b/en2024/code/HungryFull_demo.ipynb similarity index 100% rename from en2023/code/HungryFull_demo.ipynb rename to en2024/code/HungryFull_demo.ipynb diff --git a/en2023/code/LunarLander-v2_ClosedForm.html b/en2024/code/LunarLander-v2_ClosedForm.html similarity index 100% rename from en2023/code/LunarLander-v2_ClosedForm.html rename to en2024/code/LunarLander-v2_ClosedForm.html diff --git a/en2023/code/LunarLander-v2_ClosedForm.ipynb b/en2024/code/LunarLander-v2_ClosedForm.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_ClosedForm.ipynb rename to en2024/code/LunarLander-v2_ClosedForm.ipynb diff --git a/en2023/code/LunarLander-v2_SACwA_tf.html b/en2024/code/LunarLander-v2_SACwA_tf.html similarity index 100% rename from en2023/code/LunarLander-v2_SACwA_tf.html rename to en2024/code/LunarLander-v2_SACwA_tf.html diff --git a/en2023/code/LunarLander-v2_SACwA_tf.ipynb b/en2024/code/LunarLander-v2_SACwA_tf.ipynb similarity index 100% rename from 
en2023/code/LunarLander-v2_SACwA_tf.ipynb rename to en2024/code/LunarLander-v2_SACwA_tf.ipynb diff --git a/en2023/code/LunarLander-v2_SACwA_torch.html b/en2024/code/LunarLander-v2_SACwA_torch.html similarity index 100% rename from en2023/code/LunarLander-v2_SACwA_torch.html rename to en2024/code/LunarLander-v2_SACwA_torch.html diff --git a/en2023/code/LunarLander-v2_SACwA_torch.ipynb b/en2024/code/LunarLander-v2_SACwA_torch.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_SACwA_torch.ipynb rename to en2024/code/LunarLander-v2_SACwA_torch.ipynb diff --git a/en2023/code/LunarLander-v2_SACwoA_tf.html b/en2024/code/LunarLander-v2_SACwoA_tf.html similarity index 100% rename from en2023/code/LunarLander-v2_SACwoA_tf.html rename to en2024/code/LunarLander-v2_SACwoA_tf.html diff --git a/en2023/code/LunarLander-v2_SACwoA_tf.ipynb b/en2024/code/LunarLander-v2_SACwoA_tf.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_SACwoA_tf.ipynb rename to en2024/code/LunarLander-v2_SACwoA_tf.ipynb diff --git a/en2023/code/LunarLander-v2_SACwoA_torch.html b/en2024/code/LunarLander-v2_SACwoA_torch.html similarity index 100% rename from en2023/code/LunarLander-v2_SACwoA_torch.html rename to en2024/code/LunarLander-v2_SACwoA_torch.html diff --git a/en2023/code/LunarLander-v2_SACwoA_torch.ipynb b/en2024/code/LunarLander-v2_SACwoA_torch.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_SACwoA_torch.ipynb rename to en2024/code/LunarLander-v2_SACwoA_torch.ipynb diff --git a/en2023/code/LunarLander-v2_SQL_tf.html b/en2024/code/LunarLander-v2_SQL_tf.html similarity index 100% rename from en2023/code/LunarLander-v2_SQL_tf.html rename to en2024/code/LunarLander-v2_SQL_tf.html diff --git a/en2023/code/LunarLander-v2_SQL_tf.ipynb b/en2024/code/LunarLander-v2_SQL_tf.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_SQL_tf.ipynb rename to en2024/code/LunarLander-v2_SQL_tf.ipynb diff --git a/en2023/code/LunarLander-v2_SQL_torch.html b/en2024/code/LunarLander-v2_SQL_torch.html similarity index 100% rename from en2023/code/LunarLander-v2_SQL_torch.html rename to en2024/code/LunarLander-v2_SQL_torch.html diff --git a/en2023/code/LunarLander-v2_SQL_torch.ipynb b/en2024/code/LunarLander-v2_SQL_torch.ipynb similarity index 100% rename from en2023/code/LunarLander-v2_SQL_torch.ipynb rename to en2024/code/LunarLander-v2_SQL_torch.ipynb diff --git a/en2023/code/LunarLanderContinuous-v2_ClosedForm.html b/en2024/code/LunarLanderContinuous-v2_ClosedForm.html similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_ClosedForm.html rename to en2024/code/LunarLanderContinuous-v2_ClosedForm.html diff --git a/en2023/code/LunarLanderContinuous-v2_ClosedForm.ipynb b/en2024/code/LunarLanderContinuous-v2_ClosedForm.ipynb similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_ClosedForm.ipynb rename to en2024/code/LunarLanderContinuous-v2_ClosedForm.ipynb diff --git a/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html b/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_SACwA_tf.html rename to en2024/code/LunarLanderContinuous-v2_SACwA_tf.html diff --git a/en2023/code/LunarLanderContinuous-v2_SACwA_tf.ipynb b/en2024/code/LunarLanderContinuous-v2_SACwA_tf.ipynb similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_SACwA_tf.ipynb rename to en2024/code/LunarLanderContinuous-v2_SACwA_tf.ipynb diff --git a/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html 
b/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_SACwA_torch.html rename to en2024/code/LunarLanderContinuous-v2_SACwA_torch.html diff --git a/en2023/code/LunarLanderContinuous-v2_SACwA_torch.ipynb b/en2024/code/LunarLanderContinuous-v2_SACwA_torch.ipynb similarity index 100% rename from en2023/code/LunarLanderContinuous-v2_SACwA_torch.ipynb rename to en2024/code/LunarLanderContinuous-v2_SACwA_torch.ipynb diff --git a/en2023/code/MountainCar-v0_ClosedForm.html b/en2024/code/MountainCar-v0_ClosedForm.html similarity index 100% rename from en2023/code/MountainCar-v0_ClosedForm.html rename to en2024/code/MountainCar-v0_ClosedForm.html diff --git a/en2023/code/MountainCar-v0_ClosedForm.ipynb b/en2024/code/MountainCar-v0_ClosedForm.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_ClosedForm.ipynb rename to en2024/code/MountainCar-v0_ClosedForm.ipynb diff --git a/en2023/code/MountainCar-v0_DQN_tf.html b/en2024/code/MountainCar-v0_DQN_tf.html similarity index 100% rename from en2023/code/MountainCar-v0_DQN_tf.html rename to en2024/code/MountainCar-v0_DQN_tf.html diff --git a/en2023/code/MountainCar-v0_DQN_tf.ipynb b/en2024/code/MountainCar-v0_DQN_tf.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DQN_tf.ipynb rename to en2024/code/MountainCar-v0_DQN_tf.ipynb diff --git a/en2023/code/MountainCar-v0_DQN_torch.html b/en2024/code/MountainCar-v0_DQN_torch.html similarity index 100% rename from en2023/code/MountainCar-v0_DQN_torch.html rename to en2024/code/MountainCar-v0_DQN_torch.html diff --git a/en2023/code/MountainCar-v0_DQN_torch.ipynb b/en2024/code/MountainCar-v0_DQN_torch.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DQN_torch.ipynb rename to en2024/code/MountainCar-v0_DQN_torch.ipynb diff --git a/en2023/code/MountainCar-v0_DoubleDQN_tf.html b/en2024/code/MountainCar-v0_DoubleDQN_tf.html similarity index 100% rename from en2023/code/MountainCar-v0_DoubleDQN_tf.html rename to en2024/code/MountainCar-v0_DoubleDQN_tf.html diff --git a/en2023/code/MountainCar-v0_DoubleDQN_tf.ipynb b/en2024/code/MountainCar-v0_DoubleDQN_tf.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DoubleDQN_tf.ipynb rename to en2024/code/MountainCar-v0_DoubleDQN_tf.ipynb diff --git a/en2023/code/MountainCar-v0_DoubleDQN_torch.html b/en2024/code/MountainCar-v0_DoubleDQN_torch.html similarity index 100% rename from en2023/code/MountainCar-v0_DoubleDQN_torch.html rename to en2024/code/MountainCar-v0_DoubleDQN_torch.html diff --git a/en2023/code/MountainCar-v0_DoubleDQN_torch.ipynb b/en2024/code/MountainCar-v0_DoubleDQN_torch.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DoubleDQN_torch.ipynb rename to en2024/code/MountainCar-v0_DoubleDQN_torch.ipynb diff --git a/en2023/code/MountainCar-v0_DuelDQN_tf.html b/en2024/code/MountainCar-v0_DuelDQN_tf.html similarity index 100% rename from en2023/code/MountainCar-v0_DuelDQN_tf.html rename to en2024/code/MountainCar-v0_DuelDQN_tf.html diff --git a/en2023/code/MountainCar-v0_DuelDQN_tf.ipynb b/en2024/code/MountainCar-v0_DuelDQN_tf.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DuelDQN_tf.ipynb rename to en2024/code/MountainCar-v0_DuelDQN_tf.ipynb diff --git a/en2023/code/MountainCar-v0_DuelDQN_torch.html b/en2024/code/MountainCar-v0_DuelDQN_torch.html similarity index 100% rename from en2023/code/MountainCar-v0_DuelDQN_torch.html rename to en2024/code/MountainCar-v0_DuelDQN_torch.html diff --git 
a/en2023/code/MountainCar-v0_DuelDQN_torch.ipynb b/en2024/code/MountainCar-v0_DuelDQN_torch.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_DuelDQN_torch.ipynb rename to en2024/code/MountainCar-v0_DuelDQN_torch.ipynb diff --git a/en2023/code/MountainCar-v0_SARSA.html b/en2024/code/MountainCar-v0_SARSA.html similarity index 100% rename from en2023/code/MountainCar-v0_SARSA.html rename to en2024/code/MountainCar-v0_SARSA.html diff --git a/en2023/code/MountainCar-v0_SARSA.ipynb b/en2024/code/MountainCar-v0_SARSA.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_SARSA.ipynb rename to en2024/code/MountainCar-v0_SARSA.ipynb diff --git a/en2023/code/MountainCar-v0_SARSA_demo.html b/en2024/code/MountainCar-v0_SARSA_demo.html similarity index 100% rename from en2023/code/MountainCar-v0_SARSA_demo.html rename to en2024/code/MountainCar-v0_SARSA_demo.html diff --git a/en2023/code/MountainCar-v0_SARSA_demo.ipynb b/en2024/code/MountainCar-v0_SARSA_demo.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_SARSA_demo.ipynb rename to en2024/code/MountainCar-v0_SARSA_demo.ipynb diff --git a/en2023/code/MountainCar-v0_SARSAlambda.html b/en2024/code/MountainCar-v0_SARSAlambda.html similarity index 100% rename from en2023/code/MountainCar-v0_SARSAlambda.html rename to en2024/code/MountainCar-v0_SARSAlambda.html diff --git a/en2023/code/MountainCar-v0_SARSAlambda.ipynb b/en2024/code/MountainCar-v0_SARSAlambda.ipynb similarity index 100% rename from en2023/code/MountainCar-v0_SARSAlambda.ipynb rename to en2024/code/MountainCar-v0_SARSAlambda.ipynb diff --git a/en2023/code/MountainCarContinuous-v0_ClosedForm.html b/en2024/code/MountainCarContinuous-v0_ClosedForm.html similarity index 100% rename from en2023/code/MountainCarContinuous-v0_ClosedForm.html rename to en2024/code/MountainCarContinuous-v0_ClosedForm.html diff --git a/en2023/code/MountainCarContinuous-v0_ClosedForm.ipynb b/en2024/code/MountainCarContinuous-v0_ClosedForm.ipynb similarity index 100% rename from en2023/code/MountainCarContinuous-v0_ClosedForm.ipynb rename to en2024/code/MountainCarContinuous-v0_ClosedForm.ipynb diff --git a/en2023/code/Pendulum-v1_ClosedForm.html b/en2024/code/Pendulum-v1_ClosedForm.html similarity index 100% rename from en2023/code/Pendulum-v1_ClosedForm.html rename to en2024/code/Pendulum-v1_ClosedForm.html diff --git a/en2023/code/Pendulum-v1_ClosedForm.ipynb b/en2024/code/Pendulum-v1_ClosedForm.ipynb similarity index 100% rename from en2023/code/Pendulum-v1_ClosedForm.ipynb rename to en2024/code/Pendulum-v1_ClosedForm.ipynb diff --git a/en2023/code/Pendulum-v1_DDPG_tf.html b/en2024/code/Pendulum-v1_DDPG_tf.html similarity index 100% rename from en2023/code/Pendulum-v1_DDPG_tf.html rename to en2024/code/Pendulum-v1_DDPG_tf.html diff --git a/en2023/code/Pendulum-v1_DDPG_tf.ipynb b/en2024/code/Pendulum-v1_DDPG_tf.ipynb similarity index 100% rename from en2023/code/Pendulum-v1_DDPG_tf.ipynb rename to en2024/code/Pendulum-v1_DDPG_tf.ipynb diff --git a/en2023/code/Pendulum-v1_DDPG_torch.html b/en2024/code/Pendulum-v1_DDPG_torch.html similarity index 100% rename from en2023/code/Pendulum-v1_DDPG_torch.html rename to en2024/code/Pendulum-v1_DDPG_torch.html diff --git a/en2023/code/Pendulum-v1_DDPG_torch.ipynb b/en2024/code/Pendulum-v1_DDPG_torch.ipynb similarity index 100% rename from en2023/code/Pendulum-v1_DDPG_torch.ipynb rename to en2024/code/Pendulum-v1_DDPG_torch.ipynb diff --git a/en2023/code/Pendulum-v1_TD3_tf.html b/en2024/code/Pendulum-v1_TD3_tf.html similarity index 
100% rename from en2023/code/Pendulum-v1_TD3_tf.html rename to en2024/code/Pendulum-v1_TD3_tf.html diff --git a/en2023/code/Pendulum-v1_TD3_tf.ipynb b/en2024/code/Pendulum-v1_TD3_tf.ipynb similarity index 100% rename from en2023/code/Pendulum-v1_TD3_tf.ipynb rename to en2024/code/Pendulum-v1_TD3_tf.ipynb diff --git a/en2023/code/Pendulum-v1_TD3_torch.html b/en2024/code/Pendulum-v1_TD3_torch.html similarity index 100% rename from en2023/code/Pendulum-v1_TD3_torch.html rename to en2024/code/Pendulum-v1_TD3_torch.html diff --git a/en2023/code/Pendulum-v1_TD3_torch.ipynb b/en2024/code/Pendulum-v1_TD3_torch.ipynb similarity index 100% rename from en2023/code/Pendulum-v1_TD3_torch.ipynb rename to en2024/code/Pendulum-v1_TD3_torch.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html b/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html rename to en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html diff --git a/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.ipynb b/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.ipynb rename to en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html b/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html rename to en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html diff --git a/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.ipynb b/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.ipynb rename to en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_ClosedForm.html b/en2024/code/PongNoFrameskip-v4_ClosedForm.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_ClosedForm.html rename to en2024/code/PongNoFrameskip-v4_ClosedForm.html diff --git a/en2023/code/PongNoFrameskip-v4_ClosedForm.ipynb b/en2024/code/PongNoFrameskip-v4_ClosedForm.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_ClosedForm.ipynb rename to en2024/code/PongNoFrameskip-v4_ClosedForm.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_IQN_tf.html b/en2024/code/PongNoFrameskip-v4_IQN_tf.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_IQN_tf.html rename to en2024/code/PongNoFrameskip-v4_IQN_tf.html diff --git a/en2023/code/PongNoFrameskip-v4_IQN_tf.ipynb b/en2024/code/PongNoFrameskip-v4_IQN_tf.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_IQN_tf.ipynb rename to en2024/code/PongNoFrameskip-v4_IQN_tf.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_IQN_torch.html b/en2024/code/PongNoFrameskip-v4_IQN_torch.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_IQN_torch.html rename to en2024/code/PongNoFrameskip-v4_IQN_torch.html diff --git a/en2023/code/PongNoFrameskip-v4_IQN_torch.ipynb b/en2024/code/PongNoFrameskip-v4_IQN_torch.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_IQN_torch.ipynb rename to en2024/code/PongNoFrameskip-v4_IQN_torch.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html b/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_QRDQN_tf.html rename to 
en2024/code/PongNoFrameskip-v4_QRDQN_tf.html diff --git a/en2023/code/PongNoFrameskip-v4_QRDQN_tf.ipynb b/en2024/code/PongNoFrameskip-v4_QRDQN_tf.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_QRDQN_tf.ipynb rename to en2024/code/PongNoFrameskip-v4_QRDQN_tf.ipynb diff --git a/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html b/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html similarity index 100% rename from en2023/code/PongNoFrameskip-v4_QRDQN_torch.html rename to en2024/code/PongNoFrameskip-v4_QRDQN_torch.html diff --git a/en2023/code/PongNoFrameskip-v4_QRDQN_torch.ipynb b/en2024/code/PongNoFrameskip-v4_QRDQN_torch.ipynb similarity index 100% rename from en2023/code/PongNoFrameskip-v4_QRDQN_torch.ipynb rename to en2024/code/PongNoFrameskip-v4_QRDQN_torch.ipynb diff --git a/en2023/code/Taxi-v3_ClosedForm.html b/en2024/code/Taxi-v3_ClosedForm.html similarity index 100% rename from en2023/code/Taxi-v3_ClosedForm.html rename to en2024/code/Taxi-v3_ClosedForm.html diff --git a/en2023/code/Taxi-v3_ClosedForm.ipynb b/en2024/code/Taxi-v3_ClosedForm.ipynb similarity index 100% rename from en2023/code/Taxi-v3_ClosedForm.ipynb rename to en2024/code/Taxi-v3_ClosedForm.ipynb diff --git a/en2023/code/Taxi-v3_DoubleQLearning.html b/en2024/code/Taxi-v3_DoubleQLearning.html similarity index 100% rename from en2023/code/Taxi-v3_DoubleQLearning.html rename to en2024/code/Taxi-v3_DoubleQLearning.html diff --git a/en2023/code/Taxi-v3_DoubleQLearning.ipynb b/en2024/code/Taxi-v3_DoubleQLearning.ipynb similarity index 100% rename from en2023/code/Taxi-v3_DoubleQLearning.ipynb rename to en2024/code/Taxi-v3_DoubleQLearning.ipynb diff --git a/en2023/code/Taxi-v3_ExpectedSARSA.html b/en2024/code/Taxi-v3_ExpectedSARSA.html similarity index 100% rename from en2023/code/Taxi-v3_ExpectedSARSA.html rename to en2024/code/Taxi-v3_ExpectedSARSA.html diff --git a/en2023/code/Taxi-v3_ExpectedSARSA.ipynb b/en2024/code/Taxi-v3_ExpectedSARSA.ipynb similarity index 100% rename from en2023/code/Taxi-v3_ExpectedSARSA.ipynb rename to en2024/code/Taxi-v3_ExpectedSARSA.ipynb diff --git a/en2023/code/Taxi-v3_QLearning.html b/en2024/code/Taxi-v3_QLearning.html similarity index 100% rename from en2023/code/Taxi-v3_QLearning.html rename to en2024/code/Taxi-v3_QLearning.html diff --git a/en2023/code/Taxi-v3_QLearning.ipynb b/en2024/code/Taxi-v3_QLearning.ipynb similarity index 100% rename from en2023/code/Taxi-v3_QLearning.ipynb rename to en2024/code/Taxi-v3_QLearning.ipynb diff --git a/en2023/code/Taxi-v3_SARSA.html b/en2024/code/Taxi-v3_SARSA.html similarity index 100% rename from en2023/code/Taxi-v3_SARSA.html rename to en2024/code/Taxi-v3_SARSA.html diff --git a/en2023/code/Taxi-v3_SARSA.ipynb b/en2024/code/Taxi-v3_SARSA.ipynb similarity index 100% rename from en2023/code/Taxi-v3_SARSA.ipynb rename to en2024/code/Taxi-v3_SARSA.ipynb diff --git a/en2023/code/Taxi-v3_SARSALambda.html b/en2024/code/Taxi-v3_SARSALambda.html similarity index 100% rename from en2023/code/Taxi-v3_SARSALambda.html rename to en2024/code/Taxi-v3_SARSALambda.html diff --git a/en2023/code/Taxi-v3_SARSALambda.ipynb b/en2024/code/Taxi-v3_SARSALambda.ipynb similarity index 100% rename from en2023/code/Taxi-v3_SARSALambda.ipynb rename to en2024/code/Taxi-v3_SARSALambda.ipynb diff --git a/en2023/code/Taxi-v3_SARSA_demo.html b/en2024/code/Taxi-v3_SARSA_demo.html similarity index 100% rename from en2023/code/Taxi-v3_SARSA_demo.html rename to en2024/code/Taxi-v3_SARSA_demo.html diff --git a/en2023/code/Taxi-v3_SARSA_demo.ipynb 
b/en2024/code/Taxi-v3_SARSA_demo.ipynb similarity index 100% rename from en2023/code/Taxi-v3_SARSA_demo.ipynb rename to en2024/code/Taxi-v3_SARSA_demo.ipynb diff --git a/en2023/code/TicTacToe-v0_AlphaZero_tf.html b/en2024/code/TicTacToe-v0_AlphaZero_tf.html similarity index 100% rename from en2023/code/TicTacToe-v0_AlphaZero_tf.html rename to en2024/code/TicTacToe-v0_AlphaZero_tf.html diff --git a/en2023/code/TicTacToe-v0_AlphaZero_tf.ipynb b/en2024/code/TicTacToe-v0_AlphaZero_tf.ipynb similarity index 100% rename from en2023/code/TicTacToe-v0_AlphaZero_tf.ipynb rename to en2024/code/TicTacToe-v0_AlphaZero_tf.ipynb diff --git a/en2023/code/TicTacToe-v0_AlphaZero_torch.html b/en2024/code/TicTacToe-v0_AlphaZero_torch.html similarity index 100% rename from en2023/code/TicTacToe-v0_AlphaZero_torch.html rename to en2024/code/TicTacToe-v0_AlphaZero_torch.html diff --git a/en2023/code/TicTacToe-v0_AlphaZero_torch.ipynb b/en2024/code/TicTacToe-v0_AlphaZero_torch.ipynb similarity index 100% rename from en2023/code/TicTacToe-v0_AlphaZero_torch.ipynb rename to en2024/code/TicTacToe-v0_AlphaZero_torch.ipynb diff --git a/en2023/code/TicTacToe-v0_ExhaustiveSearch.html b/en2024/code/TicTacToe-v0_ExhaustiveSearch.html similarity index 100% rename from en2023/code/TicTacToe-v0_ExhaustiveSearch.html rename to en2024/code/TicTacToe-v0_ExhaustiveSearch.html diff --git a/en2023/code/TicTacToe-v0_ExhaustiveSearch.ipynb b/en2024/code/TicTacToe-v0_ExhaustiveSearch.ipynb similarity index 100% rename from en2023/code/TicTacToe-v0_ExhaustiveSearch.ipynb rename to en2024/code/TicTacToe-v0_ExhaustiveSearch.ipynb diff --git a/en2023/code/Tiger-v0_ClosedForm.html b/en2024/code/Tiger-v0_ClosedForm.html similarity index 100% rename from en2023/code/Tiger-v0_ClosedForm.html rename to en2024/code/Tiger-v0_ClosedForm.html diff --git a/en2023/code/Tiger-v0_ClosedForm.ipynb b/en2024/code/Tiger-v0_ClosedForm.ipynb similarity index 100% rename from en2023/code/Tiger-v0_ClosedForm.ipynb rename to en2024/code/Tiger-v0_ClosedForm.ipynb diff --git a/en2023/code/Tiger-v0_Plan_demo.html b/en2024/code/Tiger-v0_Plan_demo.html similarity index 100% rename from en2023/code/Tiger-v0_Plan_demo.html rename to en2024/code/Tiger-v0_Plan_demo.html diff --git a/en2023/code/Tiger-v0_Plan_demo.ipynb b/en2024/code/Tiger-v0_Plan_demo.ipynb similarity index 100% rename from en2023/code/Tiger-v0_Plan_demo.ipynb rename to en2024/code/Tiger-v0_Plan_demo.ipynb diff --git a/en2023/code_zh.md b/en2024/code_zh.md similarity index 99% rename from en2023/code_zh.md rename to en2024/code_zh.md index 8df08d7..f845f0d 100644 --- a/en2023/code_zh.md +++ b/en2024/code_zh.md @@ -126,15 +126,15 @@ | [代码14-10](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero用的网络(PyTorch版本) | | [代码14-11](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | AlphaZero智能体(TensorFlow版本) | | [代码14-12](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero智能体(PyTorch版本) | -| [代码15-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | 调整摄像头 | -| [代码15-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | 与环境交互并渲染 | -| [代码15-3](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | 状态动作对的经验回放 | -| [代码15-4](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | 行为克隆模仿学习智能体(TensorFlow版本) | -| 
[代码15-5](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html) | 行为克隆模仿学习智能体(PyTorch版本) | -| [代码15-6](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | 生成对抗模仿学习邻近策略优化算法智能体(TensorFlow版本) | -| [代码15-7](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | 生成对抗模仿学习邻近策略优化算法智能体(PyTorch版本) | -| [代码16-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 任务“老虎”环境类`TigerEnv` | -| [代码16-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 注册环境类`TigerEnv` | -| [代码16-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 折扣因子 $\gamma=1$ 时的最优策略 | -| [代码16-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | 信念价值迭代 | -| [代码16-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | 用基于点的价值迭代求解 | +| [代码15-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 任务“老虎”环境类`TigerEnv` | +| [代码15-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 注册环境类`TigerEnv` | +| [代码15-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | 折扣因子 $\gamma=1$ 时的最优策略 | +| [代码15-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | 信念价值迭代 | +| [代码15-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | 用基于点的价值迭代求解 | +| [代码16-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | 调整摄像头 | +| [代码16-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | 与环境交互并渲染 | +| [代码16-3](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | 状态动作对的经验回放 | +| [代码16-4](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | 行为克隆模仿学习智能体(TensorFlow版本) | +| [代码16-5](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html) | 行为克隆模仿学习智能体(PyTorch版本) | +| [代码16-6](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | 生成对抗模仿学习邻近策略优化算法智能体(TensorFlow版本) | +| [代码16-7](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | 生成对抗模仿学习邻近策略优化算法智能体(PyTorch版本) | diff --git a/en2023/cover.jpg b/en2024/cover.jpg similarity index 100% rename from en2023/cover.jpg rename to en2024/cover.jpg diff --git a/en2023/gym.md b/en2024/gym.md similarity index 83% rename from en2023/gym.md rename to en2024/gym.md index ececfb6..27c2a4a 100644 --- a/en2023/gym.md +++ b/en2024/gym.md @@ -6,18 +6,18 @@ ## Table of Gym Internal -| section | \# | class | codes | note | +| section | class | codes | note | | --- | --- | --- | --- | --- | -| Section 1.6.2 | Gym Internal 1-1 | environment class `gym.Env` | [core.py](https://github.com/openai/gym/blob/master/gym/core.py) | [Note](#environment-classes) | -| Section 1.6.2 | Gym Internal 1-2 | space class `gym.space.Space` | [space.py](https://github.com/openai/gym/blob/master/gym/spaces/space.py) | [Note](#space-classes) | -| Section 1.6.2 | Gym Internal 1-3 | space class `gym.space.Box` | [box.py](https://github.com/openai/gym/blob/master/gym/spaces/box.py) | [Note](#the-class-box) | -| Section 1.6.2 | Gym Internal 1-4 | space class `gym.space.Discrete` | [discrete.py](https://github.com/openai/gym/blob/master/gym/spaces/discrete.py) | [Note](#the-class-discrete) | -| Section 1.6.2 | Gym Internal 1-5 | wrapper class `gym.Wrapper` 
| [core.py](https://github.com/openai/gym/blob/master/gym/core.py) | [Note](#wrapper-classes) | -| Section 1.6.2 | Gym Internal 1-6 | wrapper class `gym.wrapper.TimeLimit` | [time_limit.py](https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py) | [Note](#the-class-timelimit) | -| Section 4.3.1 | Gym Internal 4-1 | space class `gym.space.Tuple` | [tuple.py](https://github.com/openai/gym/blob/master/gym/spaces/tuple.py) | [Note](#the-class-tuple) | -| Section 11.3.1 | Gym Internal 11-1 | wrapper class `gym.wrapper.TransformReward` | [transform_reward.py](https://github.com/openai/gym/blob/master/gym/wrappers/transform_reward.py) | [Note](#the-class-transformreward) | -| Section 12.6.3 | Gym Internal 12-1 | wrapper class `gym.wrapper.AtariPreprocessing` | [atari_preprocessing.py](https://github.com/openai/gym/blob/master/gym/wrappers/atari_preprocessing.py) | [Note](#the-class-ataripreprocessing) | -| Section 12.6.3 | Gym Internal 12-2 | wrapper class `gym.wrapper.FrameStack` | [frame_stack.py](https://github.com/openai/gym/blob/master/gym/wrappers/frame_stack.py) | [Note](#the-class-framestack) | +| Section 1.6.2 | environment class `gym.Env` | [core.py](https://github.com/openai/gym/blob/master/gym/core.py) | [Note](#environment-classes) | +| Section 1.6.2 | space class `gym.space.Space` | [space.py](https://github.com/openai/gym/blob/master/gym/spaces/space.py) | [Note](#space-classes) | +| Section 1.6.2 | space class `gym.space.Box` | [box.py](https://github.com/openai/gym/blob/master/gym/spaces/box.py) | [Note](#the-class-box) | +| Section 1.6.2 | space class `gym.space.Discrete` | [discrete.py](https://github.com/openai/gym/blob/master/gym/spaces/discrete.py) | [Note](#the-class-discrete) | +| Section 1.6.2 | wrapper class `gym.Wrapper` | [core.py](https://github.com/openai/gym/blob/master/gym/core.py) | [Note](#wrapper-classes) | +| Section 1.6.2 | wrapper class `gym.wrapper.TimeLimit` | [time_limit.py](https://github.com/openai/gym/blob/master/gym/wrappers/time_limit.py) | [Note](#the-class-timelimit) | +| Section 4.3.1 | space class `gym.space.Tuple` | [tuple.py](https://github.com/openai/gym/blob/master/gym/spaces/tuple.py) | [Note](#the-class-tuple) | +| Section 11.3.1 | wrapper class `gym.wrapper.TransformReward` | [transform_reward.py](https://github.com/openai/gym/blob/master/gym/wrappers/transform_reward.py) | [Note](#the-class-transformreward) | +| Section 12.6.3 | wrapper class `gym.wrapper.AtariPreprocessing` | [atari_preprocessing.py](https://github.com/openai/gym/blob/master/gym/wrappers/atari_preprocessing.py) | [Note](#the-class-ataripreprocessing) | +| Section 12.6.3 | wrapper class `gym.wrapper.FrameStack` | [frame_stack.py](https://github.com/openai/gym/blob/master/gym/wrappers/frame_stack.py) | [Note](#the-class-framestack) |
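In code, the pieces listed in the table above fit together in a few lines. A minimal sketch (not one of the book's codes; it assumes Gym >= 0.26, whose `reset` returns `(observation, info)` and whose `step` returns a 5-tuple — older Gym versions return fewer values):

```python
import gym

env = gym.make('MountainCar-v0')  # gym.make returns the Env wrapped, e.g. in TimeLimit
print(type(env))                  # a gym.Wrapper subclass
print(env.observation_space)     # a Box space: 2-dimensional continuous observations
print(env.action_space)          # a Discrete space: 3 actions

# Interact for one episode with random actions (illustration only, not a useful policy).
observation, info = env.reset(seed=0)
episode_reward, done = 0., False
while not done:
    action = env.action_space.sample()  # sample an action from the Discrete space
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    done = terminated or truncated      # truncated becomes True when TimeLimit expires
env.close()
print(episode_reward)
```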
## Environment Classes
diff --git a/en2023/notation.md b/en2024/notation.md similarity index 94% rename from en2023/notation.md rename to en2024/notation.md index 83c9463..f77bf2c 100644 --- a/en2023/notation.md +++ b/en2024/notation.md @@ -41,6 +41,7 @@ In the sequel are notations throughout the book. We also occasionally follow oth | $\mathbf{g}$ | gradient vector | | $h$ | action preference | | $\text{H}$ | entropy | +| $\mathbf{I}$ | identity matrix | | $k$ | index of iteration | | $\ell$ | loss | | $\mathbb{N}$ | set of natural numbers | @@ -48,6 +49,7 @@ In the sequel are notations throughout the book. We also occasionally follow oth | $O$, $\tilde{O}$ | infinity in asymptotic notations | | $\mathsfit{O}$, $\mathsfit{o}$ | observation | | $\mathcal{O}$ | observation space | +| $\mathbf{O}$ | zero matrix | | $p$ | probability, dynamics | | $\mathbf{P}$ | transition matrix | | $\Pr$ | probability | @@ -91,12 +93,14 @@ In the sequel are notations throughout the book. We also occasionally follow oth | $\pi_ \ast$ | optimal policy | | $\pi_ \text{E}$ | expert policy in imitation learning | | $\rho$ | state–action visitation frequency; importance sampling ratio in off-policy learning | -| $\phi$ | quantile | +| $\phi$ | quantile in distributional RL | | $\boldsymbol\uprho$ | vector representation of state–action visitation frequency | | $\huge\tau$, $\tau$ | sojourn time of SMDP | | $\mathit\Psi$ | Generalized Advantage Estimate (GAE) | -| $\mathit\Omega$, $\omega$ | accumulated probability in distribution RL; (lower case only) conditional probability for partially observable tasks | +| $\mathit\Omega$, $\omega$ | cumulative probability in distributional RL; (lower case only) conditional probability for partially observable tasks | | **Other Notations** | **Description** | +| $\mathbf{0}$ | zero vector | +| $\mathbf{1}$ | a vector all of whose entries are one | | $\stackrel{\text{a.e.}}{=}$ | equal almost everywhere | | $\stackrel{\text{d}}{=}$ | share the same distribution | | $\stackrel{\text{def}}{=}$ | define |
diff --git a/en2023/notation_zh.md b/en2024/notation_zh.md similarity index 94% rename from en2023/notation_zh.md rename to en2024/notation_zh.md index 2d0efba..c0b28ca 100644 --- a/en2023/notation_zh.md +++ b/en2024/notation_zh.md @@ -41,6 +41,7 @@ | $\mathbf{g}$ | 梯度向量 | gradient vector | | $h$ | 动作偏好 | action preference | | $\text{H}$ | 熵 | entropy | +| $\mathbf{I}$ | 单位矩阵 | identity matrix | | $k$ | 迭代次数指标 | index of iteration | | $\ell$ | 损失 | loss | | $\mathbb{N}$ | 自然数集 | set of natural numbers | @@ -48,6 +49,7 @@ | $O$, $\tilde{O}$ | 渐近无穷大 | infinity in asymptotic notations | | $\mathsfit{O}$, $\mathsfit{o}$ | 观测 | observation | | $\mathcal{O}$ | 观测空间 | observation space | +| $\mathbf{O}$ | 零矩阵 | zero matrix | | $p$ | 概率值,动力 | probability, dynamics | | $\mathbf{P}$ | 转移矩阵 | transition matrix | | $\Pr$ | 概率 | probability | @@ -91,12 +93,14 @@ | $\pi_ \ast$ | 最优策略 | optimal policy | | $\pi_ \text{E}$ | 模仿学习中的专家策略 | expert policy in imitation learning | | $\rho$ | 状态动作对访问频次;异策算法中的重要性采样比率 | state–action visitation frequency; importance sampling ratio in off-policy learning | -| $\phi$ | 分位数 | quantile | +| $\phi$ | 分位数 | quantile in distributional RL | | $\boldsymbol\uprho$ | 状态动作对访问频次的向量表示 | vector representation of state–action visitation frequency | | $\huge\tau$, $\tau$ | 半Markov决策过程中的逗留时间 | sojourn time of SMDP | | $\mathit\Psi$ | 扩展的优势估计 | Generalized Advantage Estimate (GAE) | -| $\mathit\Omega$, $\omega$ | 值分布学习中的累积概率;(仅小写)部分可观测任务中的条件概率 | accumulated probability in distribution RL; (lower case only) conditional probability for partially observable tasks | +| $\mathit\Omega$, $\omega$ | 值分布学习中的累积概率;(仅小写)部分可观测任务中的条件概率 | cumulative probability in distributional RL; (lower case only) conditional probability for partially observable tasks | | **其他符号** | **含义** | **英文含义** | +| $\mathbf{0}$ | 零向量 | zero vector | +| $\mathbf{1}$ | 各元素均为1的向量 | a vector all of whose entries are one | | $\stackrel{\text{a.e.}}{=}$ | 几乎处处相等 | equal almost everywhere | | $\stackrel{\text{d}}{=}$ | 分布相同 | share the same distribution | | $\stackrel{\text{def}}{=}$ | 定义 | define |
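The matrix symbols added above typically appear together in the matrix form of the Bellman expectation equation. As a quick illustration (the names $\mathbf{v}_ \pi$ for the state-value vector and $\mathbf{r}_ \pi$ for the expected-reward vector are local to this note, not symbols fixed by the table):

$$\mathbf{v}_ \pi = \mathbf{r}_ \pi + \gamma \mathbf{P}_ \pi \mathbf{v}_ \pi \quad\Longleftrightarrow\quad \left(\mathbf{I} - \gamma \mathbf{P}_ \pi\right) \mathbf{v}_ \pi = \mathbf{r}_ \pi,$$

so for $\gamma < 1$ we get $\mathbf{v}_ \pi = \left(\mathbf{I} - \gamma \mathbf{P}_ \pi\right)^{-1} \mathbf{r}_ \pi$; the inverse exists because $\mathbf{P}_ \pi$ is a stochastic matrix, so every eigenvalue of $\gamma \mathbf{P}_ \pi$ has magnitude at most $\gamma < 1$.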
diff --git a/en2023/setup/setupmac.md b/en2024/setup/setupmac.md similarity index 92% rename from en2023/setup/setupmac.md rename to en2024/setup/setupmac.md index 7b58d97..0b809dc 100644 --- a/en2023/setup/setupmac.md +++ b/en2024/setup/setupmac.md @@ -4,13 +4,13 @@ This article introduces how to set up developing environment in a macOS (w/o or ## Part 1: Minimum Installation -This part will show how to set up a minimum environment. After this step, you are able to run codes in Chapter 1-5, 13, 16. +This part will show how to set up a minimum environment. After this step, you are able to run codes in Chapter 1-5, 13, 15. #### Install Anaconda 3 **Steps:** -- Download the installer on https://www.anaconda.com/products/distribution (Pick MacOS Graphical Installer for MacOS users). The name of installer is alike `Anaconda3-2023.09-0-MacOSX-x86_64.pkg` (or `Anaconda3-2023.09-0-MacOSX-amd64.pkg` for M chip), and the size is about 0.6 GB. +- Download the installer on https://www.anaconda.com/products/distribution (Pick MacOS Graphical Installer for MacOS users). The name of the installer is like `Anaconda3-2024.02-1-MacOSX-x86_64.pkg` (or `Anaconda3-2024.02-1-MacOSX-arm64.pkg` for M chip), and the size is about 0.6 GB. - Double click the installer to start the install wizard and install accordingly. The free space of the disk should be at least 13GB. (If the free space of the disk is too little, you may still be able to install Anaconda 3 itself, but you may not have enough free space in the follow-up steps. 13GB is the storage requirement for all steps in this article.) Record the location of the Anaconda installation. The default location is `/opt/anaconda3`. We will use the location in the sequel. #### Create a New Conda Environment @@ -67,7 +67,7 @@ This step is strongly recommended but not compulsory. ## Part 2: Install TensorFlow and/or PyTorch -This part will show how to install TensorFlow and/or PyTorch upon the minimum environment in Part 1. Codes in Chapter 6-10, 12, 14, 15 need TensorFlow and/or PyTorch. After this step, you are able to run codes in Chapter 1-9, 13, 16. +This part will show how to install TensorFlow and/or PyTorch upon the minimum environment in Part 1. Codes in Chapter 6-10, 12, 14, 16 need TensorFlow and/or PyTorch. After this step, you are able to run codes in Chapter 1-9, 13, 15. Please install the latest version of Xcode from AppStore. @@ -97,7 +97,7 @@ This book only needs CPU version. You can of course install GPU versions, which Codes in Chapter 10-11 use `gym[box2d]`. You can skip this part if you do not care about the codes in Chapter 10-11. It does not impact other parts. -This part will show how to install `gym[box2d]` upon the environment with PyTorch and/or TensorFlow in Part 2. Upon completed this part, you are able to run codes in Chapter 1-13, 16. If you complete all of Part 3.1-3.3, you are able to run codes in all chapters. +This part will show how to install `gym[box2d]` upon the environment with PyTorch and/or TensorFlow in Part 2. Upon completing this part, you are able to run codes in Chapter 1-13, 15. If you complete all of Part 3.1-3.3, you are able to run codes in all chapters. Please install the latest version of Xcode from AppStore. @@ -142,7 +142,7 @@ Please install the latest version of Xcode from AppStore. ## Part 3.3: Install PyBullet -Codes in Chapter 15 use PyBullet. You can skip this part if you do not care the codes in Chapter 15. +Codes in Chapter 16 use PyBullet. You can skip this part if you do not care about the codes in Chapter 16. Since PyBullet depends on an old version of Gym, it is better to install it in a new conda environment so that it will not pollute the current conda environment.
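After finishing Part 3.3, a quick way to confirm the separate PyBullet environment works is the sketch below (an illustration, not a step from the guide; it assumes PyBullet was installed in the fresh conda environment, and it uses the old Gym API that PyBullet expects, where `reset` returns only the observation):

```python
import gym
import pybullet_envs  # importing this module registers the *BulletEnv tasks with Gym

env = gym.make('HumanoidBulletEnv-v0')  # the environment of the Chapter 16 case study
print(env.observation_space.shape, env.action_space.shape)
observation = env.reset()               # old Gym API: reset returns the observation only
env.close()
```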
diff --git a/en2023/setup/setupwin.md b/en2024/setup/setupwin.md similarity index 94% rename from en2023/setup/setupwin.md rename to en2024/setup/setupwin.md index ee92b01..90b3b6f 100644 --- a/en2023/setup/setupwin.md +++ b/en2024/setup/setupwin.md @@ -4,13 +4,13 @@ This article introduces how to set up developing environment in a Windows 10/11 ## Part 1: Minimum Installation -This part will show how to set up a minimum environment. After this step, you are able to run codes in Chapter 1-5, 13, 16. +This part will show how to set up a minimum environment. After this step, you are able to run codes in Chapter 1-5, 13, 15. #### Install Anaconda 3 **Steps:** -- Download the installer on https://www.anaconda.com/products/distribution (Pick Windows version for Windows users).The name of installer is alike `Anaconda3-2023.09-0-Windows-x86_64.exe`, and the size is about 0.9 GB. +- Download the installer on https://www.anaconda.com/products/distribution (Pick Windows version for Windows users). The name of the installer is like `Anaconda3-2024.02-1-Windows-x86_64.exe`, and the size is about 0.9 GB. - Double click the installer to start the install wizard and install accordingly. The free space of the disk should be at least 13GB. (If the free space of the disk is too little, you may still be able to install Anaconda 3 itself, but you may not have enough free space in the follow-up steps. 13GB is the storage requirement for all steps in this article except Visual Studio.) Record the location of the Anaconda installation. The default location is `C:%HOMEPATH%\anaconda3`. We will use the location in the sequel. #### Create a New Conda Environment @@ -63,7 +63,7 @@ This step is strongly recommended but not compulsory. ## Part 2: Install TensorFlow and/or PyTorch -This part will show how to install TensorFlow and/or PyTorch upon the minimum environment in Part 1. Codes in Chapter 6-10, 12, 14, 15 need TensorFlow and/or PyTorch. After this step, you are able to run codes in Chapter 1-9, 13, 16. +This part will show how to install TensorFlow and/or PyTorch upon the minimum environment in Part 1. Codes in Chapter 6-10, 12, 14, 16 need TensorFlow and/or PyTorch. After this step, you are able to run codes in Chapter 1-9, 13, 15. #### Install Visual Studio @@ -102,7 +102,7 @@ Please install the latest version of Visual Studio before this step. Otherwise, Codes in Chapter 10-11 use `gym[box2d]`. You can skip this part if you do not care about the codes in Chapter 10-11. It does not impact other parts. -This part will show how to install `gym[box2d]` upon the environment with PyTorch and/or TensorFlow in Part 2. Upon completed this part, you are able to run codes in Chapter 1-13, 16. If you complete all of Part 3.1-3.3, you are able to run codes in all chapters. +This part will show how to install `gym[box2d]` upon the environment with PyTorch and/or TensorFlow in Part 2. Upon completing this part, you are able to run codes in Chapter 1-13, 15. If you complete all of Part 3.1-3.3, you are able to run codes in all chapters. #### Install SWIG @@ -136,7 +136,7 @@ This part will show how to install `gym[box2d]` upon the environment with PyTorc ## Part 3.3: Install PyBullet -Codes in Chapter 15 use PyBullet. You can skip this part if you do not care the codes in Chapter 15. +Codes in Chapter 16 use PyBullet. You can skip this part if you do not care about the codes in Chapter 16. Since PyBullet depends on an old version of Gym, it is better to install it in a new conda environment so that it will not pollute the current conda environment.
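On either platform, a small sanity check for Part 2 (a sketch under the assumption that both frameworks were installed; drop the lines for the one you skipped):

```python
# Both frameworks should import cleanly inside the conda environment from Part 1.
import tensorflow as tf
import torch

print(tf.__version__, torch.__version__)
print(float(tf.reduce_sum(tf.ones((2, 2)))))  # expect 4.0
print(torch.ones(2, 2).sum().item())          # expect 4.0
```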
diff --git a/en2023/toc.md b/en2024/toc.md similarity index 81% rename from en2023/toc.md rename to en2024/toc.md index c15de78..2779546 100644 --- a/en2023/toc.md +++ b/en2024/toc.md @@ -12,7 +12,7 @@ 1.4.1. Task-Based Taxonomy -1.4.2. Taxonomy based on Algorithm +1.4.2. Algorithm-based Taxonomy #### 1.5. Performance Metrics @@ -52,17 +52,23 @@ 2.2.2. Properties of Values +2.2.3. Calculate Values + +2.2.4. Calculate Initial Expected Return using Values + -2.2.3. Partial Order of Policy and Policy Improvement +2.2.5. Partial Order of Policy and Policy Improvement -#### 2.3. Discounted Visitation Frequency +#### 2.3. Visitation Frequency -2.3.1. Definition of Discounted Visitation Frequency +2.3.1. Definition of Visitation Frequency -2.3.2. Properties of Discounted Visitation Frequency +2.3.2. Properties of Visitation Frequency -2.3.3. Equivalence between Discounted Visitation Frequency and Policy +2.3.3. Calculate Visitation Frequency -2.3.4. Expectation over Discounted Distribution +2.3.4. Equivalence between Discounted Visitation Frequency and Policy + +2.3.5. Expectation over Discounted Distribution #### 2.4. Optimal Policy and Optimal Value @@ -70,9 +76,9 @@ 2.4.2. Existence of Optimal Policy -2.4.3. Properties of Optimal Values and Bellman Optimal Equations +2.4.3. Properties of Optimal Values -2.4.4. LP: Linear Programming Method +2.4.4. Calculate Optimal Values 2.4.5. Use Optimal Values to Find Optimal Strategy @@ -326,7 +332,7 @@ 8.4.4. TRPO: Trust Region Policy Optimization -#### 8.5. Importance Sampling Off-policy AC +#### 8.5. Importance Sampling Off-Policy AC #### 8.6. Case Study: Acrobot @@ -582,89 +588,100 @@ 14.5.3. Mock Interview -## 15. IL: Imitation Learning +## 15. More Agent–Environment Interfaces + +#### 15.1. Average Reward DTMDP + +15.1.1. Average Reward + +15.1.2. Differential Values + +15.1.3. Optimal Policy + +#### 15.2. CTMDP: Continuous-Time MDP -#### 15.1. $f$-Divergences and their Properties +#### 15.3. Non-Stationary MDP -#### 15.2. BC: Behavior Cloning +15.3.1. Representation of Non-Stationary States -#### 15.3. GAIL: Generative Adversarial Imitation Learning +15.3.2. Bounded Time Index -#### 15.4. Case Study: Humanoid +15.3.3. Unbounded Time Index -15.4.1. The Library PyBullet +#### 15.4. SMDP: Semi-MDP -15.4.2. Use BC to IL +15.4.1. SMDP and its Values -15.4.3. Use GAIL to IL +15.4.2. Find Optimal Policy -#### 15.5. Summary +15.4.3. HRL: Hierarchical Reinforcement Learning -#### 15.6. Exercises +15.5. POMDP: Partially Observable Markov Decision Process -15.6.1. Multiple Choices +15.5.1. DTPOMDP: Discrete-Time POMDP -15.6.2. Programming +15.5.2. Belief -15.6.3. Mock Interview +15.5.3. Belief MDP -## 16. More Agent–Environment Interfaces +15.5.4. Belief Values -#### 16.1. Average Reward DTMDP +15.5.5. Belief Values for Finite POMDP -16.1.1. Average Reward +15.5.6. Use Memory -16.1.2. Differential Values +#### 15.6. Case Study: Tiger -16.1.3. Optimal Policy +15.6.1. Compare Discounted Return Expectation and Average Reward -#### 16.2. CTMDP: Continuous-Time MDP +15.6.2. Belief MDP -#### 16.3. Non-Stationary MDP +15.6.3. Non-Stationary Belief State Values -16.3.1. Representation of Non-Stationary States +#### 15.7. Summary -16.3.2. Bounded Time Index +#### 15.8. Exercises -16.3.3. Unbounded Time Index +15.8.1. Multiple Choices -#### 16.4. SMDP: Semi-MDP +15.8.2. Programming -16.4.1.
SMDP and its Values -16.4.2. Find Optimal Policy +15.8.3. Mock Interview -16.4.3. HRL: Hierarchical Reinforcement Learning +## 16. Learning from Feedback and Imitation Learning -#### 16.5. POMDP: Partially Observable Markov Decision Process +#### 16.1. Learning from Feedback -16.5.1. DTPOMDP: Discrete-Time POMDP +16.1.1. Reward Model -16.5.2. Belief +16.1.2. PbRL: Preference-based RL -16.5.3. Belief MDP +16.1.3. RLHF: Reinforcement Learning with Human Feedback -16.5.4. Belief Values +#### 16.2. IL: Imitation Learning -16.5.5. Belief Values for Finite POMDP +16.2.1. $f$-Divergences and their Properties -16.5.6. Use Memory +16.2.2. BC: Behavior Cloning -#### 16.6. Case Study: Tiger +16.2.3. GAIL: Generative Adversarial Imitation Learning -16.6.1. Compare Discounted Return Expectation and Average Reward +#### 16.3. Application in Training GPT -16.6.2. Belief MDP +#### 16.4. Case Study: Humanoid -16.6.3. Non-Stationary Belief State Values +16.4.1. Use PyBullet -#### 16.7. Summary +16.4.2. Use BC to IL -#### 16.8. Exercises +16.4.3. Use GAIL to IL -16.8.1. Multiple Choices +#### 16.5. Summary -16.8.2. Programming +#### 16.6. Exercises -16.8.3. Mock Interview +16.6.1. Multiple Choices +16.6.2. Programming +16.6.3. Mock Interview
diff --git a/zh2023/README.md b/zh2023/README.md index b07d329..bac2be2 100644 --- a/zh2023/README.md +++ b/zh2023/README.md @@ -22,23 +22,23 @@ | 章 | 环境和闭式解 | 智能体 | | :--- | :--- | :--- | -| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | -| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | -| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | -| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) | -| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | -| 7 | [CartPole-0](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_ClosedForm.html) | VPG
[tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | -| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | -| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html), TD3 [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | -| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | -| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) 
[torch](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | -| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | -| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | -| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | -| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2023/code/GaussianMABEnv_demo.html) | -| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | -| 15 注 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | -| 16 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) +| 2 | [CliffWalking-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_ClosedForm.html) | [Bellman](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | +| 3 | [FrozenLake-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_ClosedForm.html)| [DP](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | +| 4 | [Blackjack-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_ClosedForm.html) | [MC](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | +| 5 | [Taxi-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html), [ExpectedSARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html), [QL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html), [DoubleQL](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html), 
[SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) | +| 6 | [MountainCar-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | [SARSA](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA.html), [SARSA(λ)](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html), DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html), DoubleDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html), DuelDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | +| 7 | [CartPole-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_ClosedForm.html) | VPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html), VPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html), OffPolicyVPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html), OffPolicyVPGwBaseline [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | +| 8 | [Acrobot-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_ClosedForm.html) | QAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html), AdvantageAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html), EligibilityTraceAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html), PPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html), NPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html), TRPO [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html), OffPAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) | +| 9 | [Pendulum-v1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_ClosedForm.html) | DDPG [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html), TD3
[tf](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) | +| 10 | [LunarLander-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | SQL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html), SAC [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html), SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) | +| 10 | [LunarLanderContinuous-v2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | SACwA [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) | +| 11 | [BipedalWalker-v3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | [ES](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html), [ARS](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) | +| 12 | [PongNoFrameskip-v4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | CategoricalDQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html), QR-DQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html), IQN [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | +| 13 | [BernoulliMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | +| 13 | [GaussianMAB-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | [UCB](https://zhiqingxiao.github.io/rl-book/en2024/code/GaussianMABEnv_demo.html) | +| 14 | [TicTacToe-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | AlphaZero [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | +| 15 注 | [HumanoidBulletEnv-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | BehaviorClone [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html), GAIL [tf](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) [torch](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | +| 16 | [Tiger-v0](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | [VI](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) 注: diff --git 
a/zh2023/code.md b/zh2023/code.md index 8df08d7..323e4a2 100644 --- a/zh2023/code.md +++ b/zh2023/code.md @@ -2,139 +2,139 @@ | \# | 代码内容 | | --- | --- | -| [代码1-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 查看`MountainCar-v0`的观测空间和动作空间 | -| [代码1-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 根据指定确定性策略决定动作的智能体,用于`MountainCar-v0` | -| [代码1-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 智能体和环境交互一个回合的代码 | -| [代码1-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_ClosedForm.html) | 运行100回合求平均以测试性能 | -| [代码1-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | 查看`MountainCarContinuous-v0`的观测空间和动作空间 | -| [代码1-6](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCarContinuous-v0_ClosedForm.html) | 用于求解`MountainCarContinous-v0`的智能体 | -| [代码2-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HungryFull_demo.html) | 求解示例Bellman期望方程 | -| [代码2-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HungryFull_demo.html) | 求解示例Bellman最优方程 | -| [代码2-3](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | 导入`CliffWalking-v0`环境并查看环境信息 | -| [代码2-4](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | 用Bellman方程求解状态价值和动作价值 | -| [代码2-5](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | 用线性规划求解最优价值 | -| [代码2-6](https://zhiqingxiao.github.io/rl-book/en2023/code/CliffWalking-v0_Bellman_demo.html) | 用最优动作价值确定最优确定性策略 | -| [代码3-1](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 导入`FrozenLake-v1`并查看基本信息 | -| [代码3-2](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 用策略执行一个回合 | -| [代码3-3](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 统计随机策略的回合奖励 | -| [代码3-4](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 策略评估的实现 | -| [代码3-5](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 对随机策略进行策略评估 | -| [代码3-6](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 策略改进的实现 | -| [代码3-7](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 对随机策略进行策略改进 | -| [代码3-8](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 策略迭代的实现 | -| [代码3-9](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 利用策略迭代求解最优策略并测试 | -| [代码3-10](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 价值迭代的实现 | -| [代码3-11](https://zhiqingxiao.github.io/rl-book/en2023/code/FrozenLake-v1_DP_demo.html) | 用价值迭代算法求解最优策略 | -| [代码4-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 玩一个回合 | -| [代码4-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 同策回合更新策略评估 | -| [代码4-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 绘制以状态为指标的3维数组 | -| [代码4-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 带起始探索的同策回合更新 | -| [代码4-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 基于柔性策略的同策回合更新 | -| [代码4-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 重要性采样策略评估 | -| [代码4-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Blackjack-v1_MonteCarlo_demo.html) | 柔性策略重要性采样最优策略求解 | 
-| [Code 5-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | Initialize the environment and visualize it |
-| [Code 5-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | Implementation of the SARSA agent |
-| [Code 5-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSA_demo.html) | Train the agent |
-| [Code 5-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_ExpectedSARSA.html) | Expected SARSA agent |
-| [Code 5-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_QLearning.html) | Q learning agent |
-| [Code 5-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_DoubleQLearning.html) | Double Q learning agent |
-| [Code 5-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Taxi-v3_SARSALambda.html) | SARSA $(\lambda)$ agent |
-| [Code 6-1](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | Import the mountain car environment |
-| [Code 6-2](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | Agent that always pushes right |
-| [Code 6-3](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | Tile coding |
-| [Code 6-4](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSA_demo.html) | SARSA agent with function approximation |
-| [Code 6-5](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_SARSAlambda.html) | SARSA $(\lambda)$ agent with function approximation |
-| [Code 6-6](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) | Implementation of experience replay |
-| [Code 6-7](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_tf.html) | Deep Q network agent with a target network (TensorFlow version) |
-| [Code 6-8](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DQN_torch.html) | Deep Q network agent with a target network (PyTorch version) |
-| [Code 6-9](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_tf.html) | Double deep Q network agent (TensorFlow version) |
-| [Code 6-10](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DoubleDQN_torch.html) | Double deep Q network agent (PyTorch version) |
-| [Code 6-11](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) | Dueling network (TensorFlow version) |
-| [Code 6-12](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | Dueling network (PyTorch version) |
-| [Code 6-13](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_tf.html) | Dueling deep Q network agent (TensorFlow version) |
-| [Code 6-14](https://zhiqingxiao.github.io/rl-book/en2023/code/MountainCar-v0_DuelDQN_torch.html) | Dueling deep Q network agent (PyTorch version) |
-| [Code 7-1](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_tf.html) | On-policy policy gradient agent (TensorFlow version) |
-| [Code 7-2](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPG_torch.html) | On-policy policy gradient agent (PyTorch version) |
-| [Code 7-3](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_tf.html) | On-policy policy gradient agent with baseline (TensorFlow version) |
-| [Code 7-4](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_VPGwBaseline_torch.html) | On-policy policy gradient agent with baseline (PyTorch version) |
-| [Code 7-5](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_tf.html) | Off-policy policy gradient agent (TensorFlow version) |
-| [Code 7-6](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPG_torch.html) | Off-policy policy gradient agent (PyTorch version) |
-| [Code 7-7](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) | Off-policy policy gradient agent with baseline (TensorFlow version) |
-| [Code 7-8](https://zhiqingxiao.github.io/rl-book/en2023/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | Off-policy policy gradient agent with baseline (PyTorch version) |
-| [Code 8-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_tf.html) | Action-value actor–critic algorithm (TensorFlow version) |
-| [Code 8-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_QActorCritic_torch.html) | Action-value actor–critic algorithm (PyTorch version) |
-| [Code 8-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_tf.html) | Agent implementation of the advantage actor–critic algorithm (TensorFlow version) |
-| [Code 8-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_AdvantageActorCritic_torch.html) | Agent implementation of the advantage actor–critic algorithm (PyTorch version) |
-| [Code 8-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_tf.html) | Actor–critic with eligibility traces (TensorFlow version) |
-| [Code 8-6](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_EligibilityTraceAC_torch.html) | Actor–critic with eligibility traces (PyTorch version) |
-| [Code 8-7](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) | Experience replay class for proximal policy optimization |
-| [Code 8-8](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_tf.html) | Proximal policy optimization agent (TensorFlow version) |
-| [Code 8-9](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_PPO_torch.html) | Proximal policy optimization agent (PyTorch version) |
-| [Code 8-10](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) | Conjugate gradient computation (TensorFlow version) |
-| [Code 8-11](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html) | Conjugate gradient computation (PyTorch version) |
-| [Code 8-12](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_tf.html) | Natural policy gradient agent (TensorFlow version) |
-| [Code 8-13](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_NPG_torch.html) | Natural policy gradient agent (PyTorch version) |
-| [Code 8-14](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_tf.html) | Trust region policy optimization agent (TensorFlow version) |
-| [Code 8-15](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_TRPO_torch.html) | Trust region policy optimization agent (PyTorch version) |
-| [Code 8-16](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_tf.html) | Off-policy actor–critic agent (TensorFlow version) |
-| [Code 8-17](https://zhiqingxiao.github.io/rl-book/en2023/code/Acrobot-v1_OffPAC_torch.html) | Off-policy actor–critic agent (PyTorch version) |
-| [Code 9-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) | OU process |
-| [Code 9-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_tf.html) | Agent for the deep deterministic policy gradient algorithm (TensorFlow version) |
-| [Code 9-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_DDPG_torch.html) | Agent for the deep deterministic policy gradient algorithm (PyTorch version) |
-| [Code 9-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_tf.html) | Twin delayed deep deterministic policy gradient agent (TensorFlow version) |
-| [Code 9-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Pendulum-v1_TD3_torch.html) | Twin delayed deep deterministic policy gradient agent (PyTorch version) |
-| [Code 10-1](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_ClosedForm.html) | Closed-form solution of `LunarLander-v2` |
-| [Code 10-2](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_ClosedForm.html) | Closed-form solution of `LunarLanderContinuous-v2` |
-| [Code 10-3](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_tf.html) | Soft Q learning agent (with TensorFlow) |
-| [Code 10-4](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SQL_torch.html) | Soft Q learning agent (with PyTorch) |
-| [Code 10-5](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_tf.html) | Soft actor–critic agent (TensorFlow version) |
-| [Code 10-6](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwoA_torch.html) | Soft actor–critic agent (PyTorch version) |
-| [Code 10-7](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_tf.html) | Soft actor–critic agent with automatic entropy adjustment (TensorFlow version) |
-| [Code 10-8](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLander-v2_SACwA_torch.html) | Soft actor–critic agent with automatic entropy adjustment (PyTorch version) |
-| [Code 10-9](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_tf.html) | Soft actor–critic with automatic entropy adjustment for continuous action spaces (with TensorFlow) |
-| [Code 10-10](https://zhiqingxiao.github.io/rl-book/en2023/code/LunarLanderContinuous-v2_SACwA_torch.html) | Soft actor–critic with automatic entropy adjustment for continuous action spaces (with PyTorch) |
-| [Code 11-1](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ClosedForm.html) | Closed-form solution of `BipedalWalker-v3` |
-| [Code 11-2](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html) | Evolution strategy agent |
-| [Code 11-3](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ES.html) | Train and test the evolution strategy agent |
-| [Code 11-4](https://zhiqingxiao.github.io/rl-book/en2023/code/BipedalWalker-v3_ARS.html) | Augmented random search agent |
-| [Code 12-1](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_ClosedForm.html) | Closed-form solution of `PongNoFrameskip-v4` |
-| [Code 12-2](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Wrapped environment class |
-| [Code 12-3](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Categorical deep Q network agent (TensorFlow version) |
-| [Code 12-4](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_CategoricalDQN_torch.html) | Categorical deep Q network agent (PyTorch version) |
-| [Code 12-5](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_tf.html) | Quantile regression deep Q network agent (TensorFlow version) |
-| [Code 12-6](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_QRDQN_torch.html) | Quantile regression deep Q network agent (PyTorch version) |
-| [Code 12-7](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) | Quantile network (TensorFlow version) |
-| [Code 12-8](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | Quantile network (PyTorch version) |
-| [Code 12-9](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_tf.html) | Implicit quantile network agent (TensorFlow version) |
-| [Code 12-10](https://zhiqingxiao.github.io/rl-book/en2023/code/PongNoFrameskip-v4_IQN_torch.html) | Implicit quantile network agent (PyTorch version) |
-| [Code 13-1](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Environment class `BernoulliMABEnv` |
-| [Code 13-2](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Register the environment class `BernoulliMABEnv` with Gym |
-| [Code 13-3](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Solve with the $\epsilon$-greedy policy |
-| [Code 13-4](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Estimate the average regret |
-| [Code 13-5](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Solve with the first upper confidence bound |
-| [Code 13-6](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Solve with the Bayesian upper confidence bound |
-| [Code 13-7](https://zhiqingxiao.github.io/rl-book/en2023/code/BernoulliMABEnv-v0_demo.html) | Solve with Thompson sampling |
+| [Code 1-1](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | View the observation space and action space of `MountainCar-v0` |
+| [Code 1-2](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Agent that decides actions according to a given deterministic policy, for `MountainCar-v0` |
+| [Code 1-3](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Code for one episode of agent–environment interaction |
+| [Code 1-4](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_ClosedForm.html) | Run 100 episodes and average the results to test performance |
+| [Code 1-5](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCarContinuous-v0_ClosedForm.html) | View the observation space and action space of `MountainCarContinuous-v0` |
+| [Code 1-6](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCarContinuous-v0_ClosedForm.html) | Agent that solves `MountainCarContinuous-v0` |
+| [Code 2-1](https://zhiqingxiao.github.io/rl-book/en2024/code/HungryFull_demo.html) | Solve the example Bellman expectation equation |
+| [Code 2-2](https://zhiqingxiao.github.io/rl-book/en2024/code/HungryFull_demo.html) | Solve the example Bellman optimality equation |
+| [Code 2-3](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Import the `CliffWalking-v0` environment and view its information |
+| [Code 2-4](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Solve state values and action values with the Bellman equation |
+| [Code 2-5](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Solve optimal values with linear programming |
+| [Code 2-6](https://zhiqingxiao.github.io/rl-book/en2024/code/CliffWalking-v0_Bellman_demo.html) | Determine an optimal deterministic policy from optimal action values |
+| [Code 3-1](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Import `FrozenLake-v1` and view basic information |
+| [Code 3-2](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Run one episode with a policy |
+| [Code 3-3](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Collect episode-reward statistics of the random policy |
+| [Code 3-4](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Implementation of policy evaluation |
+| [Code 3-5](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Policy evaluation of the random policy |
+| [Code 3-6](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Implementation of policy improvement |
+| [Code 3-7](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Policy improvement of the random policy |
+| [Code 3-8](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Implementation of policy iteration |
+| [Code 3-9](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Find and test the optimal policy with policy iteration |
+| [Code 3-10](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Implementation of value iteration |
+| [Code 3-11](https://zhiqingxiao.github.io/rl-book/en2024/code/FrozenLake-v1_DP_demo.html) | Find the optimal policy with the value iteration algorithm |
+| [Code 4-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Play one episode |
+| [Code 4-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy Monte Carlo policy evaluation |
+| [Code 4-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Plot a 3-D array indexed by states |
+| [Code 4-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy Monte Carlo update with exploring starts |
+| [Code 4-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | On-policy Monte Carlo update with soft policies |
+| [Code 4-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Policy evaluation with importance sampling |
+| [Code 4-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Blackjack-v1_MonteCarlo_demo.html) | Find the optimal policy by importance sampling with soft policies |
+| [Code 5-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | Initialize the environment and visualize it |
+| [Code 5-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | Implementation of the SARSA agent |
+| [Code 5-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSA_demo.html) | Train the agent |
+| [Code 5-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_ExpectedSARSA.html) | Expected SARSA agent |
+| [Code 5-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_QLearning.html) | Q learning agent |
+| [Code 5-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_DoubleQLearning.html) | Double Q learning agent |
+| [Code 5-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Taxi-v3_SARSALambda.html) | SARSA $(\lambda)$ agent |
+| [Code 6-1](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | Import the mountain car environment |
+| [Code 6-2](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | Agent that always pushes right |
+| [Code 6-3](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | Tile coding |
+| [Code 6-4](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSA_demo.html) | SARSA agent with function approximation |
+| [Code 6-5](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_SARSAlambda.html) | SARSA $(\lambda)$ agent with function approximation |
+| [Code 6-6](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) | Implementation of experience replay |
+| [Code 6-7](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_tf.html) | Deep Q network agent with a target network (TensorFlow version) |
+| [Code 6-8](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DQN_torch.html) | Deep Q network agent with a target network (PyTorch version) |
+| [Code 6-9](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_tf.html) | Double deep Q network agent (TensorFlow version) |
+| [Code 6-10](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DoubleDQN_torch.html) | Double deep Q network agent (PyTorch version) |
+| [Code 6-11](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) | Dueling network (TensorFlow version) |
+| [Code 6-12](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | Dueling network (PyTorch version) |
+| [Code 6-13](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_tf.html) | Dueling deep Q network agent (TensorFlow version) |
+| [Code 6-14](https://zhiqingxiao.github.io/rl-book/en2024/code/MountainCar-v0_DuelDQN_torch.html) | Dueling deep Q network agent (PyTorch version) |
+| [Code 7-1](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_tf.html) | On-policy policy gradient agent (TensorFlow version) |
+| [Code 7-2](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPG_torch.html) | On-policy policy gradient agent (PyTorch version) |
+| [Code 7-3](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_tf.html) | On-policy policy gradient agent with baseline (TensorFlow version) |
+| [Code 7-4](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_VPGwBaseline_torch.html) | On-policy policy gradient agent with baseline (PyTorch version) |
+| [Code 7-5](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_tf.html) | Off-policy policy gradient agent (TensorFlow version) |
+| [Code 7-6](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPG_torch.html) | Off-policy policy gradient agent (PyTorch version) |
+| [Code 7-7](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_tf.html) | Off-policy policy gradient agent with baseline (TensorFlow version) |
+| [Code 7-8](https://zhiqingxiao.github.io/rl-book/en2024/code/CartPole-v0_OffPolicyVPGwBaseline_torch.html) | Off-policy policy gradient agent with baseline (PyTorch version) |
+| [Code 8-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_tf.html) | Action-value actor–critic algorithm (TensorFlow version) |
+| [Code 8-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_QActorCritic_torch.html) | Action-value actor–critic algorithm (PyTorch version) |
+| [Code 8-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_tf.html) | Agent implementation of the advantage actor–critic algorithm (TensorFlow version) |
+| [Code 8-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_AdvantageActorCritic_torch.html) | Agent implementation of the advantage actor–critic algorithm (PyTorch version) |
+| [Code 8-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_tf.html) | Actor–critic with eligibility traces (TensorFlow version) |
+| [Code 8-6](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_EligibilityTraceAC_torch.html) | Actor–critic with eligibility traces (PyTorch version) |
+| [Code 8-7](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) | Experience replay class for proximal policy optimization |
+| [Code 8-8](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_tf.html) | Proximal policy optimization agent (TensorFlow version) |
+| [Code 8-9](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_PPO_torch.html) | Proximal policy optimization agent (PyTorch version) |
+| [Code 8-10](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) | Conjugate gradient computation (TensorFlow version) |
+| [Code 8-11](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html) | Conjugate gradient computation (PyTorch version) |
+| [Code 8-12](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_tf.html) | Natural policy gradient agent (TensorFlow version) |
+| [Code 8-13](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_NPG_torch.html) | Natural policy gradient agent (PyTorch version) |
+| [Code 8-14](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_tf.html) | Trust region policy optimization agent (TensorFlow version) |
+| [Code 8-15](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_TRPO_torch.html) | Trust region policy optimization agent (PyTorch version) |
+| [Code 8-16](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_tf.html) | Off-policy actor–critic agent (TensorFlow version) |
+| [Code 8-17](https://zhiqingxiao.github.io/rl-book/en2024/code/Acrobot-v1_OffPAC_torch.html) | Off-policy actor–critic agent (PyTorch version) |
+| [Code 9-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) | OU process |
+| [Code 9-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_tf.html) | Agent for the deep deterministic policy gradient algorithm (TensorFlow version) |
+| [Code 9-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_DDPG_torch.html) | Agent for the deep deterministic policy gradient algorithm (PyTorch version) |
+| [Code 9-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_tf.html) | Twin delayed deep deterministic policy gradient agent (TensorFlow version) |
+| [Code 9-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Pendulum-v1_TD3_torch.html) | Twin delayed deep deterministic policy gradient agent (PyTorch version) |
+| [Code 10-1](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_ClosedForm.html) | Closed-form solution of `LunarLander-v2` |
+| [Code 10-2](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_ClosedForm.html) | Closed-form solution of `LunarLanderContinuous-v2` |
+| [Code 10-3](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_tf.html) | Soft Q learning agent (with TensorFlow) |
+| [Code 10-4](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SQL_torch.html) | Soft Q learning agent (with PyTorch) |
+| [Code 10-5](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_tf.html) | Soft actor–critic agent (TensorFlow version) |
+| [Code 10-6](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwoA_torch.html) | Soft actor–critic agent (PyTorch version) |
+| [Code 10-7](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_tf.html) | Soft actor–critic agent with automatic entropy adjustment (TensorFlow version) |
+| [Code 10-8](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLander-v2_SACwA_torch.html) | Soft actor–critic agent with automatic entropy adjustment (PyTorch version) |
+| [Code 10-9](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_tf.html) | Soft actor–critic with automatic entropy adjustment for continuous action spaces (with TensorFlow) |
+| [Code 10-10](https://zhiqingxiao.github.io/rl-book/en2024/code/LunarLanderContinuous-v2_SACwA_torch.html) | Soft actor–critic with automatic entropy adjustment for continuous action spaces (with PyTorch) |
+| [Code 11-1](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ClosedForm.html) | Closed-form solution of `BipedalWalker-v3` |
+| [Code 11-2](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html) | Evolution strategy agent |
+| [Code 11-3](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ES.html) | Train and test the evolution strategy agent |
+| [Code 11-4](https://zhiqingxiao.github.io/rl-book/en2024/code/BipedalWalker-v3_ARS.html) | Augmented random search agent |
+| [Code 12-1](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_ClosedForm.html) | Closed-form solution of `PongNoFrameskip-v4` |
+| [Code 12-2](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Wrapped environment class |
+| [Code 12-3](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_tf.html) | Categorical deep Q network agent (TensorFlow version) |
+| [Code 12-4](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_CategoricalDQN_torch.html) | Categorical deep Q network agent (PyTorch version) |
+| [Code 12-5](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_tf.html) | Quantile regression deep Q network agent (TensorFlow version) |
+| [Code 12-6](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_QRDQN_torch.html) | Quantile regression deep Q network agent (PyTorch version) |
+| [Code 12-7](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) | Quantile network (TensorFlow version) |
+| [Code 12-8](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | Quantile network (PyTorch version) |
+| [Code 12-9](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_tf.html) | Implicit quantile network agent (TensorFlow version) |
+| [Code 12-10](https://zhiqingxiao.github.io/rl-book/en2024/code/PongNoFrameskip-v4_IQN_torch.html) | Implicit quantile network agent (PyTorch version) |
+| [Code 13-1](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Environment class `BernoulliMABEnv` |
+| [Code 13-2](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Register the environment class `BernoulliMABEnv` with Gym |
+| [Code 13-3](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Solve with the $\epsilon$-greedy policy |
+| [Code 13-4](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Estimate the average regret |
+| [Code 13-5](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Solve with the first upper confidence bound |
+| [Code 13-6](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Solve with the Bayesian upper confidence bound |
+| [Code 13-7](https://zhiqingxiao.github.io/rl-book/en2024/code/BernoulliMABEnv-v0_demo.html) | Solve with Thompson sampling |
 | [Code 14-1](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | Constructor of the `BoardGameEnv` class |
 | [Code 14-2](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The `is_valid()`, `has_valid()`, and `get_valid()` functions of the `BoardGameEnv` class |
 | [Code 14-3](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/kinarow.py) | The `get_winner()` function of the `KInARowEnv` class |
 | [Code 14-4](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The `next_step()` function of the `BoardGameEnv` class and its helper `get_next_state()` |
 | [Code 14-5](https://github.com/ZhiqingXiao/boardgame2/blob/master/boardgame2/env.py) | The `next_step()` function of the `BoardGameEnv` class and its helper `get_next_state()` |
-| [Code 14-6](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | Exhaustive search |
-| [Code 14-7](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_ExhaustiveSearch.html) | Self-play |
-| [Code 14-8](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | Experience replay for the AlphaZero agent |
-| [Code 14-9](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | Network used by AlphaZero (TensorFlow version) |
-| [Code 14-10](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | Network used by AlphaZero (PyTorch version) |
-| [Code 14-11](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_tf.html) | AlphaZero agent (TensorFlow version) |
-| [Code 14-12](https://zhiqingxiao.github.io/rl-book/en2023/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero agent (PyTorch version) |
-| [Code 15-1](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Adjust the camera |
-| [Code 15-2](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Interact with the environment and render |
-| [Code 15-3](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | Experience replay of state–action pairs |
-| [Code 15-4](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_tf.html) | Behavior cloning imitation learning agent (TensorFlow version) |
-| [Code 15-5](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_BC_torch.html) | Behavior cloning imitation learning agent (PyTorch version) |
-| [Code 15-6](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | Generative adversarial imitation learning PPO agent (TensorFlow version) |
-| [Code 15-7](https://zhiqingxiao.github.io/rl-book/en2023/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | Generative adversarial imitation learning PPO agent (PyTorch version) |
-| [Code 16-1](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | Environment class `TigerEnv` for the "Tiger" task |
-| [Code 16-2](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | Register the environment class `TigerEnv` |
-| [Code 16-3](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_ClosedForm.html) | Optimal policy when the discount factor is $\gamma=1$ |
-| [Code 16-4](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | Belief value iteration |
-| [Code 16-5](https://zhiqingxiao.github.io/rl-book/en2023/code/Tiger-v0_Plan_demo.html) | Solve with point-based value iteration |
+| [Code 14-6](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | Exhaustive search |
+| [Code 14-7](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_ExhaustiveSearch.html) | Self-play |
+| [Code 14-8](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | Experience replay for the AlphaZero agent |
+| [Code 14-9](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | Network used by AlphaZero (TensorFlow version) |
+| [Code 14-10](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | Network used by AlphaZero (PyTorch version) |
+| [Code 14-11](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_tf.html) | AlphaZero agent (TensorFlow version) |
+| [Code 14-12](https://zhiqingxiao.github.io/rl-book/en2024/code/TicTacToe-v0_AlphaZero_torch.html) | AlphaZero agent (PyTorch version) |
+| [Code 15-1](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Adjust the camera |
+| [Code 15-2](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_ClosedForm_demo.html) | Interact with the environment and render |
+| [Code 15-3](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) | Experience replay of state–action pairs |
+| [Code 15-4](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_tf.html) | Behavior cloning imitation learning agent (TensorFlow version) |
+| [Code 15-5](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_BC_torch.html) | Behavior cloning imitation learning agent (PyTorch version) |
+| [Code 15-6](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_tf.html) | Generative adversarial imitation learning PPO agent (TensorFlow version) |
+| [Code 15-7](https://zhiqingxiao.github.io/rl-book/en2024/code/HumanoidBulletEnv-v0_GAILPPO_torch.html) | Generative adversarial imitation learning PPO agent (PyTorch version) |
+| [Code 16-1](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | Environment class `TigerEnv` for the "Tiger" task |
+| [Code 16-2](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | Register the environment class `TigerEnv` |
+| [Code 16-3](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_ClosedForm.html) | Optimal policy when the discount factor is $\gamma=1$ |
+| [Code 16-4](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) | Belief value iteration |
+| [Code 16-5](https://zhiqingxiao.github.io/rl-book/en2024/code/Tiger-v0_Plan_demo.html) | Solve with point-based value iteration |
diff --git a/zh2023/errata.md b/zh2023/errata.md
index fe6fe92..2cc0566 100644
--- a/zh2023/errata.md
+++ b/zh2023/errata.md
@@ -30,6 +30,14 @@
 $\gamma+\gamma\sum\limits_ {\tau=1}^{+\infty}{\gamma^\tau R_ {\left(t+1\right)+\tau+1}}$
 
 #### Change to
 
 $\gamma+\gamma\sum\limits_ {\tau=0}^{+\infty}{\gamma^\tau R_ {\left(t+1\right)+\tau+1}}$
 
+## Page 30, 2nd display equation
+
+$\mathrm{E}_ \pi$
+
+#### Change to
+
+$\mathrm{E}$
+
 
 ## Page 37, last line, and page 38, first 3 lines (4 occurrences)
 
@@ -40,20 +48,33 @@
 $\sum\limits_ {t=1}^{+\infty}$
 
 #### Change to
 
 $\sum\limits_ {t=0}^{+\infty}$
 
-## Page 38, middle line of the 2nd display equation
-
-$\Pr_ \pi\left[\mathsfit{S}_ 0=\mathsfit{s}'\middle\vert\mathsfit{S}_ 0=\mathsfit{s}_ 0\right]+$
+## Page 38, final part of the 2nd-to-last display equation
+
+( $\mathsfit{s}'\in\mathcal{S}$ )
 
 #### Change to
 
-$\Pr_ \pi\left[\mathsfit{S}_ t=\mathsfit{s}'\middle\vert\mathsfit{S}_ 0=\mathsfit{s}_ 0\right]+$
+$\mathsfit{s}\in\mathcal{S}$
 
 ## Page 40, 5th display equation
 
-$\rho\left(\mathsfit{s}'\right)=\sum\limits_ {\mathsfit{s}_ 0\in\mathcal{S}}{p_ {\mathsfit{S}_ 0}\left(\mathsfit{s}\right)\sum\limits_ {t=0}^{+\infty}{\sum\limits_ {\mathsfit{s}\in\mathcal{S},\mathsfit{a}\in\mathcal{A}\left(\mathsfit{s}\right)}}}$
+
+$\rho\left(\mathsfit{s}'\right)=\sum\limits_ {\mathsfit{s}_ 0\in\mathcal{S}}{p_ {\mathsfit{S}_ 0}\left(\mathsfit{s}\right)\sum\limits_ {t=0}^{+\infty}{\sum\limits_ {\mathsfit{s}\in\mathcal{S},\mathsfit{a}\in\mathcal{A}\left(\mathsfit{s}\right)}\gamma^t\Pr_ \pi\left[\mathsfit{S}_ t=\mathsfit{s}'\middle\vert\mathsfit{S}_ 0=\mathsfit{s}\right]}}=\rho_ \pi\left(\mathsfit{s}'\right)$
 
 #### Change to
 
-$\rho\left(\mathsfit{s}'\right)=\sum\limits_ {t=0}^{+\infty}{\sum\limits_ {\mathsfit{s}\in\mathcal{S},\mathsfit{a}\in\mathcal{A}\left(\mathsfit{s}\right)}}$
+$\rho\left(\mathsfit{s}'\right)=\sum\limits_ {t=0}^{+\infty}{\gamma^t\sum\limits_ {\mathsfit{s}_ 0\in\mathcal{S}}\Pr_\pi\left[\mathsfit{S}_ t=\mathsfit{s}'\middle\vert\mathsfit{S}_ 0=\mathsfit{s}_ 0\right]p_ {\mathsfit{S}_ 0}\left(\mathsfit{s}_ 0\right)}=\sum\limits_ {t=0}^{+\infty}{\gamma^t\Pr_ \pi\left[\mathsfit{S}_ t=\mathsfit{s}'\right]}=\rho_ \pi\left(\mathsfit{s}'\right)$
+
+Note: In this proof, after $\rho\left(\mathsfit{s}\right)$ is shown to satisfy the Bellman expectation equation, it is argued to be the fixed point. Since $\rho_ \pi\left(\mathsfit{s}\right)$ satisfies the same Bellman expectation equation, the uniqueness of the fixed point gives $\rho\left(\mathsfit{s}\right)=\rho_ \pi\left(\mathsfit{s}\right)$.
+
+
+## Page 41, 1st display equation, 4th line
+
+Here the law of total probability is used
+
+#### Change to
+
+Here the law of total expectation is used
 
 ## Page 41, 1st display equation, 3rd line from the end
 
 $\mathsfit{s}\in\mathcal{S},\mathsfit{a}\in\mathcal{A}\left(\mathsfit{s}\right),$
 
 #### Change to
 
 $p_ \ast\left({\mathsfit{s'},\mathsfit{a'}|\mathsfit{s},\mathsfit{a}}\right)=\pi_ \ast\left(\mathsfit{a'}\middle\vert\mathsfit{s'}\right)p\left( \mathsfit{s'}\mid\mathsfit{s},\mathsfit{a}\right),\quad\mathsfit{s}\in\mathcal{S},\mathsfit{a}\in\mathcal{A}\left(\mathsfit{s}\right),\mathsfit{s'}\in\mathcal{S},\mathsfit{a'}\in\mathcal{A}\left(\mathsfit{s'}\right)$
 
+## Page 49, knowledge card, 2nd display equation, 1st line
+
+$\mathop{\text{minimize}}\limits_\mathbfit{y}$
+
+#### Change to
+
+$\mathop{\text{maximize}}\limits_\mathbfit{y}$
+
+
 ## Page 49, last line of the knowledge card
 
 $\mathbfit{y}\geqslant0$
 
@@ -121,6 +151,15 @@
 $p\left(\mathsfit{s}'\middle\vert\mathsfit{s},\mathsfit{a}'\right)$
 
 #### Change to
 
 $p\left(\mathsfit{s}'\middle\vert\mathsfit{s},\mathsfit{a}\right)$
 
+## Page 63, 2nd display equation, 4th line
+
+$=\gamma\sum\limits_{\mathsfit{s}'\in\mathcal{S}}{p\left(\mathsfit{s}'\middle\vert\mathsfit{s},\mathsfit{a}\right)d_\infty\left(q',q''\right)}$
+
+#### Change to
+
+$\leqslant\gamma\sum\limits_{\mathsfit{s}'\in\mathcal{S}}{p\left(\mathsfit{s}'\middle\vert\mathsfit{s},\mathsfit{a}\right)d_\infty\left(q',q''\right)}$
+
+
 ## Page 66, Algorithm 3.3, Step 1.2
 
 $\pi\left(\mathsfit{s}\right)=\arg\max_ \mathsfit{a}{q\left(\mathsfit{s},\mathsfit{a}\right)}$
 
@@ -247,6 +286,79 @@
 $\gamma^2\mathrm{E}_ {\pi\left(\boldsymbol\theta\right)}\left[\nabla{v_ {\pi\left(\boldsymbol\theta\right)}}\left(\mathsfit{S}_ 2\right)\right]$
 
+## Page 180, 1st display equation
+
+$\boldsymbol\theta_ {t+1}\leftarrow\boldsymbol\theta_ t+\alpha\gamma^t G_t\nabla\ln\pi\left(\mathsfit{A}_ t\middle\vert\mathsfit{S}_ t;\boldsymbol\theta\right),\quad t=0,1,\cdots$
+
+#### Change to
+
+$\boldsymbol\theta\leftarrow\boldsymbol\theta+\alpha\gamma^t G_t\nabla\ln\pi\left(\mathsfit{A}_ t\middle\vert\mathsfit{S}_ t;\boldsymbol\theta\right)$
+
+
+## Page 183, Algorithm 7-3, Step 2.4.2, 1st line
+
+$\gamma^t G_ t$
+
+#### Change to
+
+$\gamma^t G$
+
+
+## Page 206, knowledge card "Fisher information matrix", body text, 2nd paragraph, 1st line
+
+$\sum\limits_\mathsfit{x}{p\left(\mathsfit{x};\boldsymbol\theta\right)\nabla\ln p\left(\mathsfit{X};\boldsymbol\theta\right)}=\sum\limits_\mathsfit{x}{\nabla p\left(\mathsfit{X};\boldsymbol\theta\right)}=\nabla\sum\limits_\mathsfit{x}{p\left(\mathsfit{X};\boldsymbol\theta\right)}$
+
+#### Change to
+
+$\sum\limits_\mathsfit{x}{p\left(\mathsfit{x};\boldsymbol\theta\right)\nabla\ln p\left(\mathsfit{x};\boldsymbol\theta\right)}=\sum\limits_\mathsfit{x}{\nabla p\left(\mathsfit{x};\boldsymbol\theta\right)}=\nabla\sum\limits_\mathsfit{x}{p\left(\mathsfit{x};\boldsymbol\theta\right)}$
+
+
+## Page 208, Section 8.4.2, 1st paragraph of body text, 2nd line from the end
+
+smaller than $g_{\pi\left(\boldsymbol\theta_ k\right)}$
+
+#### Change to
+
+smaller than $g_{\pi\left(\boldsymbol\theta\right)}$
+
+
+## Page 209, 3rd-to-last display equation
+
+$\boldsymbol{0}+\mathbfit{g}\left({\boldsymbol\theta}_ k\right)\left({\boldsymbol\theta}-{\boldsymbol\theta}_ k\right)$
+
+#### Change to
+
+$0+\left[\mathbfit{g}\left({\boldsymbol\theta}_ k\right)\right]^\mathrm{T}\left({\boldsymbol\theta}-{\boldsymbol\theta}_ k\right)$
+
+
+## Page 209, 2nd-to-last display equation, 1st line
+
+$\mathbfit{g}\left({\boldsymbol\theta}_ k\right)\left({\boldsymbol\theta}-{\boldsymbol\theta}_ k\right)$
+
+#### Change to
+
+$\left[\mathbfit{g}\left({\boldsymbol\theta}_ k\right)\right]^\mathrm{T}\left({\boldsymbol\theta}-{\boldsymbol\theta}_ k\right)$
+
+
+## Page 252, body text, 2nd–3rd lines from the end
+
+in Steps 2.4.3 and 2.4.5
+
+#### Change to
+
+in Step 2.4.5
+
+
+## Page 256, 2nd display equation
+
+$\int_0^t{e^{\theta\left(\tau-t\right)}dB_ t}$
+
+#### Change to
+
+$\int_0^t{e^{\theta\left(\tau-t\right)}dB_ \tau}$
+
+
 ## Page 271, last display equation, last line
 
 $v_ \pi^{\left(\text{H}\right)}\left(\mathsfit{s},\mathsfit{a}\right)$
 
 $\pi_ \pi^{\left(\text{H}\right)}$
 
 #### Change to
 
 $\pi$
 
+## Page 272, 2nd display equation
+
+$\sum\limits_ {\mathsfit{a}}$
+
+#### Change to
+
+$\sum\limits_ {\mathsfit{a}'}$
+
+
+## Page 272, 3rd display equation
+
+$\mathrm{E}_ {\left(\mathsfit{s},\mathsfit{a}\right)\sim\rho_ \pi}\left[q_ \pi^{\left(\mathrm{H}\right)}\left(\mathsfit{s},\mathsfit{a}\right)\right]$
+
+#### Change to
+
+$\mathrm{E}_ {\left(\mathsfit{S},\mathsfit{A}\right)\sim\rho_ \pi}\left[q_ \pi^{\left(\mathrm{H}\right)}\left(\mathsfit{S},\mathsfit{A}\right)\right]$
+
+
+## Page 273, 1st display equation
+
+$\mathrm{E}_ {\left(\mathsfit{s},\mathsfit{a}\right)\sim\rho_ \pi}\left[q_ \pi^{\left(\text{soft}\right)}\left(\mathsfit{s},\mathsfit{a}\right)+\alpha^{\left(\mathrm{H}\right)}\mathrm{H}\left[\pi\left(\cdot\middle\vert\mathsfit{s}\right)\right]\right]$
+
+#### Change to
+
+$\mathrm{E}_ {\left(\mathsfit{S},\mathsfit{A}\right)\sim\rho_ \pi}\left[q_ \pi^{\left(\text{soft}\right)}\left(\mathsfit{S},\mathsfit{A}\right)+\alpha^{\left(\mathrm{H}\right)}\mathrm{H}\left[\pi\left(\cdot\middle\vert\mathsfit{S}\right)\right]\right]$
+
 
 ## Page 276, last display equation (2 occurrences)
 
 $\mathrm{E}_ {\pi\left(\theta\right)}$
 
 #### Change to
 
 $\mathrm{E}_ {\pi\left(\boldsymbol\theta\right)}$
 
@@ -321,6 +460,15 @@
 $\mathrm{E}_ {\pi\left(\boldsymbol\theta\right)}\left[\left(q_ {\pi\left(\boldsymbol\theta\right)}^{\left(柔\right)}\left(\mathsfit{s},\mathsfit{A}\right)-\alpha^{\left(\text{H}\right)}\left(\ln\pi\left(\mathsfit{A}\middle\vert\mathsfit{s};\boldsymbol\theta\right)+1\right)\right)\nabla\ln\pi\left(\mathsfit{A}\middle\vert\mathsfit{s};\boldsymbol\theta\right)\right]$
 
 #### Change to
 
 $\mathrm{E}_ {\mathsfit{A}\sim\pi\left(\cdot\middle\vert\mathsfit{s};\boldsymbol\theta\right)}\left[\left(q_ {\pi\left(\boldsymbol\theta\right)}^{\left(柔\right)}\left(\mathsfit{s},\mathsfit{A}\right)-\alpha^{\left(\text{H}\right)}\left(\ln\pi\left(\mathsfit{A}\middle\vert\mathsfit{s};\boldsymbol\theta\right)+1\right)\right)\nabla\ln\pi\left(\mathsfit{A}\middle\vert\mathsfit{s};\boldsymbol\theta\right)\right]$
 
+## Page 283, 3rd display equation, one occurrence in each of two lines (2 occurrences)
+
+$\mathrm{E}_{\mathsfit{A}'\sim\pi\left(\boldsymbol\theta\right)}$
+
+#### Change to
+
+$\mathrm{E}_{\mathsfit{A}'\sim\pi\left(\cdot\middle\vert\mathsfit{s};\boldsymbol\theta\right)}$
+
 
 ## Page 284, Algorithm 10-2, Step 2.2.2.2, and page 285, Algorithm 10-3, Step 2.2.2.3 (2 occurrences)
 
 $U_ t^{\left(q\right)}\leftarrow R_ {t+1}+$
 
@@ -348,6 +496,15 @@
 $U_ t^{\left(v\right)}$
 
 #### Change to
 
 $U^{\left(v\right)}$
 
+## Page 285, Algorithm 10-3, Step 2.2.2.2
+
+#### Delete
+
+( $(\mathsfit{S},\mathsfit{A},R$
+
+$\mathsfit{S}',D')\in\mathcal{B}$ )
+
 
 ## Page 288, Code Listing 10-2
 
 ```python
 ```
 
@@ -381,15 +538,33 @@
+## Page 322, body text, 2nd paragraph, 1st line
+
+on $\left(Q_ p,d_{\mathrm{supW},p}\right)$
+
+#### Change to
+
+on $\left(\mathcal{Q}_ p,d_{\mathrm{supW},p}\right)$
+
+
 ## Page 326, 2nd line from the end
 
 categorical distribution $p^{\left(\cdot\right)}\left(\cdot,\cdot\right)$ and auto
 
-#### Becomes
+#### Change to
 
 categorical distribution $p^{\left(\cdot\right)}\left(\cdot,\cdot;\boldsymbol{w}\right)$ and auto
 
+## Page 328, Algorithm 12-1, Step 2.2.2.4, last line, and page 329, Algorithm 12-2, Step 2.2.2.4, 3rd line (one in each, 2 occurrences)
+
+$\sum\limits_{i\in\mathcal{I}}$
+
+#### Change to
+
+$\sum\limits_{j\in\mathcal{I}}$
+
 
 ## Pages 330–331, knowledge card, paragraphs 2–5
 
 Consider the quantile $\phi_ X\left(\omega\right)$ of a random variable $X$ at a given cumulative probability $\omega\in\left[0,1\right]$. The probability that $\phi_ X\left(\omega\right)>X$ is $\omega$, ……
 
 $\frac1c\sum\limits_ {i=0}^{c-1}{\ell_ \text{QR}\left(x_ i-\phi\right)}$
 
@@ -451,6 +626,8 @@
 Let $\mathsfit{a}_ \ast$ denote the optimal action.
 
+Author's note: Substituting $c=c_\mathsfit{a}$ , $\bar{X}=\tilde{q}_ {c_ \mathsfit{a}}\left(\mathsfit{a}\right)$ , and $\varepsilon=\sqrt{\frac{2\ln\kappa}{c_ \mathsfit{a}}}$ into the Hoeffding inequalities $\Pr\left[\bar{X}-\mathrm{E}\left[\bar{X}\right]\geqslant\varepsilon\right]\leqslant\exp\left(-2c\varepsilon^2\right)$ and $\Pr\left[\bar{X}-\mathrm{E}\left[\bar{X}\right]\leqslant-\varepsilon\right]\leqslant\exp\left(-2c\varepsilon^2\right)$ gives $\Pr\left[\tilde{q}_ {c_ \mathsfit{a}}\left(\mathsfit{a}\right)-q\left(\mathsfit{a}\right)\geqslant\sqrt{\frac{2\ln\kappa}{c_ \mathsfit{a}}}\right]\leqslant\frac{1}{\kappa^4}$ and $\Pr\left[\tilde{q}_ {c_ \mathsfit{a}}\left(\mathsfit{a}\right)-q\left(\mathsfit{a}\right)\leqslant-\sqrt{\frac{2\ln\kappa}{c_ \mathsfit{a}}}\right]\leqslant\frac{1}{\kappa^4}$ , so $\Pr\left[q\left(\mathsfit{a}\right)+\sqrt{\frac{2\ln\kappa}{c_\mathsfit{a}}}\leqslant\tilde{q}_ {c_ \mathsfit{a}}\left(\mathsfit{a}\right)\right]\leqslant\frac{1}{\kappa^4}$ and $\Pr\left[\tilde{q}_ {c_ \mathsfit{a}}\left(\mathsfit{a}\right)+\sqrt{\frac{2\ln\kappa}{c_ \mathsfit{a}}}\leqslant q\left(\mathsfit{a}\right)\right]\leqslant\frac{1}{\kappa^4}$ . Taking $\mathsfit{a}=\mathsfit{a}_ \ast$ and $c_ \ast=c_ \mathsfit{a}$ in the latter completes the proof.
+
 
 ## Page 363, 4th line from the end
 
@@ -579,7 +756,7 @@
 Gaussien
 
 #### Change to
 
 Gaussian
 
-## p. 392, body text, line 4
+## Page 392, body text, line 4
 
 $\left(R-r\left(\mathsfit{s},\mathsfit{a}\right)\right)^2$
 
@@ -614,6 +791,7 @@
 $d_ \rm{TV}\left(\rho_ {\pi'}\left(\cdot,\cdot\right)\middle\|\rho_ {\pi''}\left(\cdot,\cdot\right)\right)$
 
 #### Change to
 
 $d_ \rm{TV}\left(\rho_ {\pi'}\left(\cdot,\cdot\right)\middle\|\rho_ {\pi''}\left(\cdot,\cdot\right)\right)\leqslant\frac{1}{1-\gamma}\mathrm{E}_ {\mathsfit{S}\sim\rho_ {\pi''}}\left[d_ \mathrm{TV}\left(\pi'\left(\cdot\middle\vert\mathsfit{S}\right)\middle\|\pi''\left(\cdot\middle\vert\mathsfit{S}\right)\right)\right]$
+
 
 ## Page 428, 2nd-to-last display equation
 
 $\sum\limits_ {\left(\mathsfit{S},\mathsfit{A}\right)\in\mathcal{D}}h\left(\mathsfit{A}\middle\vert\mathsfit{S};\boldsymbol\theta\right)-\mathop{\mathrm{logsumexp}}\limits_ {\mathsfit{a}\in\mathcal{A}\left(\mathsfit{S}\right)}h\left(\mathsfit{a}\middle\vert\mathsfit{S};\boldsymbol\theta\right)$
 
@@ -645,13 +823,22 @@
 $=\sum\limits_ {\mathsfit{s'},\tilde r}{\tilde p\left(\mathsfit{s'},\tilde r\mid\mathsfit{s},\mathsfit{a}\right)\left[\tilde r-{\bar r}_ \pi+{\tilde v}_ \pi\left(\mathsfit{s'}\right)\right]}$
 
 #### Change to
 
 $=\sum\limits_ {\mathsfit{s'},r}{p\left(\mathsfit{s'},r\middle\vert\mathsfit{s},\mathsfit{a}\right)\left[r-{\bar r}_ \pi+{\tilde v}_ \pi\left(\mathsfit{s'}\right)\right]}$
 
-## Page 451, 4th display equation, 2nd line
-
-$p_ \pi\left(\mathsfit{s'}\middle\vert\mathsfit{s},\mathsfit{a}\right)$
+## Page 451, 2nd display equation, 1st line
+
+$\sum\limits_{\mathsfit{s}',r}{\tilde{p}\left(\mathsfit{s}',\tilde{r}\middle\vert\mathsfit{s},\mathsfit{a}\right)}$
 
 #### Change to
 
-$p\left(\mathsfit{s'}\middle\vert\mathsfit{s},\mathsfit{a}\right)$
+$\sum\limits_{\mathsfit{s}',\tilde{r}}{\tilde{p}\left(\mathsfit{s}',\tilde{r}\middle\vert\mathsfit{s},\mathsfit{a}\right)}$
+
+
+## Page 451, 4th display equation, 2nd line
+
+$\sum\limits_{\mathsfit{s}'}p_ \pi\left(\mathsfit{s'}\middle\vert\mathsfit{s},\mathsfit{a}\right)$
+
+#### Change to
+
+$\sum\limits_{\mathsfit{s}',\mathsfit{a}'}p_ \pi\left(\mathsfit{s'},\mathsfit{a}'\middle\vert\mathsfit{s},\mathsfit{a}\right)$
 
 ## Page 453, 4th display equation
 
 $\tilde{q}_ \pi\left(\mathsfit{S}_ {t+1},\mathsfit{A}_ {t+1}\right)\left(1-D'\right)$
 
 #### Change to
 
 $\tilde{q}_ \pi\left(\mathsfit{S}_ {t+1},\mathsfit{A}_ {t+1}\right)\left(1-D_ {t+1}\right)$
 
+## Page 454, last display equation
+
+$\tilde{q}\left(\mathsfit{S},\mathsfit{A}\right)$
+
+#### Change to
+
+$\tilde{q}\left(\mathsfit{S},\mathsfit{A};\mathbfit{w}\right)$
+
 
 ## Page 455, Algorithm 16-1, Step 2.2.6, 2nd line
 
 $\nabla q\left(\mathsfit{S},\mathsfit{A},\mathbfit{w}\right)$
 
 #### Change to
 
 $\nabla\tilde{q}\left(\mathsfit{S},\mathsfit{A},\mathbfit{w}\right)$
 
+## Page 456, Algorithm 16-2, Step 2.2.4, 1st and 3rd lines (2 occurrences)
+
+#### Delete
+
+$\gamma$
+
 
 ## Page 462, 2nd display equation (1 occurrence) and 4th display equation (2 occurrences), 3 occurrences in total
 
 $\frac1h\sum\limits_ {0<\tau\leqslant h}{\gamma^\tau R_ {t+\tau}}$
 
@@ -759,6 +962,23 @@
 $\omega\left(\mathsfit{r},\mathsfit{s'},\mathsfit{o}\middle\vert b,\mathsfit{a}\right)$
 
 #### Change to
 
 $\omega\left(r,\mathsfit{s'},\mathsfit{o}\middle\vert b,\mathsfit{a}\right)$
 
+## Page 475, Table 16-8, middle column
+
+| Action $\mathsfit{a}$ |
+| ---- |
+| $\mathsfit{a}_ \text{听}$ |
+| $\mathsfit{a}_ \text{听}$ |
+| $\mathsfit{a}_ \text{听}$ |
+
+#### Change to
+
+| Action $\mathsfit{a}$ |
+| ---- |
+| $\mathsfit{a}_ \text{左}$ |
+| $\mathsfit{a}_ \text{右}$ |
+| $\mathsfit{a}_ \text{听}$ |
+
 
 ## Page 477, 1st display equation
 
 $q_ \pi\left(b,\mathsfit{a}\right)=r\left(b,\mathsfit{a}\right)+\gamma\sum\limits_ \mathsfit{o}{\omega\left(\mathsfit{o}\middle\vert b,\mathsfit{a}\right)v_ \pi\left(u\left(b,\mathsfit{a},\mathsfit{o}\right)\right)}$
 
 #### Change to
 
 $q_ \pi\left(b,\mathsfit{a}\right)=r\left(b,\mathsfit{a}\right)+\gamma\sum\limits_ \mathsfit{o}{\omega\left(\mathsfit{o}\middle\vert b,\mathsfit{a}\right)v_ \pi\left(\mathfrak{u}\left(b,\mathsfit{a},\mathsfit{o}\right)\right)}$
 
+## Page 480, 2nd-to-last display equation
+
+$\prod\limits_{\tau'=\tau}^{t-1}$
+
+#### Change to
+
+$\prod\limits_{\tau'=t}^{\tau-1}$
+
+
+## Page 480, last display equation
+
+$\sum\limits_{\tau=0}^{t_ \text{max}-1}$
+
+#### Change to
+
+$\sum\limits_{\tau=0}^{t_ \text{max}-t-1}$
+
 
 ## Page 481, 1st display equation
 
-#### Change $\mathsfit{s}$ to $\mathsfit{x}$ therein (7 occurrences), and $\mathcal{S}$ to $\mathcal{X}$ (2 occurrences).
+Change $\mathsfit{s}$ to $\mathsfit{x}$ therein (7 occurrences), and $\mathcal{S}$ to $\mathcal{X}$ (2 occurrences).
+
+
+## Page 489, 3rd line
+
+The belief state is
+
+#### Change to
+
+The belief state value is
diff --git a/zh2023/notation.md b/zh2023/notation.md
index a4bca57..575f098 100644
--- a/zh2023/notation.md
+++ b/zh2023/notation.md
@@ -86,7 +86,7 @@
 | $\boldsymbol\rho$ | vector representation of visitation frequencies | vector representation of visitation frequency |
 | $\phi$ | quantile | quantile |
 | ${\huge\tau}$, $\tau$ | sojourn time in a semi-Markov decision process | sojourn time of SMDP |
-| $\mathit\Omega$, $\omega$ | cumulative probability in value-distribution learning; (lowercase only) conditional probability in partially observable tasks | accumulated probability in distribution RL; (lower case only) conditional probability for partially observable tasks |
+| $\mathit\Omega$, $\omega$ | cumulative probability in value-distribution learning; (lowercase only) conditional probability in partially observable tasks | cumulative probability in distribution RL; (lower case only) conditional probability for partially observable tasks |
 | $\mathit\Psi$ | generalized advantage estimate | Generalized Advantage Estimate (GAE) |
 | **Other symbols** | **Meaning** | **Meaning in English** |
 | $\stackrel{\text{d}}{=}$ | identically distributed | share the same distribution |
diff --git a/zh2023/setup/setupmac.md b/zh2023/setup/setupmac.md
index ded9db2..017924e 100644
--- a/zh2023/setup/setupmac.md
+++ b/zh2023/setup/setupmac.md
@@ -10,7 +10,7 @@
 **Steps:**
 
-- Download the Anaconda 3 installer from https://www.anaconda.com/products/distribution (choose the macOS Graphical installer). The installer name looks like `Anaconda3-2023.09-0-MacOSX-x86_64.pkg` (for Apple-silicon machines it looks like `Anaconda3-2023.09-0-MacOSX-amd64.pkg`); the size is about 0.6 GB.
+- Download the Anaconda 3 installer from https://www.anaconda.com/products/distribution (choose the macOS Graphical installer). The installer name looks like `Anaconda3-2024.02-1-MacOSX-x86_64.pkg` (for Apple-silicon machines it looks like `Anaconda3-2024.02-1-MacOSX-arm64.pkg`); the size is about 0.6 GB.
 - Double-click the installer to launch the setup wizard and complete the installation. Install on a disk with more than 13 GB of free space. (With less space the Anaconda 3 installation itself can still finish, but there will not be enough room for the later steps; 13 GB covers all subsequent steps.) Note down the Anaconda installation path during installation. The default is `/opt/anaconda3`; later steps will use this path.
 
 #### Create a new conda environment
@@ -159,4 +159,3 @@
 ```
 pip install --upgrade pybullet
 ```
-
diff --git a/zh2023/setup/setupwin.md b/zh2023/setup/setupwin.md
index a32639b..2bde264 100644
--- a/zh2023/setup/setupwin.md
+++ b/zh2023/setup/setupwin.md
@@ -10,7 +10,7 @@
 **Steps:**
 
-- Download the Anaconda 3 installer from https://www.anaconda.com/products/distribution (choose the Windows installer). The installer name looks like `Anaconda3-2023.09-0-Windows-x86_64.exe`; the size is about 0.9 GB.
+- Download the Anaconda 3 installer from https://www.anaconda.com/products/distribution (choose the Windows installer). The installer name looks like `Anaconda3-2024.02-1-Windows-x86_64.exe`; the size is about 0.9 GB.
 - Double-click the installer to launch the setup wizard and complete the installation. Install on a disk with more than 13 GB of free space. (With less space the Anaconda 3 installation itself can still finish, but there will not be enough room for the later steps; 13 GB covers all subsequent steps except installing Visual Studio.) Note down the Anaconda installation path during installation. The default is `C:%HOMEPATH%\anaconda3`; later steps will use this path.
 
 #### Create a new conda environment
@@ -152,4 +152,3 @@
 The Visual Studio Community edition is free and sufficient; installing the Community edition is enough.
 
 ```
 pip install --upgrade pybullet
 ```
-