SARSA training procedure:
4. Sample from the current policy: ã_{t+1} ~ π_now(· | s_{t+1}). Note that ã_{t+1} is only a hypothetical action; the agent does not execute it.

In other references, however, the SARSA algorithm uses ã_{t+1} to update a at the end of the current iteration (that is, in the next step the agent will definitely execute ã_{t+1} in s̃_{t+1}):
s = s̃_{t+1}
a = ã_{t+1}
That is not correct. The policy can be updated at any time, so there is no guarantee that the action actually taken at time t+1 is ã_{t+1}.
The last step of each iteration is precisely the assignment to s and a; it is Q-learning, by contrast, where the next action has to be determined by sampling afresh.
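The carry-over in question can be sketched as follows. This is a minimal tabular SARSA loop on a toy chain environment of my own; the environment, reward, and hyperparameters are illustrative assumptions, not the book's or the linked repository's code:

```python
import random

# Toy 5-state chain: start at state 0, reaching the rightmost state
# ends the episode with reward 1 (illustrative assumption).
N_STATES = 5
ACTIONS = [0, 1]  # 0: move left, 1: move right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.3

def step(s, a):
    # Toy dynamics: reaching state N_STATES-1 terminates with reward 1.
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(episodes=200, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, a = 0, eps_greedy(Q, 0)      # sample the very first action
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2)      # a' sampled from the current policy
            target = r if done else r + GAMMA * Q[(s2, a2)]
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            # Last step of the iteration: a' is carried over and WILL be
            # executed next -- this is what makes SARSA on-policy.
            # (Q-learning would instead bootstrap with max_a Q[(s2, a)]
            # and re-sample the executed action at the top of the loop.)
            s, a = s2, a2
    return Q

Q = sarsa()
```

Note that carrying `a2` over is only exactly consistent with "ã_{t+1} ~ π_now" because π_now (here, ε-greedy over Q) is not changed between the sampling of `a2` and its execution; if the policy were updated in between, the executed action would have to be re-sampled.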
The implementation code is written the same way: https://hrl.boyuai.com/chapter/1/%E6%97%B6%E5%BA%8F%E5%B7%AE%E5%88%86%E7%AE%97%E6%B3%95#53-sarsa-%E7%AE%97%E6%B3%95
Their way of writing it really isn't rigorous... you have to assume the policy stays unchanged between steps for an implementation like theirs to be valid.