
Continue on the previous question. #4

Open

JeremyLinky opened this issue Dec 28, 2022 · 0 comments

JeremyLinky commented Dec 28, 2022

> Hi thanks for the questions! It's reasonable that the end steps for all episodes is 25 (I believe the max number of steps is set to 25 by default, and it can remain 25 even if you enable early stopping when the goal is achieved). As for the difference between `test_reward` and `test_bench/step_reward`, it's due to two major differences. First, the reward and benchmark loggers log things a little bit differently: (as far as I remember from my notes) the reward logger resets at the end of each episode whereas the benchmark logger resets only once at the collector's init(), so the trends can be different. Second, the `test_bench/step_reward` additionally divides the episode reward by the number of steps in each episode (i.e. avg reward per step). Please check the code for the reward and benchmark logger as well as `offpolicy_trainer` for your own understanding, and feel free to write your own logger for your purposes! Lmk if you have any other questions, thanks!

Originally posted by @zixianma in #3 (comment)

Thanks for your reply to the previous question. I followed your suggestion and checked the code for `SimpleSpreadBenchmarkLogger`, and I found the line that might be the key to the difference between the two metrics (i.e., `test_reward` and `test_bench/step_reward`). Here it is:

```python
bench_data = elem['n'][0]
```

Here, only the first agent's info (i.e., `elem['n'][0]`) is added. However, under the default setting there are 5 agents, so `elem['n']` has length 5, and each element of `elem['n']` holds a different agent's info, which means their rewards can differ. This does not happen in the computation of `test_reward`, which would explain why the two trends differ. Could you help me check whether my understanding is correct? Thanks!
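
To illustrate what I mean, here is a toy sketch of the two aggregation styles as I understand them (the variable names and the fake reward values are mine, not from the repo):

```python
import numpy as np

# Toy example: 5 agents, 25-step episode, with a fake per-step info list
# that mimics elem['n'] (one dict per agent at every step).
rng = np.random.default_rng(0)
num_agents, num_steps = 5, 25
episode_infos = [
    [{"reward": float(rng.normal())} for _ in range(num_agents)]
    for _ in range(num_steps)
]

# My reading of SimpleSpreadBenchmarkLogger: only the first agent's info
# (elem['n'][0]) is accumulated, and the total is divided by the number of
# steps, giving an average reward per step for agent 0 only.
step_reward_like = sum(step[0]["reward"] for step in episode_infos) / num_steps

# My reading of test_reward: the episode return reported by the environment,
# which reflects all agents' rewards and is not divided by the episode length.
episode_return_like = sum(
    sum(agent["reward"] for agent in step) for step in episode_infos
)

print(f"step_reward-like value:    {step_reward_like:+.3f}")
print(f"episode-return-like value: {episode_return_like:+.3f}")
```

Since the first value depends only on agent 0 and is averaged over steps while the second reflects all agents over the whole episode, it seems natural that the two curves trend differently.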
