Hi, thanks for the questions! It's expected that the end step for all episodes is 25 (the max number of steps is set to 25 by default, and it can stay at 25 even if you enable early stopping when the goal is achieved). As for the difference between `test_reward` and `test_bench/step_reward`, it comes down to two things. First, the reward and benchmark loggers log things a bit differently: as far as I remember from my notes, the reward logger resets at the end of each episode, whereas the benchmark logger resets only once at the collector's init(), so the trends can differ. Second, `test_bench/step_reward` additionally divides each episode's reward by the number of steps in that episode (i.e., it reports an average reward per step). Please check the code for the reward and benchmark loggers as well as `offpolicy_trainer` for your own understanding, and feel free to write your own logger for your purposes! Let me know if you have any other questions, thanks!
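To make the two aggregation schemes concrete, here is a minimal sketch (not the repo's actual logger code; `episode_rewards` and `episode_lengths` are hypothetical placeholders) of a per-episode return versus a per-step average:

```python
# Minimal sketch of the two metrics described above; toy numbers only.
import numpy as np

episode_rewards = np.array([-120.0, -95.0, -110.0])  # total return of each test episode
episode_lengths = np.array([25, 25, 25])             # max episode length is 25 by default

# test_reward-style aggregation: mean of per-episode returns,
# with the reward logger resetting at the end of each episode.
test_reward = episode_rewards.mean()

# test_bench/step_reward-style aggregation: each episode's return is additionally
# divided by the number of steps (average reward per step); the benchmark logger
# also resets only once at the collector's init(), so it accumulates across the run.
step_reward = (episode_rewards / episode_lengths).mean()

print(test_reward, step_reward)
```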
Thanks for your reply to the previous question. I followed your reminder, checked the code for `SimpleSpreadBenchmarkLogger`, and found what I think is the key to the difference between these two metrics (i.e., `test_reward` and `test_bench/step_reward`). Here is the code: `alignment/map/tianshou/env/utils.py`, line 232 in 58754e4.

Here you only add the info of the first agent (i.e., `elem['n'][0]`). However, with the default setting there are 5 agents, and the length of `elem['n']` is 5. Moreover, each element in `elem['n']` carries different info for a different agent, so the rewards can differ. This does not happen in the computation of `test_reward`, so their trends are different. Could you check whether my understanding is correct? Thanks!
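For illustration, here is a minimal, hypothetical sketch (not the actual code at `alignment/map/tianshou/env/utils.py`; the values in `info['n']` are made up) of why reading only `elem['n'][0]` can diverge from a metric that reflects every agent:

```python
# Hypothetical per-agent info for one step of simple_spread with 5 agents;
# each agent has its own value, so the entries differ.
info = {'n': [-1.2, -0.8, -1.5, -0.9, -1.1]}

first_agent_only = info['n'][0]          # what the benchmark logger accumulates
all_agents_sum = sum(info['n'])          # an aggregate that uses every agent
all_agents_mean = all_agents_sum / len(info['n'])

print(first_agent_only, all_agents_sum, all_agents_mean)
```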
Originally posted by @zixianma in #3 (comment)