Difference between `kl_ctl_value`, `kl_per_token`, etc. in Tensorboard logs? #558

doyled-it · 2023-09-12T16:50:35Z


Title says it all and I can't find any formal definitions for these terms online. Can anyone define these for me?

kl_per_token
kl_ctl_value
approx_kl


maxreciprocate · 2023-09-13T16:56:19Z

Maintainer

Sure!

kl_per_token is the KL divergence between the current policy and the initial model's policy, $D_\text{KL}(\pi_t \,\|\, \pi_0)$ [1], measured per token with an unbiased estimator [2].
kl_ctl_value is the scalar coefficient of the KL penalty, also referred to as $\beta$ or $\lambda_{KL}$; the current name just comes from OpenAI's code [3].
approx_kl is the KL divergence measured during PPO minibatch updates, $D_\text{KL}(\pi_{t+1} \,\|\, \pi_t)$ [4].

[1] https://arxiv.org/abs/1909.08593 Section 2.2
[2] http://joschu.net/blog/kl-approx.html
[3] https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L151
[4] https://github.com/vwxyzjn/cleanrl/blob/f36d4a6426b5b08efdab4f65d7f696983cb04d4c/cleanrl/ppo.py#L253C18-L253C18
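As a rough illustration of how the three logged quantities relate, here is a minimal sketch in plain Python. All numbers and variable names (`logp_current`, `logp_initial`, `logp_old`, the value `0.05`) are made up for the example, and the sign of each log-ratio follows the convention of the blog post in [2]; it is an assumption, not a copy of the trainer's code.

```python
import math

def k3(log_r):
    """Per-sample estimate (r - 1) - log r from the blog post in [2]; never negative."""
    return math.expm1(log_r) - log_r

# Hypothetical per-token log-probabilities of one sampled response.
logp_current = [-1.20, -0.70, -2.10]  # current policy pi_t (the tokens were sampled from it)
logp_initial = [-1.00, -0.90, -2.00]  # frozen initial model pi_0
logp_old = [-1.25, -0.65, -2.05]      # policy just before this PPO minibatch update

# kl_per_token: blog convention with samples x ~ q = pi_t and r = p(x) / q(x), p = pi_0.
log_r_ref = [i - c for c, i in zip(logp_current, logp_initial)]
kl_per_token = sum(k3(x) for x in log_r_ref) / len(log_r_ref)

# kl_ctl_value (beta) scales the per-token penalty -beta * log(pi_t / pi_0) added to the reward.
kl_ctl_value = 0.05  # assumed value, for illustration only
kl_penalty = [-kl_ctl_value * (c - i) for c, i in zip(logp_current, logp_initial)]

# approx_kl: the same estimator applied to the policies before and after the PPO update.
log_r_ppo = [c - o for c, o in zip(logp_current, logp_old)]
approx_kl = sum(k3(x) for x in log_r_ppo) / len(log_r_ppo)

print(kl_per_token, kl_penalty, approx_kl)
```

Under these assumptions, kl_per_token and approx_kl are both non-negative averages, while kl_ctl_value only sets how strongly drift from the initial model is penalized.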

3 replies

doyled-it Sep 14, 2023
Author

So kl_per_token is probably the best estimate as to how far the model has strayed from the original model?

maxreciprocate Sep 14, 2023
Maintainer

Correct, that's what it's used for

MITMhsu Oct 24, 2023

Hi, I just read the blog post on the unbiased estimator. It seems that when we estimate KL[q, p], r should be r = p(x)/q(x). But in the PPO trainer, I see logratio = logprobs - logprobs_ref. Should this be logprobs_ref - logprobs?
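The convention from the blog post can be checked on a toy example. The sketch below uses two made-up discrete distributions `q` and `p` and computes the expectation under `q` exactly (no sampling), once with r = p(x)/q(x) and once with the ratio flipped. It only tests the blog's formula; it does not inspect what the trainer actually logs.

```python
import math

# Hypothetical distributions over three outcomes; samples are assumed to come from q.
q = [0.5, 0.3, 0.2]
p = [0.2, 0.5, 0.3]

def kl(a, b):
    """Exact KL[a, b] = sum_x a(x) * log(a(x) / b(x))."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def expected_k3_under_q(ratios):
    """Exact expectation over x ~ q of (r - 1) - log r for the given per-outcome ratios."""
    return sum(qi * ((r - 1.0) - math.log(r)) for qi, r in zip(q, ratios))

blog_convention = expected_k3_under_q([pi / qi for pi, qi in zip(p, q)])  # r = p / q
flipped_ratio = expected_k3_under_q([qi / pi for pi, qi in zip(p, q)])    # r = q / p

print("KL[q, p]          :", kl(q, p))
print("E_q[k3], r = p / q:", blog_convention)
print("E_q[k3], r = q / p:", flipped_ratio)
```

With samples from q, the r = p/q form matches KL[q, p] exactly in expectation, while the flipped ratio gives a different number, so the sign of the log-ratio matters for unbiasedness.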
