Soft transition from Average to Max backup operator #940
Comments
This is the part of the code where the average happens: line 267 in db42c60.
1+n/100 is about right based on some playing around with numbers.
On Discord, @fersbery (@DanielUranga) pointed out that this idea is very similar to this previous PR (though the transition implementation is not): #243
@Naphthalin Do you happen to know if there is a cheap way to compute or approximate Lp norms?
If we only use powers of 2 for p, which shouldn't cause problems, it's not too expensive: only 1 pow call.
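A rough sketch of that trick (the function name and the restriction to `p = 2**k` are my own illustration, not code from the repo): with `p` a power of two, each `x**p` can be computed by repeated squaring, leaving a single `pow` call for the outer `1/p` root.

```python
import math

def lp_norm_avg_pow2(xs, k):
    """Lp norm average with p = 2**k, assuming all xs lie in [0, 1].

    Each x**p is obtained by squaring x k times, so the only pow call
    in the whole computation is the outer (1/p)-th root.
    """
    p = 2 ** k
    total = 0.0
    for x in xs:
        y = x
        for _ in range(k):
            y *= y              # after step j, y == x ** (2 ** (j + 1))
        total += y
    return math.pow(total / len(xs), 1.0 / p)
```

With `k = 0` (`p = 1`) this is the plain mean; as `k` grows it approaches the max of the inputs.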
@AlexisOlson did you ever follow up on that idea? It seems to be very close to the Power-UCT of https://arxiv.org/pdf/1911.00384.pdf, which accelerates the convergence to "newly discovered evals" a little bit, but adds complication. Also, a continuous recalculation of Q values is one of the major slowdowns in the current implementation of #963 where it is side-stepped by doing the full recalculation only every N visits to a node. Definitely an interesting suggestion, as the "correct" calculation of evals is one of the 2 big shortcomings of the PUCT algorithm, the other being the long-term dependency on the policies discussed in #1231 |
@Naphthalin When I was thinking about this further, I realized I was missing half of the min-max idea and just transitioning to max with what I was describing. I'll have to read the paper you linked for a more proper idea. |
IMO don't waste your time with that paper, as the idea is pretty unsound and still doesn't address our core problem(s). I just wanted to document that at least there is a paper carrying out a similar idea :) |
Just linking this issue to #1734 so we find it again. Once we have the Q recalculation loop as a separate class/template, things like this could easily be tried.
For node `a` with children `b_1, b_2, ..., b_n` with evals `q_1, q_2, ..., q_n`, let `b_max` be the best child with eval `q_max`. Instead of averaging the `q_i` to update the parent, use an Lp norm average, where

`LpNormAvg({x_i}) = ( 1/n * Sum_i (x_i)^p )^(1/p)`

For `p = 1`, this is identical to the mean, and as `p -> infinity`, `LpNormAvg({x_i}) -> Max({x_i})`.

Since `Q` is in `[-1, 1]`, we want to shift to a non-negative interval before taking an Lp average. Let's also rescale by `q_max` so that all shifted and scaled `q_i` are in `[0, 1]` and values don't blow up for large `p`. Define

`t_i = (q_i + 1) / (q_max + 1)`

where `t_i` is in `[0, 1]` as explained above.

Then the final average value to use for the parent node would be

`LpNormAvg({t_i}) * (q_max + 1) - 1`

to rescale and shift back to the `[-1, 1]` range.

The `p` to use for the Lp norm would be an increasing function of visits to node `a`. Simply `p = visits` is likely too fast of a transition.
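The scheme above can be sketched as follows. Function names are hypothetical, and the `p = 1 + visits/100` schedule is the one suggested earlier in the thread, not a tuned value.

```python
def lp_norm_avg(xs, p):
    """Lp norm average (mean of x**p, then 1/p root) for xs in [0, 1]."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def soft_backup(qs, visits):
    """Backup value for a parent node given child evals qs in [-1, 1].

    Shift and rescale by q_max so the inputs to the Lp average lie in
    [0, 1], take the Lp norm average, then undo the transform. p = 1
    recovers the plain mean; p -> infinity recovers max(qs).
    """
    q_max = max(qs)
    if q_max <= -1.0:           # all children at the losing bound
        return -1.0
    ts = [(q + 1.0) / (q_max + 1.0) for q in qs]   # t_i in [0, 1]
    p = 1.0 + visits / 100.0    # schedule suggested in the thread
    return lp_norm_avg(ts, p) * (q_max + 1.0) - 1.0
```

At zero visits (`p = 1`) this reproduces the ordinary average of the child Q values, and with many visits the result approaches `q_max` from below.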