Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Soft transition from Average to Max backup operator #940

Open
AlexisOlson opened this issue Aug 29, 2019 · 9 comments
Open

Soft transition from Average to Max backup operator #940

AlexisOlson opened this issue Aug 29, 2019 · 9 comments

Comments

@AlexisOlson
Copy link
Contributor

For node a with chilren b_1,b_2,...,b_n with evals q_1,q_2,...q_n,
let b_max be the best child with eval q_max.

Instead of averaging the q_i to update the parent use an Lp norm average where
LpNormAvg({x_i}) = ( 1/n * Sum_i (x_i)^p )^(1/p)

For p = 1, this is the identical to the mean and as p -> infinity, LpNormAvg({x_i}) -> Max({x_i}).

Since Q is in [-1,1], we want to shift to a non-negative interval before taking an Lp average. Let's also rescale by q_max so that all shifted and scaled q_i are in [0,1] and values don't blow up for large p.

Define t_i = (q_i + 1) / (q_max + 1) where t_i is in [0,1] as explained above.
Then the final average value to use for the parent node would be
LpNormAvg({t_i}) * (q_max + 1) - 1
to rescale and shift back to the [-1,1] range.

The p to use for the Lp norm would be an increasing function of visits to node a. Simply p = visits is likely too fast of a transition.

@AlexisOlson
Copy link
Contributor Author

This is the part of the code where the average happens.

void Node::FinalizeScoreUpdate(float v, float d, int multivisit) {

@oscardssmith
Copy link
Contributor

1+n/100 is about right based on some playing around with numbers.

@AlexisOlson
Copy link
Contributor Author

AlexisOlson commented Aug 29, 2019

On Discord, @fersbery (@DanielUranga) pointed out that this idea is very similar to this previous PR (though the transition implementation is not): #243

@AlexisOlson
Copy link
Contributor Author

@Naphthalin Do you happen to know if there is a cheap way to compute or approximate Lp norms?

@oscardssmith
Copy link
Contributor

if we only use powers of 2 for p, which shouldn't cause problems, it's not too expensive, only 1 pow call.

@Naphthalin
Copy link
Contributor

@AlexisOlson did you ever follow up on that idea? It seems to be very close to the Power-UCT of https://arxiv.org/pdf/1911.00384.pdf, which accelerates the convergence to "newly discovered evals" a little bit, but adds complication.

Also, a continuous recalculation of Q values is one of the major slowdowns in the current implementation of #963 where it is side-stepped by doing the full recalculation only every N visits to a node. Definitely an interesting suggestion, as the "correct" calculation of evals is one of the 2 big shortcomings of the PUCT algorithm, the other being the long-term dependency on the policies discussed in #1231

@AlexisOlson
Copy link
Contributor Author

@Naphthalin When I was thinking about this further, I realized I was missing half of the min-max idea and just transitioning to max with what I was describing. I'll have to read the paper you linked for a more proper idea.

@Naphthalin
Copy link
Contributor

IMO don't waste your time with that paper, as the idea is pretty unsound and still doesn't address our core problem(s). I just wanted to document that at least there is a paper carrying out a similar idea :)

@Naphthalin
Copy link
Contributor

Just linking this issue to #1734 so we find it again. Once we have the Q recalculation loop as a separate class/template, things like his could easily be tried.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants