Backpropagation: Q Moving Average
This is a variation on MA (Gudmundsson and Björnsson 2011), on BRUE(alpha) by Feldman and Domshlak, and partly on the TO-MAST strategies (see Finnsson and Björnsson 2008). While TO-MAST mainly uses MA for policy choices, Gudmundsson and Björnsson as well as Feldman and Domshlak apply it to the propagation of Q. MCTS in its Leela flavour often takes very long to revise initial optimistic (or pessimistic) evaluations that draw visits into suboptimal parts of the tree. The main idea is to make the lc0 update of Q more adaptive, i.e. give it less "memory". I tried:
A decay-power function:

```cpp
q_ += (v - q_) / (std::pow((float)n_, gamma) + 1);
n_++;
```
A linear decay function:

```cpp
q_ += (v - q_) / ((float)n_ * beta + 1);
n_++;
```
| Match | power gamma=0.75 | linear beta=0.49 |
|---|---|---|
| 10000 games / 100 visits | Elo: 24.60, LOS: 100.00% | Elo: 45.04, LOS: 100.00% |
| 1000 games / 800 visits | Elo: -4.52, LOS: 25.04% | Elo: 4.17, LOS: 72.38% |
| 500 games / 10000 visits | not planned | Elo: -11.12, LOS: 4.94% |
It is puzzling that both schemes do so well at 100 visits but get relatively worse at larger visit counts. Static approaches like BRUE(alpha) also did not make a real difference. One would expect a large-visit search tree to still have many low-visit nodes near the leaves, so the scheme should help there too.
Maybe there is an initial bias that is somehow correlated with the order of node visits and/or depth. It might also stem from a mismatch between the exploration and evaluation terms. There is definitely something here worth studying, but it needs more serious analytical work. I will revisit this later and look at some limited and/or dynamic methods.
Individual matches in detail:
Gamma=0.75 visits=100 games=10000:
P1: +3138 -2431 =4431 Win: 53.54% Elo: 24.60 LOS: 100.00% P1-W: +1805 -1143 =2052 P1-B: +1333 -1288 =2379
Gamma=0.75 visits=800 games=1000:
P1: +180 -193 =627 Win: 49.35% Elo: -4.52 LOS: 25.04% P1-W: +95 -102 =303 P1-B: +85 -91 =324
Gamma=0.955 visits=800 games=1000:
P1: +182 -218 =600 Win: 48.20% Elo: -12.51 LOS: 3.59% P1-W: +104 -111 =285 P1-B: +78 -107 =315
Beta=0.49 visits=100 games=10000:
P1: +3775 -2486 =3739 Win: 56.45% Elo: 45.04 LOS: 100.00% P1-W: +1954 -1046 =2000 P1-B: +1821 -1440 =1739
Beta=0.49 visits=800 games=1000:
P1: +210 -198 =592 Win: 50.60% Elo: 4.17 LOS: 72.38% P1-W: +122 -98 =280 P1-B: +88 -100 =312
Beta=0.49 visits=10000 games=500:
P1: +39 -55 =406 Win: 48.40% Elo: -11.12 LOS: 4.94% P1-W: +31 -15 =203 P1-B: +8 -40 =203