tensor-accum-0.17/dev+/uniform questions/discussions #87

Open
Umsturz opened this issue Dec 1, 2018 · 15 comments

Comments


Umsturz commented Dec 1, 2018

Hi, I tried to build your fastexit-tensor-accum+ branch on Ubuntu 16.04, following the steps in the readme (copied below). The build fails with the following errors. Any idea how to fix this?

cmake --build .
[ 3%] Built target gtest
[ 7%] Built target gtest_main
[ 9%] Building CXX object CMakeFiles/objs.dir/src/UCTSearch.cpp.o
lz/src/UCTSearch.cpp:268:45: warning: unused parameter ‘thread_num’ [-Wunused-parameter]
int thread_num) {
^
lz/src/UCTSearch.cpp: In member function ‘int UCTSearch::think(int, UCTSearch::passflag_t)’:
lz/src/UCTSearch.cpp:860:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
backup_queue = {};
^
lz/src/UCTSearch.cpp: In member function ‘void UCTSearch::ponder()’:
lz/src/UCTSearch.cpp:944:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
backup_queue = {};
^
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-mismatched-tags’
cc1plus: warning: unrecognized command line option ‘-Wno-ignored-attributes’
CMakeFiles/objs.dir/build.make:254: recipe for target 'CMakeFiles/objs.dir/src/UCTSearch.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/UCTSearch.cpp.o] Error 1
CMakeFiles/Makefile2:143: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

Build instructions from the readme:

sudo apt install clinfo && clinfo

git clone https://github.com/gcp/leela-zero
cd leela-zero
git submodule update --init --recursive

sudo apt install libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev

mkdir build && cd build
cmake ..
cmake --build .
./tests
curl -O https://zero.sjeng.org/best-network
./leelaz --weights best-network


alreadydone commented Dec 1, 2018

Yeah, some compilers can't deal with this. I suggest changing backup_queue = {}; to
while (!backup_queue.empty()) { backup_queue.pop(); }
in both think() and ponder(). I'm not sure whether this is any less efficient.
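For reference, a minimal standalone C++ sketch of two ways to clear the queue that sidestep the explicit std::queue(deque&&) constructor picked by "= {}"; the element type here is only a stand-in, since the real template argument is stripped from the error output above:

#include <memory>
#include <queue>

int main() {
    // Stand-in for backup_queue; the branch stores std::unique_ptr elements
    // to some node/backup type (exact type not shown in the log above).
    std::queue<std::unique_ptr<int>> backup_queue;
    backup_queue.push(std::make_unique<int>(42));

    // Workaround suggested above: pop until empty.
    while (!backup_queue.empty()) { backup_queue.pop(); }

    // Equivalent alternative: move-assign a freshly default-constructed queue,
    // which avoids copy-list-initialization selecting an explicit constructor.
    backup_queue = decltype(backup_queue)();
}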


Umsturz commented Dec 2, 2018

May I ask what compiler you are using? I now tried gcc 5.4 and clang 3.8. Even after changing backup_queue = {}; as suggested, there is a new error with both compilers.

In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:77:22: error: implicit instantiation of undefined template
'std::atomic'
std::atomic m_occupied{0};
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template
is declared here
struct atomic;
^
In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:78:22: error: implicit instantiation of undefined template
'std::atomic'
std::atomic idle_count{0};
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template
is declared here
struct atomic;
^

@alreadydone

I think the error indicates you need #include <atomic> in OpenCL.h. People have compiled successfully on Ubuntu before; gcc 8.1.0 seems to work.
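A minimal sketch of what the fix looks like in a header; the class name and member types below are assumptions for illustration only, since the issue renderer stripped the template arguments from the error output:

// OpenCL.h (sketch, not the actual file)
#include <atomic>   // provides the std::atomic<T> primary template

class ThreadData {                    // hypothetical enclosing class
public:
    std::atomic<int> m_occupied{0};   // assumed element type
    std::atomic<int> idle_count{0};   // assumed element type
};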


Umsturz commented Dec 5, 2018

Thank you, it works with that #include added. Unfortunately, with multiple GPUs I don't see an improvement in n/s.


alreadydone commented Dec 5, 2018

Thank you for testing! There definitely remains work to be done. Can you tell me what GPUs you have, which other branches (gcp/next, ihavnoid/batch-full, ihavnoid/tensorcore, or something else?) you are comparing my branch with, and what parameters (--batchsize, -t) you are using in each case?


alreadydone commented Feb 26, 2019

You may now try https://github.com/alreadydone/lz/tree/tensor-accum-dev+.
Tested on Google Cloud:
15270 pos/s with 4xV100, 256x19 net, and command
./leelaz -w ../../990.gz --batchsize 12 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --benchmark -v 200000 --worker 4

38865 n/s, 27054 pos/s with 8xV100, 256x19 net, and command
./leelaz --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32 --benchmark -v 200000 -w ../../990.gz

(both with 24 vCPUs)

You can specify --batchsize and --worker separately for each GPU, e.g. for two GPUs (--gpu 0 --gpu 1) you can add --batchsize 12 --batchsize 16 --worker 3 --worker 2, etc. The -t parameter has no effect with this branch; the number of threads is simply the sum of worker threads over all GPUs.


Umsturz commented Feb 26, 2019 via email

@alreadydone

A bug has been fixed in the tensor-accum-dev+ approach.

An experimental branch that gradually pushes the policy towards uniform as visits increase, in order to widen the search and help find blind spots, is https://github.com/alreadydone/lz/tree/tensor-accum-uniform (based on tensor-accum-dev+).
Two parameters are added: when a position's visit count reaches the value of --uniform-visits (default 1,000,000), all moves are considered equally in terms of policy. Below that value, the policy gradually drifts towards uniform as visits accrue. The parameter --exponent (default 1) controls how fast the policy drifts; exponent 0 means there is no gradual drift at all and the policy is uniform from the start.
To recover original behavior, set --uniform-visits to a very large number, and leave --exponent untouched.
This is inspired by some recent discussions, e.g. at LeelaChessZero/lc0#743
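A minimal C++ sketch of one way such a drift could work, purely to illustrate the description above; the blending formula, function name, and parameter handling are assumptions, not the branch's actual code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Blend the network policy towards a uniform distribution as the node's
// visit count approaches uniform_visits. With exponent 0, std::pow(r, 0) is 1,
// so the weight is 1 everywhere and the policy is uniform from the start;
// at or above uniform_visits the policy is fully uniform.
std::vector<float> drift_to_uniform(const std::vector<float>& policy,
                                    double visits, double uniform_visits,
                                    double exponent) {
    const double ratio = std::min(1.0, visits / uniform_visits);
    const double w = std::pow(ratio, exponent);        // blend weight in [0, 1]
    const float uniform = 1.0f / static_cast<float>(policy.size());
    std::vector<float> out(policy.size());
    for (std::size_t i = 0; i < policy.size(); ++i) {
        out[i] = static_cast<float>((1.0 - w) * policy[i] + w * uniform);
    }
    return out;
}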

@Ishinoshita

@alreadydone That's really nice! Progressive squashing is even better than any of my fix formulas...
Just this morning I pushed 100k playouts on the empty board with the LZ200 net, on my old PC (CPU only), only to find, after a long while, that only 4-4 and 3-4 had received visits. Your fix will definitely help. Thank you. I will learn how to compile so that I can play with it.


Umsturz commented Mar 3, 2019

So I tried tensor-accum-uniform. There is no need for the extra #include anymore, right? I compiled it with the #include first, and even though it compiled, leelaz threw an error on startup. Without the #include it worked. Do I need something else?

For benchmarking I start leelaz and send "genmove B". I tried two different sets of parameters, without using --uniform-visits and --exponent:
A) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 8 --batchsize 64

B) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32

The first game with A) started with B playing tengen (K10). Quite interesting, to say the least.
The second game with A) also started with B playing tengen (K10), and White liked to play 5-4 first and then enclose the corner with 3-4.

The first and second games with B) looked normal, with the same opening the current nets like to play: all 4-4 points and a 6-3 approach, later a double approach.

With the #207 40x256 net I get, with A), ca. 25,000-27,000 n/s for the first genmove B.
With B) I get ca. 21,000-24,000 n/s for the first genmove B.

What confuses me a little bit is the GPU utilization. During the first genmove B, "nvidia-smi -l" shows the following utilization: 0%/14%/43%/0%/28%/13%/0%/44% (just one example, but I tested this a couple of times and only some GPUs are utilized while others stay at 0%; maybe because of bad timing by nvidia-smi at the beginning).
After issuing the following commands and waiting until they finished: genmove B, genmove W, genmove B, genmove W, genmove B, the utilization for all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?

Sometimes when exiting leelaz with "exit" it throws a segmentation fault (core dumped).

All in all it looks very promising (a 1.4x improvement), but tengen makes me a bit skeptical ;)


alreadydone commented Mar 4, 2019

  • The uniform branch defaults --uniform-visits to 1,000,000. If you want to recover original search behavior, use for example --uniform-visits 10000000000000.
  • I haven't observed tengen being played on the first move, and it's definitely strange to see #207 play it. It's probably caused by the combination of a large number of concurrent threads (which makes the search very wide at every level of the tree) and the uniformization of the policy. With --uniform-visits as above, maybe the engine won't play tengen even with 64x8 threads.
  • Look at pos/s to benchmark performance instead of n/s. pos/s is the number of positions actually processed by the GPUs, while n/s also counts positions served from the cache and obtained from symmetry.
  • It's not recommended to use the empty board to test performance. The n/s value will be boosted because 8-fold symmetry yields 700% free playouts. The pos/s value, on the other hand, will be dragged down because the search can't find enough unevaluated positions to feed the GPUs, and there's probably contention when accessing NNCache, which is mutex-protected. That's probably why GPU utilization is low at the first move but full after four moves. However, some GPUs' utilization staying at 0% still surprises me.
    Instead, use --benchmark, which uses an asymmetric position three moves into the game, or load an sgf into the midgame and genmove from there. In general, higher batchsize and worker values lead to higher pos/s, but once you are able to saturate the GPUs or achieve maximum pos/s in such normal positions, it's not recommended to increase --batchsize and --worker further. Of your A) and B), --worker 3 --batchsize 32 is the more reasonable one, though I think the batchsize can be decreased further.

After issuing the following commands ... the utilization for all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?

  • Yes, these branches will keep pondering if --noponder is not set, unless you issue the command stop or name (just hitting a key won't stop it). However, some people told me that --noponder doesn't work, and I have yet to confirm this bug.

  • All threads should be joined when exiting, so a segmentation fault is unexpected.

  • Thanks for testing; I'll keep an eye on the identified issues when I test.

  • Added notice 4/29/2019: the "fractional backup" feature causes the displayed visits to be lower than the playouts (much lower if the batchsize or the number of GPUs is large); it can be disabled with --disable-frac-backup.

alreadydone changed the title "fastexit-tensor-accum+ fails to build" → "tensor-accum-dev+/uniform questions" on Mar 4, 2019
alreadydone changed the title "tensor-accum-dev+/uniform questions" → "tensor-accum-dev+/uniform questions/discussions" on Mar 4, 2019

Umsturz commented Apr 6, 2019

I experimented a little more. It seems that the uniform branch really does find some moves that normal Leela (0.16) does not find. But it still takes quite some time before the optimal move is really considered and further investigated. I don't know the specifics, but the recent discussion about LCB makes me wonder whether LCB + uniform would improve performance even more. Could LCB be easily combined with uniform? Or maybe you already did that...?


alreadydone commented Apr 16, 2019

Just pushed https://github.com/alreadydone/lz/tree/tensor-accum-uniform-0.17
https://github.com/alreadydone/lz/tree/tensor-accum-0.17 was pushed a few days ago.
Both branches have the official 0.17 release merged in, including LCB.


Umsturz commented Apr 20, 2019

Thank you for the update. The new version with 0.17 seems to have some problem: GPU utilization is now always only around 30-40%, whereas before it was 80-99%. I used --worker 3 --batchsize 32 and also tried lower batch sizes, but GPU utilization never goes above ~30%. Do I have to adjust the parameters for 0.17?

@alreadydone

@Umsturz Thanks for the report. The problem is now fixed. In the earlier version, the engine didn't read the batch size from the command line and always set it to 1, due to a glitch in the merge.

alreadydone changed the title "tensor-accum-dev+/uniform questions/discussions" → "tensor-accum-0.17/dev+/uniform questions/discussions" on Apr 29, 2019