tensor-accum-0.17/dev+/uniform questions/discussions #87
Comments
Yeah, some compilers can't deal with this. I suggest changing
May I ask what compiler you are using? I tried with gcc 5.4 and clang 3.8. Even after changing
I think the error indicates you need `#include <atomic_base>` in OpenCL.h. People have compiled successfully on Ubuntu before; gcc 8.1.0 seems to be working.
Thank you, it works with `#include <atomic_base>`. Unfortunately, with multiple GPUs I don't see an improvement in n/s.
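For reference, a minimal sketch of the include fix being discussed; the placement shown is illustrative, not the actual layout of OpenCL.h. The portable standard header for std::atomic is `<atomic>`, while `<atomic_base>` is the spelling reported to work in this thread on some libstdc++ setups.

```cpp
// Illustrative placement only: add the atomic header alongside the other
// standard includes near the top of OpenCL.h.
#include <atomic>          // portable header for std::atomic / std::atomic_flag
// #include <atomic_base>  // spelling reported to work in this thread
```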
Thank you for testing! There definitely remains work to be done. Can you tell me what GPUs you have, what other branches (gcp/next, ihavnoid/batch-full, ihavnoid/tensorcore, or others?) you are comparing my branch with, and what parameters (--batchsize, -t) you are using in each case?
You may now try https://github.com/alreadydone/lz/tree/tensor-accum-dev+.
Tested on Google Cloud (both with 24 vCPUs):
15270 pos/s with 4xV100, 256x19 net, and command
./leelaz -w ../../990.gz --batchsize 12 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --benchmark -v 200000 --worker 4
38865 n/s, 27054 pos/s with 8xV100, 256x19 net, and command
./leelaz --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32 --benchmark -v 200000 -w ../../990.gz
You can specify --batchsize and --worker separately for each GPU, e.g. for two GPUs (--gpu 0 --gpu 1) you can add --batchsize 12 --batchsize 16 --worker 3 --worker 2, etc. The -t parameter has no effect with this branch; the number of threads is simply the sum of worker threads over all GPUs.
Looks very promising! I will look into it over the weekend.
By the way, with so many readouts, is there a way to increase exploration?
A bug has been fixed in the tensor-accum-dev+ approach. An experimental branch that gradually pushes the policy towards uniform as visits increase, to widen the search and help find blind spots, is https://github.com/alreadydone/lz/tree/tensor-accum-uniform (based on tensor-accum-dev+).
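As a rough illustration of the idea (not the actual code in tensor-accum-uniform), one way to push the policy towards uniform is to blend the network prior with a uniform distribution, using a blend weight that grows with the parent's visit count. The function name, the logistic-style schedule, and the uniform_visits/exponent parameters below are assumptions, loosely modeled on the --uniform-visits and --exponent options mentioned later in this thread.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Blend the network's policy prior with a uniform distribution. The uniform
// weight lambda rises from 0 towards 1 as the parent's visit count grows,
// reaching 0.5 when parent_visits == uniform_visits.
std::vector<float> blend_with_uniform(const std::vector<float>& policy,
                                      double parent_visits,
                                      double uniform_visits,
                                      double exponent) {
    const double visits = std::max(parent_visits, 1.0);
    const double ratio = std::pow(visits / uniform_visits, exponent);
    const double lambda = ratio / (1.0 + ratio);   // 0 = pure policy, 1 = pure uniform
    const float uniform = 1.0f / static_cast<float>(policy.size());
    std::vector<float> blended(policy.size());
    for (std::size_t i = 0; i < policy.size(); ++i) {
        blended[i] = static_cast<float>((1.0 - lambda) * policy[i] + lambda * uniform);
    }
    return blended;
}
```

With a schedule like this, the prior dominates early in the search, and low-prior moves only regain probability mass once the node has been visited heavily, which is what widens the search late and helps surface blind spots.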
@alreadydone That's really nice! Progressive squashing is even better than any of my fix formulas...
So I tried tensor-accum-uniform. There is no need for the `#include <atomic_base>` anymore, right? I compiled with it first, and even though it compiled, leelaz threw some error on startup; without the #include it worked. Do I need something else?

For benchmarking I start leelaz and send "genmove B". I tried two different sets of parameters, without using --uniform-visits and --exponent:
B) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32

The first game with A) started with B playing Tengen (K10). Quite interesting, to say the least. The first and second games with B) looked normal: the same opening the current nets like to play, all 4-4 points with a 6-3 approach and later a double approach. With leela#207 40x256 I get ca. 25000-27000 n/s with A) for the first genmove B.

What confuses me a little is the GPU utilization. During the first genmove B, "nvidia-smi -l" shows utilization like 0%/14%/43%/0%/28%/13%/0%/44% (just one example, but I tested this a couple of times and only some GPUs are utilized while others stay at 0%; maybe because of bad timing at the beginning by nvidia-smi). Sometimes when exiting leelaz with "exit" it throws a segmentation fault (core dumped). All in all it looks very promising (a 1.4x improvement), but the Tengen makes me a bit skeptical ; )
I experimented a little more. It seems the uniform branch really does find some moves that normal Leela (0.16) does not find, but it still takes quite some time before the optimal move is really considered and further investigated. I do not know the specifics, but the recent discussion about LCB makes me wonder whether LCB plus uniform would improve performance even more. Could LCB easily be combined with uniform? Or maybe you already did...?
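For context on the question, here is a minimal sketch of LCB-style final move selection as discussed in the Leela Zero community (not taken from any of these branches): the played move is the child whose lower confidence bound on the mean winrate is highest, so lightly visited, noisy candidates are penalized relative to well-explored ones. The struct layout, z-value, and minimum-visit guard are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct ChildStats {
    double mean_eval;  // average winrate observed through this child
    double eval_var;   // variance of those winrate samples
    int visits;
};

// Pick the child with the highest lower confidence bound on its mean winrate.
int select_by_lcb(const std::vector<ChildStats>& children, double z = 2.576) {
    int best = -1;
    double best_lcb = -1e9;
    for (std::size_t i = 0; i < children.size(); ++i) {
        const ChildStats& c = children[i];
        if (c.visits < 2) continue;  // need a couple of samples for a meaningful bound
        const double lcb = c.mean_eval - z * std::sqrt(c.eval_var / c.visits);
        if (lcb > best_lcb) {
            best_lcb = lcb;
            best = static_cast<int>(i);
        }
    }
    return best;  // -1 if no child has enough visits yet
}
```

In principle this selection rule is independent of how the policy prior is shaped during search, so combining it with the uniform blending is mostly a matter of using both in the same build.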
Just pushed https://github.com/alreadydone/lz/tree/tensor-accum-uniform-0.17 |
Thank you for the update. The new version with 0.17 seems to have some problem, because GPU utilization is now only around 30-40% all the time; before, GPU utilization was 80-99%. I used --worker 3 --batchsize 32 and also tried lower batch sizes, but GPU utilization never goes higher than ~30%. Do I have to adjust the parameters for 0.17?
@Umsturz Thanks for the report. The problem is now fixed. In the earlier version, the engine didn't read the batch size from the command line and always set it to 1, due to a glitch in merging.
Hi, I tried to build your fastexit-tensor-accum+ branch on Ubuntu 16.04, following the steps in the README (copied below, after the error output), but the build fails with the following errors. Any idea how to fix this?
cmake --build .
[ 3%] Built target gtest
[ 7%] Built target gtest_main
[ 9%] Building CXX object CMakeFiles/objs.dir/src/UCTSearch.cpp.o
lz/src/UCTSearch.cpp:268:45: warning: unused parameter ‘thread_num’ [-Wunused-parameter]
int thread_num) {
^
lz/src/UCTSearch.cpp: In member function ‘int UCTSearch::think(int, UCTSearch::passflag_t)’:
lz/src/UCTSearch.cpp:860:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
backup_queue = {};
^
lz/src/UCTSearch.cpp: In member function ‘void UCTSearch::ponder()’:
lz/src/UCTSearch.cpp:944:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
backup_queue = {};
^
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-mismatched-tags’
cc1plus: warning: unrecognized command line option ‘-Wno-ignored-attributes’
CMakeFiles/objs.dir/build.make:254: recipe for target 'CMakeFiles/objs.dir/src/UCTSearch.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/UCTSearch.cpp.o] Error 1
CMakeFiles/Makefile2:143: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
Build instructions from the readme:
sudo apt install clinfo && clinfo
git clone https://github.com/gcp/leela-zero
cd leela-zero
git submodule update --init --recursive
sudo apt install libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
mkdir build && cd build
cmake ..
cmake --build .
./tests
curl -O https://zero.sjeng.org/best-network
./leelaz --weights best-network
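Regarding the `backup_queue = {};` errors in the log above, here is a minimal sketch of one possible local workaround (an assumption, not the branch author's fix): GCC 5.4's libstdc++ rejects assigning `{}` to a std::queue because that would go through the explicit constructor taking the underlying container, so assigning or swapping in an explicitly constructed empty queue of the same type avoids the error. BackupData below is a placeholder for whatever the queue actually holds.

```cpp
#include <memory>
#include <queue>

struct BackupData {};  // stand-in for the real element type of backup_queue

int main() {
    std::queue<std::unique_ptr<BackupData>> backup_queue;
    backup_queue.push(std::make_unique<BackupData>());

    // Instead of:  backup_queue = {};   (rejected by GCC 5.4's libstdc++)
    backup_queue = decltype(backup_queue)();      // assign a freshly constructed empty queue
    // or, equivalently:
    // decltype(backup_queue)().swap(backup_queue);

    return backup_queue.empty() ? 0 : 1;
}
```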