Rescore tb #48

Open · wants to merge 106 commits into base: rescore_tb
Conversation

Naphthalin
Owner

update branch

Tilps and others added 30 commits January 31, 2021 09:25
3% increase in nps on benchmark.
100% increase in nps in #1 positions.
…o#1503)

* Early exit gathering if all collisions and backend is idle (sketched below).

* Fix logic.

* Even more aggressive.

* Don't enable if threads=1.

* Parameterise the behavior.
* Actively split pick tasks as early as possible.

Should reduce picking latency.

* Formatting.
* HashKeyedCache

Has less pointer chasing than the previous version.

* Fix one critical bug and one tiny bug.

The first item in each run after an erase/swap_to_erased was incorrectly *always* considered to be InRange, so unless the run continued past that point it was not swapped into the correct position, and neither was anything else. As a result, the item could no longer be found in the cache.
Luckily the Evict code is overzealous and will always remove the item eventually (at a large CPU cost) to avoid flooding the cache, but inability to find the item in the cache normally means unpin doesn't work, so such items all go into an ever-growing evict list, which progressively makes unpin slower and slower.
Also, a return path was missing when removing a pin from the evict list that wasn't the last pin, so it would pointlessly search the main list, fail, and then trigger the debug assert (if debug asserts were enabled in the build...).
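
To make the InRange problem concrete, here is a minimal, illustrative sketch of backward-shift deletion in an open-addressed table; it is not the actual HashKeyedCache code and all names are invented:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: a linear-probing table where Erase() must decide, for
// each following item in the probe run, whether the hole lies on that item's
// probe path ("in range"). Treating the first item as always in range (the
// bug described above) moves the wrong item and leaves later items unreachable.
struct Slot {
  uint64_t key = 0;
  bool occupied = false;
};

class ProbingTable {
 public:
  explicit ProbingTable(size_t capacity) : slots_(capacity) {}

  void Erase(size_t hole) {
    slots_[hole].occupied = false;
    for (size_t i = (hole + 1) % slots_.size(); slots_[i].occupied;
         i = (i + 1) % slots_.size()) {
      const size_t home = slots_[i].key % slots_.size();
      if (InRange(home, hole, i)) {  // the hole lies between home and i
        slots_[hole] = slots_[i];
        slots_[i].occupied = false;
        hole = i;
      }
    }
  }

 private:
  static bool InRange(size_t home, size_t hole, size_t cur) {
    return (cur >= home) ? (hole >= home && hole <= cur)
                         : (hole >= home || hole <= cur);  // wrap-around
  }
  std::vector<Slot> slots_;
};
```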

* Fix some comments.

* Code review and a bug fix.
Also do the softmax after unlocking the GPU.
…sZero#1526)

* Apply drift correction to q and d values in training data.

* Increase eps based on testing.

* Switch to CERR.

* Update allowed_eps based on testing with a non-broken backend...

Also include actual detail in the log messages.
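
A rough sketch of what such a drift correction might look like; this is illustrative only, and the function name and kAllowedEps value are assumptions, not the real rescorer code:

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>

// Illustrative only: q (expected outcome, in [-1, 1]) and d (draw
// probability, in [0, 1]) can drift slightly out of range through repeated
// float conversions. Drift below an allowed epsilon is silently clamped;
// anything larger is logged to CERR with the offending values.
constexpr float kAllowedEps = 1e-5f;  // assumed; the real value was tuned by testing

void CorrectQDDrift(float& q, float& d) {
  const float q_err = std::max(0.0f, std::fabs(q) - 1.0f);
  const float d_err = std::max({0.0f, -d, d - 1.0f});
  if (q_err > kAllowedEps || d_err > kAllowedEps) {
    std::cerr << "Unexpectedly large q/d drift: q=" << q << " d=" << d << "\n";
  }
  q = std::clamp(q, -1.0f, 1.0f);
  d = std::clamp(d, 0.0f, 1.0f);
}
```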
* Don't divide by zero after the instamove.
(which makes all remaining moves instamoves)

* Float literal is nicer.
* Clamp smooth to ensure move_overhead is respected.

* Fix for circleci compilation.
borg323 and others added 30 commits August 25, 2021 16:11
Forgotten during the 0.28 release.
LeelaChessZero#1661)

 * Use a special kernel for handling > 384 filter res block fusion that allocates board[8][8] array in shared memory instead of registers.
 * This works only on Ampere GPUs because the kernel needs more than the 64KB of shared memory available on Turing.
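
A sketch of the shared-memory approach and the opt-in it requires; the kernel and launch names are illustrative, not the real lc0 ones:

```cuda
#include <cuda_fp16.h>

// Illustrative only: for large filter counts the per-block 8x8 board data is
// kept in dynamically allocated shared memory rather than per-thread
// registers. Because the allocation exceeds the default 64KB limit, it must
// be opted into explicitly, which is why this path needs an Ampere-class GPU.
__global__ void ResBlockFusedLargeFilters(const half* in, half* out) {
  extern __shared__ half board[];  // [channels_per_block][8][8]
  // ... load inputs, apply the convolution and skip connection, write out ...
}

void LaunchLargeFilterResBlock(dim3 grid, dim3 block, size_t shared_bytes,
                               const half* in, half* out) {
  // Required whenever shared_bytes exceeds the default per-block limit.
  cudaFuncSetAttribute(ResBlockFusedLargeFilters,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(shared_bytes));
  ResBlockFusedLargeFilters<<<grid, block, shared_bytes>>>(in, out);
}
```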
* add some code for parsing attention policy weights

cuda backend changes TODO

* add WIP changes for attention policy networks

* Add support for computing PromotionLogits

 - still totally untested

* add support for attention policy map

- time to test now?

* minor fixes from Arcturai

 - fix build errors in cudnn backend.
 - update weights.encoder_pol() -> weights.encoder() to match new net.proto

* fix a couple of bugs

* In the Layer norm and Softmax kernels:
 - shared memory was not initialized.
 - the check for the first thread in a warp was incorrect.
* taking pointer to encoder object allocated on stack!

* fix transpose

Even the cuda fp16 backend uses NCHW layout these days, so the transpose is needed.

* fix bug in layer norm kernel

Don't use the same sum variable to compute the sum of differences.
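
An illustrative layer-norm reduction showing this fix together with the warp check mentioned a few commits above: per-warp partials are written by lane 0 of every warp (not just thread 0), and the variance pass gets its own accumulator. This is a generic sketch, not the actual lc0 kernel (scale/bias terms omitted):

```cuda
// Generic sketch: one block normalizes one row of length n. Dynamic shared
// memory must hold one float per warp; blockDim.x is assumed to be a
// multiple of 32 so the full-mask warp shuffles are valid.
__global__ void LayerNormRow(const float* x, float* y, int n) {
  extern __shared__ float warp_partials[];
  const float* row = x + blockIdx.x * n;
  const int num_warps = (blockDim.x + 31) / 32;

  // Pass 1: mean.
  float sum = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) sum += row[i];
  for (int o = 16; o > 0; o >>= 1) sum += __shfl_down_sync(0xffffffff, sum, o);
  if ((threadIdx.x & 31) == 0)   // lane 0 of *each* warp, not just thread 0
    warp_partials[threadIdx.x >> 5] = sum;
  __syncthreads();
  float mean = 0.0f;
  for (int w = 0; w < num_warps; ++w) mean += warp_partials[w];
  mean /= n;
  __syncthreads();  // warp_partials is reused below

  // Pass 2: variance, with its own accumulator (not the mean's sum variable).
  float var_sum = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    const float diff = row[i] - mean;
    var_sum += diff * diff;
  }
  for (int o = 16; o > 0; o >>= 1)
    var_sum += __shfl_down_sync(0xffffffff, var_sum, o);
  if ((threadIdx.x & 31) == 0) warp_partials[threadIdx.x >> 5] = var_sum;
  __syncthreads();
  float var = 0.0f;
  for (int w = 0; w < num_warps; ++w) var += warp_partials[w];
  var /= n;

  const float inv_std = rsqrtf(var + 1e-6f);
  for (int i = threadIdx.x; i < n; i += blockDim.x)
    y[blockIdx.x * n + i] = (row[i] - mean) * inv_std;
}
```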

* fix loading of weights for attention policy

another silly bug!

* fix concat bug

* fix bug introduced in code refactoring

the row major matrix multiply for conv1layer is special :-/

* fixes from Tilps

* remove hardcoding of encoder_head_count

* update net.proto

- adjust scratch/tensor memory computation to avoid corner cases.

* Add FP32 support

* move fp16 utils to common path

- and use that inside cuda backend.

* forgot to add the moved files to utils

* remove duplicate enum

* fix build failures

- at least try to...

* remove ip4_pol_b_

 - bias not used for the "ppo" layer

* fix comment

* fix tensor/scratch size calculation

* remove debug code/fix output size
…sZero#1675)

* Workaround (WAR) for cublas issue with version 11+

- set CUBLAS_PEDANTIC_MATH to avoid cublas making use of tensor math on TU11x GPUs.
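
For reference, applying the workaround amounts to a one-line cuBLAS call; detecting the affected GPU/library combination is assumed to happen elsewhere and is hedged here as a boolean:

```cpp
#include <cublas_v2.h>

// Minimal sketch: force pedantic math so cuBLAS will not pick tensor-op
// paths on the affected GPUs. Detecting "cuBLAS 11+ on a TU11x GPU" is
// assumed to happen in the backend's init code and is passed in as a flag.
void ApplyCublasTensorMathWorkaround(cublasHandle_t handle, bool affected_gpu) {
  if (affected_gpu) cublasSetMathMode(handle, CUBLAS_PEDANTIC_MATH);
}
```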

* Update cuda_common.h

- as per borg's suggestion
* Fork of the active attention PR for me to mess with.

* Some 'fixes'?

* Some more fixes merged in or otherwise.

* Cleaning up.

* Update lczero-common

* Cleanups and adding mish net support - almost.

* more cudnn backend support cases, maybe...

* Fused supports act NONE - don't explode.

* Cleanup to let diff apply cleaner.

* Another cleanup.

* Merge fp32 from ankan.

* Merge another diff.

* Another merge.

* added checks (#19)

* added checks

* warning fixes

Co-authored-by: borg323 <[email protected]>

* Auto format common_kernels.cu

* forgot to hit save.

* autoformat fp16_kernels.cu

* auto format kernels.h

* autoformat layers.cc

* autoformat layers.h

* autoformat network_cuda.cc

* autoformat network_cudnn.cc

* Autoformat network_legacy.cc

* autoformat winograd_helper.inc

* cudnn backend does support mish.

Co-authored-by: borg323 <[email protected]>
Co-authored-by: borg323 <[email protected]>
* misc changes to cudnn backend

- replace all cudaMemcpyAsync used for loading weights with cudaMemcpy, as the source (in CPU memory) could be deleted before the async version of the function actually does the copy; see the sketch after this list.
- minor naming/style changes.
- add comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
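
A minimal sketch of the synchronous-copy point; the function and variable names are illustrative, not the actual backend code:

```cpp
#include <vector>
#include <cuda_runtime.h>

// Illustrative only: weights are uploaded with the blocking cudaMemcpy.
// An async copy could still be pending when the temporary host-side buffer
// is freed, which is exactly the hazard the change above removes.
void UploadWeights(float* device_weights, const std::vector<float>& host_weights) {
  cudaMemcpy(device_weights, host_weights.data(),
             host_weights.size() * sizeof(float), cudaMemcpyHostToDevice);
  // Safe: cudaMemcpy returns only after the copy has completed, so
  // host_weights may be destroyed as soon as this function returns.
}
```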

* fix typo in comment

* clang-format

* address review comment

* Add 320 and 352 channel support for fused SE layer

- just add template instantiations.
- verified that it works and provides a (very) slight speedup.

* Update fp16_kernels.cu

* Simpler kernel for res-block fusion without SE

 - use a constant block size of 64, also splitting the channel dimension into multiple blocks as needed.
 - This allows arbitrarily large filter counts without running out of register file.
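
The launch shape this describes, sketched with assumed names and an assumed channel-slice size:

```cuda
#include <cuda_runtime.h>

// Illustrative launch shape only (not the real kernel): a fixed 64-thread
// block, with the channel dimension split across grid.y so that any filter
// count fits without exhausting the register file.
constexpr int kBlockSize = 64;
constexpr int kChannelsPerBlock = 64;  // assumed slice size

dim3 ResBlockGrid(int batch, int channels) {
  const int channel_slices =
      (channels + kChannelsPerBlock - 1) / kChannelsPerBlock;
  return dim3(batch, channel_slices);
}
// Usage (hypothetical kernel name):
//   ResBlockNoSE<<<ResBlockGrid(batch, channels), dim3(kBlockSize)>>>(...);
```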

* minor refactoring

 - allow using the res-block fusion optimization for alternate layers (those that don't have SE) even on GPUs that don't have enough shared memory.

* minor functional fix

* a few more fixes to get correct output

hopefully functionally correct now.

* fix cudnn backend build

 - missed the fact that it also uses Res block fusion :-/

* fix build errors

* some more fixes

* minor cleanup

* remove --use_fast_math

- as it doesn't improve performance.
- some minor cleanup

* fix indentation
* attention opts

* minor fix from prev PR

* minor fixes

- use the activation as a template param for the addBiasBatched kernel; just a 2-4 microsecond improvement, and only on GA100.
 - fix build break.
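
The template-parameter idea, sketched below; addBiasBatched is the kernel named in the commit, but this signature and the Activation enum are simplified assumptions:

```cuda
// Simplified sketch: making the activation a compile-time parameter lets the
// compiler drop the per-element branches entirely for each instantiation,
// e.g. addBiasBatched<Activation::kMish><<<grid, block>>>(...).
enum class Activation { kNone, kRelu, kMish };

template <Activation act>
__global__ void addBiasBatched(float* out, const float* in, const float* bias,
                               int batch, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch * n) return;
  float v = in[i] + bias[i % n];                        // per-channel bias
  if (act == Activation::kRelu) v = fmaxf(v, 0.0f);
  if (act == Activation::kMish) v = v * tanhf(logf(1.0f + expf(v)));
  out[i] = v;
}
```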

* fix tensor size calc

* fix incorrect input to promotion logits kernel