Rescore tb #48

Open · wants to merge 106 commits into base: rescore_tb
Conversation

Naphthalin
Owner

update branch

Tilps and others added 30 commits January 31, 2021 09:25
3% increase in nps on benchmark.
100% increase in nps in #1 positions.
…o#1503)

* Early exit gathering if all collisions and backend is idle (sketched below).

* Fix logic.

* Even more aggressive.

* Don't enable if threads=1.

* Parameterise the behavior.
* Actively split pick tasks as early as possible.

Should reduce picking latency.

* Formatting.
* HashKeyedCache

Has less pointer chasing than the previous version.

* Fix one critical bug and one tiny bug.

The first item in each run after an erase/swap_to_erased was incorrectly *always* considered to be InRange, so unless the run continued past that point it was not swapped into the correct position, and neither was anything else. As a result, the item could no longer be found in the cache.
Luckily the Evict code is overzealous and will always remove the item eventually (at a large CPU cost) to avoid flooding the cache, but inability to find the item in the cache normally means unpin doesn't work, so such items all go into an ever-growing evict list, which progressively makes unpin slower and slower.
Also, a return path was missing when removing a pin from the evict list that wasn't the last pin, so it would pointlessly search the main list, fail, and then trigger the debug assert (if debug asserts were enabled in the build...).
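
To make the InRange problem concrete, here is a minimal, illustrative sketch of backward-shift deletion in an open-addressed table; it is not the actual HashKeyedCache code and all names are invented:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: a linear-probing table where Erase() must decide, for
// each following item in the probe run, whether the hole lies on that item's
// probe path ("in range"). Treating the first item as always in range (the
// bug described above) moves the wrong item and leaves later items unreachable.
struct Slot {
  uint64_t key = 0;
  bool occupied = false;
};

class ProbingTable {
 public:
  explicit ProbingTable(size_t capacity) : slots_(capacity) {}

  void Erase(size_t hole) {
    slots_[hole].occupied = false;
    for (size_t i = (hole + 1) % slots_.size(); slots_[i].occupied;
         i = (i + 1) % slots_.size()) {
      const size_t home = slots_[i].key % slots_.size();
      if (InRange(home, hole, i)) {  // the hole lies between home and i
        slots_[hole] = slots_[i];
        slots_[i].occupied = false;
        hole = i;
      }
    }
  }

 private:
  static bool InRange(size_t home, size_t hole, size_t cur) {
    return (cur >= home) ? (hole >= home && hole <= cur)
                         : (hole >= home || hole <= cur);  // wrap-around
  }
  std::vector<Slot> slots_;
};
```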

* Fix some comments.

* Code review and a bug fix.
Also do the softmax after unlocking the GPU.
…sZero#1526)

* Apply drift correction to q and d values in training data.

* Increase eps based on testing.

* Switch to CERR.

* Update allowed_eps based on testing with a non-broken backend...

Also include actual detail in the log messages.
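
A rough sketch of what such a drift correction might look like; this is illustrative only, and the function name and kAllowedEps value are assumptions, not the real rescorer code:

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>

// Illustrative only: q (expected outcome, in [-1, 1]) and d (draw
// probability, in [0, 1]) can drift slightly out of range through repeated
// float conversions. Drift below an allowed epsilon is silently clamped;
// anything larger is logged to CERR with the offending values.
constexpr float kAllowedEps = 1e-5f;  // assumed; the real value was tuned by testing

void CorrectQDDrift(float& q, float& d) {
  const float q_err = std::max(0.0f, std::fabs(q) - 1.0f);
  const float d_err = std::max({0.0f, -d, d - 1.0f});
  if (q_err > kAllowedEps || d_err > kAllowedEps) {
    std::cerr << "Unexpectedly large q/d drift: q=" << q << " d=" << d << "\n";
  }
  q = std::clamp(q, -1.0f, 1.0f);
  d = std::clamp(d, 0.0f, 1.0f);
}
```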
* Don't divide by zero after the instamove.
(which makes all remaining moves instamoves)

* Float literal is nicer.
* Clamp smooth to ensure move_overhead is respected.

* Fix for circleci compilation.
borg323 and others added 30 commits August 25, 2021 16:11
Forgotten during the 0.28 release.
LeelaChessZero#1661)

 * Use a special kernel for handling > 384 filter res block fusion that allocates board[8][8] array in shared memory instead of registers.
 * This works only on Ampere GPUs because the kernel needs more than the 64KB of shared memory available on Turing.
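
A sketch of the shared-memory approach and the opt-in it requires; the kernel and launch names are illustrative, not the real lc0 ones:

```cuda
#include <cuda_fp16.h>

// Illustrative only: for large filter counts the per-block 8x8 board data is
// kept in dynamically allocated shared memory rather than per-thread
// registers. Because the allocation exceeds the default 64KB limit, it must
// be opted into explicitly, which is why this path needs an Ampere-class GPU.
__global__ void ResBlockFusedLargeFilters(const half* in, half* out) {
  extern __shared__ half board[];  // [channels_per_block][8][8]
  // ... load inputs, apply the convolution and skip connection, write out ...
}

void LaunchLargeFilterResBlock(dim3 grid, dim3 block, size_t shared_bytes,
                               const half* in, half* out) {
  // Required whenever shared_bytes exceeds the default per-block limit.
  cudaFuncSetAttribute(ResBlockFusedLargeFilters,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(shared_bytes));
  ResBlockFusedLargeFilters<<<grid, block, shared_bytes>>>(in, out);
}
```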
* add some code for parsing attention policy weights

cuda backend changes TODO

* add WIP changes for attention policy networks

* Add support for computing PromotionLogits

 - still totally untested

* add support for attention policy map

- time to test now?

* minor fixes from Arcturai

 - fix build errors in cudnn backend.
 - update weights.encoder_pol() -> weights.encoder() to match new net.proto

* fix a couple of bugs

* In the Layer norm and Softmax kernels:
 - shared memory was not initialized.
 - the check for the first thread in a warp was incorrect.
* taking pointer to encoder object allocated on stack!

* fix transpose

Even the cuda fp16 backend uses NCHW layout these days, so the transpose is needed.

* fix bug in layer norm kernel

Don't use the same sum variable to compute the sum of differences.
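
An illustrative layer-norm reduction showing this fix together with the warp check mentioned a few commits above: per-warp partials are written by lane 0 of every warp (not just thread 0), and the variance pass gets its own accumulator. This is a generic sketch, not the actual lc0 kernel (scale/bias terms omitted):

```cuda
// Generic sketch: one block normalizes one row of length n. Dynamic shared
// memory must hold one float per warp; blockDim.x is assumed to be a
// multiple of 32 so the full-mask warp shuffles are valid.
__global__ void LayerNormRow(const float* x, float* y, int n) {
  extern __shared__ float warp_partials[];
  const float* row = x + blockIdx.x * n;
  const int num_warps = (blockDim.x + 31) / 32;

  // Pass 1: mean.
  float sum = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) sum += row[i];
  for (int o = 16; o > 0; o >>= 1) sum += __shfl_down_sync(0xffffffff, sum, o);
  if ((threadIdx.x & 31) == 0)   // lane 0 of *each* warp, not just thread 0
    warp_partials[threadIdx.x >> 5] = sum;
  __syncthreads();
  float mean = 0.0f;
  for (int w = 0; w < num_warps; ++w) mean += warp_partials[w];
  mean /= n;
  __syncthreads();  // warp_partials is reused below

  // Pass 2: variance, with its own accumulator (not the mean's sum variable).
  float var_sum = 0.0f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    const float diff = row[i] - mean;
    var_sum += diff * diff;
  }
  for (int o = 16; o > 0; o >>= 1)
    var_sum += __shfl_down_sync(0xffffffff, var_sum, o);
  if ((threadIdx.x & 31) == 0) warp_partials[threadIdx.x >> 5] = var_sum;
  __syncthreads();
  float var = 0.0f;
  for (int w = 0; w < num_warps; ++w) var += warp_partials[w];
  var /= n;

  const float inv_std = rsqrtf(var + 1e-6f);
  for (int i = threadIdx.x; i < n; i += blockDim.x)
    y[blockIdx.x * n + i] = (row[i] - mean) * inv_std;
}
```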

* fix loading of weights for attention policy

another silly bug!

* fix concat bug

* fix bug introduced in code refactoring

the row major matrix multiply for conv1layer is special :-/

* fixes from Tilps

* remove hardcoding of encoder_head_count

* update net.proto

- adjust scratch/tensor memory computation to avoid corner cases.

* Add FP32 support

* move fp16 utils to common path

- and use that inside cuda backend.

* forgot to add the moved files to utils

* remove duplicate enum

* fix build failures

- at least try to...

* remove ip4_pol_b_

 - bias not used for the "ppo" layer

* fix comment

* fix tensor/scratch size calculation

* remove debug code/fix output size
…sZero#1675)

* Workaround (WAR) for cublas issue with version 11+

- set CUBLAS_PEDANTIC_MATH to avoid cublas making use of tensor math on TU11x GPUs.
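
For reference, applying the workaround amounts to a one-line cuBLAS call; detecting the affected GPU/library combination is assumed to happen elsewhere and is hedged here as a boolean:

```cpp
#include <cublas_v2.h>

// Minimal sketch: force pedantic math so cuBLAS will not pick tensor-op
// paths on the affected GPUs. Detecting "cuBLAS 11+ on a TU11x GPU" is
// assumed to happen in the backend's init code and is passed in as a flag.
void ApplyCublasTensorMathWorkaround(cublasHandle_t handle, bool affected_gpu) {
  if (affected_gpu) cublasSetMathMode(handle, CUBLAS_PEDANTIC_MATH);
}
```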

* Update cuda_common.h

- as per borg's suggestion
* Fork of the active attention PR for me to mess with.

* Some 'fixes'?

* Some more fixes merged in or otherwise.

* Cleaning up.

* Update lczero-common

* Cleanups and adding mish net support - almost.

* more cudnn backend support cases, maybe...

* Fused supports act NONE - don't explode.

* Cleanup to let diff apply cleaner.

* Another cleanup.

* Merge fp32 from ankan.

* Merge another diff.

* Another merge.

* added checks (#19)

* added checks

* warning fixes

Co-authored-by: borg323 <[email protected]>

* Auto format common_kernels.cu

* forgot to hit save.

* autoformat fp16_kernels.cu

* auto format kernels.h

* autoformat layers.cc

* autoformat layers.h

* autoformat network_cuda.cc

* autoformat network_cudnn.cc

* Autoformat network_legacy.cc

* autoformat winograd_helper.inc

* cudnn backend does support mish.

Co-authored-by: borg323 <[email protected]>
Co-authored-by: borg323 <[email protected]>
* misc changes to cudnn backend

- replace all cudaMemcpyAsync used for loading weights with cudaMemcpy, as the source (in CPU memory) could be deleted before the async version of the function actually does the copy; see the sketch after this list.
- minor naming/style changes.
- add comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
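
A minimal sketch of the synchronous-copy point; the function and variable names are illustrative, not the actual backend code:

```cpp
#include <vector>
#include <cuda_runtime.h>

// Illustrative only: weights are uploaded with the blocking cudaMemcpy.
// An async copy could still be pending when the temporary host-side buffer
// is freed, which is exactly the hazard the change above removes.
void UploadWeights(float* device_weights, const std::vector<float>& host_weights) {
  cudaMemcpy(device_weights, host_weights.data(),
             host_weights.size() * sizeof(float), cudaMemcpyHostToDevice);
  // Safe: cudaMemcpy returns only after the copy has completed, so
  // host_weights may be destroyed as soon as this function returns.
}
```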

* fix typo in comment

* clang-format

* address review comment

* Add 320 and 352 channel support for fused SE layer

- just add template instantiations.
- verified that it works and provides a (very) slight speedup.

* Update fp16_kernels.cu

* Simpler kernel for res-block fusion without SE

 - use a constant block size of 64, also splitting the channel dimension into multiple blocks as needed.
 - This allows arbitrarily large filter counts without running out of register file.
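
The launch shape this describes, sketched with assumed names and an assumed channel-slice size:

```cuda
#include <cuda_runtime.h>

// Illustrative launch shape only (not the real kernel): a fixed 64-thread
// block, with the channel dimension split across grid.y so that any filter
// count fits without exhausting the register file.
constexpr int kBlockSize = 64;
constexpr int kChannelsPerBlock = 64;  // assumed slice size

dim3 ResBlockGrid(int batch, int channels) {
  const int channel_slices =
      (channels + kChannelsPerBlock - 1) / kChannelsPerBlock;
  return dim3(batch, channel_slices);
}
// Usage (hypothetical kernel name):
//   ResBlockNoSE<<<ResBlockGrid(batch, channels), dim3(kBlockSize)>>>(...);
```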

* minor refactoring

 - allow using the res-block fusion optimization for alternate layers (those that don't have SE) even on GPUs that don't have enough shared memory.

* minor functional fix

* a few more fixes to get correct output

hopefully functionally correct now.

* fix cudnn backend build

 - missed the fact that it also uses Res block fusion :-/

* fix build errors

* some more fixes

* minor cleanup

* remove --use_fast_math

- as it doesn't improve performance.
- some minor cleanup

* fix indentation
* attention opts

* minor fix from prev PR

* minor fixes

- use the activation as a template param for the addBiasBatched kernel; just a 2-4 microsecond improvement, and only on GA100.
 - fix build break.
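
The template-parameter idea, sketched below; addBiasBatched is the kernel named in the commit, but this signature and the Activation enum are simplified assumptions:

```cuda
// Simplified sketch: making the activation a compile-time parameter lets the
// compiler drop the per-element branches entirely for each instantiation,
// e.g. addBiasBatched<Activation::kMish><<<grid, block>>>(...).
enum class Activation { kNone, kRelu, kMish };

template <Activation act>
__global__ void addBiasBatched(float* out, const float* in, const float* bias,
                               int batch, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= batch * n) return;
  float v = in[i] + bias[i % n];                        // per-channel bias
  if (act == Activation::kRelu) v = fmaxf(v, 0.0f);
  if (act == Activation::kMish) v = v * tanhf(logf(1.0f + expf(v)));
  out[i] = v;
}
```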

* fix tensor size calc

* fix incorrect input to promotion logits kernel