forked from LeelaChessZero/lc0
Rescore tb #48 (Open)
Naphthalin wants to merge 106 commits into Naphthalin:rescore_tb from Tilps:rescore_tb
3% increase in nps on benchmark. 100% increase in nps in #1 positions.
…o#1503)
* Early exit from gathering if all picks are collisions and the backend is idle (see the sketch below).
* Fix logic.
* Even more aggressive.
* Don't enable if threads=1.
* Parameterise the behavior.
Co-authored-by: borg323 <[email protected]>
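A minimal sketch of the early-exit idea above, with invented names (`GatherState`, `ShouldExitGatheringEarly`) rather than lc0's actual search code:

```cpp
// Hypothetical helper, not lc0's search code: stop the gathering loop early
// when every pick so far was a collision and the compute backend has nothing
// queued, so the GPU isn't left idle while the picker spins on collisions.
struct GatherState {
  int picked = 0;      // Nodes picked into the current batch.
  int collisions = 0;  // How many of those picks were collisions.
};

bool ShouldExitGatheringEarly(const GatherState& s, bool backend_idle,
                              int num_search_threads) {
  if (num_search_threads <= 1) return false;  // Not enabled for threads=1.
  if (!backend_idle) return false;  // Backend busy; keep filling the batch.
  // All picks so far were collisions: dispatch what we have right away.
  return s.picked > 0 && s.collisions == s.picked;
}
```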
* Actively split pick tasks as early as possible; should reduce picking latency.
* Formatting.
* HashKeyedCache: has less pointer chasing than the previous version.
* Fix one critical bug and one tiny bug. The first item in each run after an erase/swap_to_erased was incorrectly *always* considered to be InRange, so unless the run continued past that point, neither it nor anything else would be swapped into the correct position. The item could then no longer be found in the cache. Luckily the Evict code is overzealous and will always remove the item eventually (at a large CPU cost) to avoid cache flood, but not being able to find an item in the cache normally means unpin doesn't work, so such items all go into an ever-growing evict list, which progressively makes unpin slower and slower. (See the backward-shift sketch below for the kind of "in range" check involved.) A return path was also missing when removing a pin from the evict list that wasn't the last pin, meaning it would pointlessly search the main list, fail, and then trigger the debug assert (if debug asserts were enabled in the build...).
* Fix some comments.
* Code review and a bug fix.
Co-authored-by: borg323 <[email protected]>
Co-authored-by: borg323 <[email protected]>
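The InRange bug above concerns erase handling in an open-addressing table built from probe "runs". The following backward-shift-deletion sketch uses invented names (`ProbingCache`, `Slot`) and is not the actual HashKeyedCache; it only shows why the "in range" check has to be a real test for every element after the hole, including the first one:

```cpp
#include <cstdint>
#include <vector>

struct Slot {
  uint64_t key = 0;
  bool used = false;
};

class ProbingCache {
 public:
  explicit ProbingCache(size_t capacity) : slots_(capacity) {}

  // Assumes the table is never completely full.
  void Erase(size_t hole) {
    slots_[hole].used = false;
    // Backward-shift deletion: walk the rest of the run and move back any
    // element whose ideal bucket is "in range" of the hole. If the first
    // element after the hole were unconditionally treated as movable (the bug
    // described above), elements could end up unreachable from their ideal
    // bucket and lookups would fail.
    size_t i = Next(hole);
    while (slots_[i].used) {
      const size_t ideal = slots_[i].key % slots_.size();
      if (InRange(ideal, hole, i)) {  // Must be a real check, not "always true".
        slots_[hole] = slots_[i];
        slots_[i].used = false;
        hole = i;
      }
      i = Next(i);
    }
  }

 private:
  size_t Next(size_t i) const { return (i + 1) % slots_.size(); }
  // True if the element currently at position i (ideal bucket `ideal`) may
  // legally be moved into the hole without breaking its probe sequence.
  bool InRange(size_t ideal, size_t hole, size_t i) const {
    const size_t n = slots_.size();
    const size_t dist_to_cur = (i + n - ideal) % n;
    const size_t dist_to_hole = (hole + n - ideal) % n;
    return dist_to_hole <= dist_to_cur;
  }
  std::vector<Slot> slots_;
};
```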
Also do the softmax after unlocking the GPU.
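A minimal sketch of that change, with illustrative names rather than the actual backend code: only the GPU inference runs while holding the mutex that serializes backend access, and the CPU-side softmax happens after the lock is released.

```cpp
#include <algorithm>
#include <cmath>
#include <mutex>
#include <vector>

std::mutex gpu_mutex;  // Illustrative: serializes access to the GPU backend.

void Softmax(std::vector<float>& v) {
  if (v.empty()) return;
  const float max_v = *std::max_element(v.begin(), v.end());
  float sum = 0.0f;
  for (float& x : v) sum += (x = std::exp(x - max_v));
  for (float& x : v) x /= sum;
}

template <typename RunGpuFn>
std::vector<float> EvaluatePolicy(RunGpuFn run_gpu_inference) {
  std::vector<float> logits;
  {
    std::lock_guard<std::mutex> lock(gpu_mutex);
    logits = run_gpu_inference();  // GPU work only, while holding the lock.
  }
  Softmax(logits);  // Done after unlocking, so other threads can use the GPU.
  return logits;
}
```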
…sZero#1526)
* Apply drift correction to q and d values in training data (see the sketch below).
* Increase eps based on testing.
* Switch to CERR.
* Update allowed_eps based on testing with a non-broken backend... Also include actual detail in the log messages.
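A hedged sketch of what drift correction on (q, d) can look like; `CorrectDrift` and `kAllowedEps` are illustrative names and an assumed tolerance, not the rescorer's actual code. The constraint is that q ∈ [-1, 1], d ∈ [0, 1], and d ≤ 1 - |q| so the implied W/D/L values are valid probabilities; small drift is corrected silently and larger deviations are reported on stderr (cf. "Switch to CERR" above).

```cpp
#include <cmath>
#include <iostream>

constexpr float kAllowedEps = 0.005f;  // Assumed tolerance, for illustration only.

bool CorrectDrift(float& q, float& d) {
  const float orig_q = q, orig_d = d;
  q = std::fmax(-1.0f, std::fmin(1.0f, q));
  d = std::fmax(0.0f, std::fmin(1.0f, d));
  if (d > 1.0f - std::fabs(q)) d = 1.0f - std::fabs(q);
  const float drift = std::fmax(std::fabs(q - orig_q), std::fabs(d - orig_d));
  if (drift > kAllowedEps) {
    // Large drift suggests broken data rather than rounding error; include
    // the actual values in the log message.
    std::cerr << "q/d drift too large: q " << orig_q << " -> " << q
              << ", d " << orig_d << " -> " << d << std::endl;
    return false;
  }
  return true;
}
```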
…her moves aren't proven losses (LeelaChessZero#1521)
* Don't divide by zero after the instamove (which would make all remaining moves instamoves).
* Float literal is nicer.
* Clamp smooth to ensure move_overhead is respected (see the sketch below).
* Fix for circleci compilation.
Forgotten during the 0.28 release.
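A small sketch of the clamping idea, assuming an invented `ClampBudgetMs` helper rather than the smooth time manager's real code: whatever the smoothed budget works out to, it is capped at the remaining time minus move_overhead so the engine can never think past the clock.

```cpp
#include <algorithm>
#include <cstdint>

int64_t ClampBudgetMs(int64_t smoothed_budget_ms, int64_t remaining_time_ms,
                      int64_t move_overhead_ms) {
  // Hard limit derived from the clock; move_overhead is always respected.
  const int64_t hard_limit =
      std::max<int64_t>(0, remaining_time_ms - move_overhead_ms);
  return std::min(smoothed_budget_ms, hard_limit);
}
```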
LeelaChessZero#1661)
* Use a special kernel for handling >384-filter res block fusion that allocates the board[8][8] array in shared memory instead of registers.
* This works only on Ampere GPUs because the kernel needs more shared memory than the 64KB available on Turing (see the sketch below).
Authored-by: gmorenz <[email protected]>
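A hedged CUDA sketch of the mechanism behind the >64KB shared memory requirement; `FusedResBlockKernel` and `LaunchIfSupported` are placeholders, not the real fused kernel. Dynamic shared memory above the default carve-out has to be opted into with `cudaFuncSetAttribute`, and the launch is only attempted when the device's opt-in limit (queried via `cudaDevAttrMaxSharedMemoryPerBlockOptin`) is large enough, which rules out Turing at this size.

```cpp
#include <cuda_runtime.h>

__global__ void FusedResBlockKernel(float* out, const float* in) {
  extern __shared__ float board_smem[];  // board[8][8] per channel lives here.
  // Placeholder body: the real kernel stages activations in board_smem
  // instead of registers.
  (void)board_smem; (void)out; (void)in;
}

bool LaunchIfSupported(float* out, const float* in, int filters) {
  const size_t smem_bytes = static_cast<size_t>(filters) * 8 * 8 * sizeof(float);
  int max_optin = 0, device = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&max_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin,
                         device);
  if (static_cast<size_t>(max_optin) < smem_bytes) return false;  // e.g. Turing.
  // Opt in to the larger dynamic shared memory carve-out (needed above 48 KB).
  cudaFuncSetAttribute(FusedResBlockKernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(smem_bytes));
  FusedResBlockKernel<<<dim3(1), dim3(64), smem_bytes>>>(out, in);
  return cudaGetLastError() == cudaSuccess;
}
```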
* add some code for parsing attention policy weights; cuda backend changes TODO
* add WIP changes for attention policy networks
* Add support for computing PromotionLogits - still totally untested
* add support for attention policy map - time to test now?
* minor fixes from Arcturai
  - fix build errors in cudnn backend.
  - update weights.encoder_pol() -> weights.encoder() to match new net.proto
* fix a couple of bugs in the Layer norm and Softmax kernels:
  - shared memory not initialized
  - check for first thread in warp incorrect.
* fix taking a pointer to an encoder object allocated on the stack!
* fix transpose: even the cuda fp16 backend uses nchw layout these days, so the transpose is needed.
* fix bug in layer norm kernel: don't use the same sum variable to compute the sum of differences (see the layer-norm sketch below).
* fix loading of weights for attention policy - another silly bug!
* fix concat bug
* fix bug introduced in code refactoring: the row-major matrix multiply for conv1layer is special :-/
* fixes from Tilps
* remove hardcoding of encoder_head_count
* update net.proto - adjust scratch/tensor memory computation to avoid corner cases.
* Add FP32 support
* move fp16 utils to common path - and use them inside the cuda backend.
* forgot adding the moved files to utils
* remove duplicate enum
* fix build failures - at least try to...
* remove ip4_pol_b_ - bias not used for the "ppo" layer
* fix comment
* fix tensor/scratch size calculation
* remove debug code/fix output size
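The layer-norm bug called out above ("don't use the same sum variable to compute the sum of differences") and the uninitialized shared memory issue are easiest to see in a simplified kernel. The sketch below is illustrative only (it assumes blockDim.x is a power of two and is not the backend's actual layer norm): the mean and variance reductions each get their own accumulator, and every thread writes its shared-memory slot before any reduction reads it.

```cpp
#include <cuda_runtime.h>

__global__ void LayerNormRow(float* out, const float* in, int width,
                             float eps) {
  extern __shared__ float scratch[];  // blockDim.x floats of scratch.
  const float* row = in + blockIdx.x * width;
  float* out_row = out + blockIdx.x * width;

  // First pass: mean.
  float sum = 0.0f;
  for (int i = threadIdx.x; i < width; i += blockDim.x) sum += row[i];
  scratch[threadIdx.x] = sum;  // Every thread initializes its slot.
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
    __syncthreads();
  }
  const float mean = scratch[0] / width;
  __syncthreads();  // Everyone has read scratch[0] before it is reused.

  // Second pass: variance, using a *separate* accumulator.
  float sq_sum = 0.0f;
  for (int i = threadIdx.x; i < width; i += blockDim.x) {
    const float d = row[i] - mean;
    sq_sum += d * d;
  }
  scratch[threadIdx.x] = sq_sum;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
    __syncthreads();
  }
  const float inv_std = rsqrtf(scratch[0] / width + eps);

  for (int i = threadIdx.x; i < width; i += blockDim.x) {
    out_row[i] = (row[i] - mean) * inv_std;
  }
}
```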
…sZero#1675)
* WAR for cublas issue with version 11+ - set CUBLAS_PEDANTIC_MATH to avoid cublas making use of tensor math on TU11x GPUs (see the sketch below).
* Update cuda_common.h - as per borg's suggestion
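A minimal sketch of that workaround, assuming an illustrative `DisableTensorMath` wrapper around the backend's cuBLAS handle setup rather than the actual cuda_common.h change:

```cpp
#include <cublas_v2.h>

cublasStatus_t DisableTensorMath(cublasHandle_t handle) {
  // CUBLAS_PEDANTIC_MATH restricts cuBLAS 11+ to the plain FP32/FP64 code
  // paths, avoiding tensor-math dispatch on the affected TU11x GPUs.
  return cublasSetMathMode(handle, CUBLAS_PEDANTIC_MATH);
}
```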
* Fork of the active attention PR for me to mess with.
* Some 'fixes'?
* Some more fixes merged in or otherwise.
* Cleaning up.
* Update lczero-common
* Cleanups and adding mish net support - almost.
* more cudnn backend support cases, maybe...
* Fused supports act NONE - don't explode.
* Cleanup to let diff apply cleaner.
* Another cleanup.
* Merge fp32 from ankan.
* Merge another diff.
* Another merge.
* added checks (#19)
  - added checks
  - warning fixes
  Co-authored-by: borg323 <[email protected]>
* Auto format common_kernels.cu
* forgot to hit save.
* autoformat fp16_kernels.cu
* auto format kernels.h
* autoformat layers.cc
* autoformat layers.h
* autoformat network_cuda.cc
* autoformat network_cudnn.cc
* Autoformat network_legacy.cc
* autoformat winograd_helper.inc
* cudnn backend does support mish.
Co-authored-by: borg323 <[email protected]>
Co-authored-by: borg323 <[email protected]>
* misc changes to cudnn backend
  - replace all cudaMemcpyAsync used for loading weights with cudaMemcpy, as the source (in CPU memory) could be deleted before the async version of the function actually does the copy (see the sketch below).
  - minor naming/style changes.
  - add a comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
* fix typo in comment
* clang-format
* address review comment
* Add 320 and 352 channel support for fused SE layer
  - just add template instantiations.
  - verified that it works and provides a (very) slight speedup.
* Update fp16_kernels.cu
* Simpler kernel for res-block fusion without SE
  - use a constant block size of 64, splitting the channel dimension into multiple blocks as needed.
  - this allows arbitrarily large filter counts without running out of register file.
* minor refactoring
  - allow using the res block fusing opt for alternate layers (that don't have SE) even on GPUs that don't have enough shared memory.
* minor functional fix
* a few more fixes to get correct output - hopefully functionally correct now.
* fix cudnn backend build - missed the fact that it also uses res block fusion :-/
* fix build errors
* some more fixes
* minor cleanup
* remove --use_fast_math
  - as it doesn't improve performance.
  - some minor cleanup
* fix indentation
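A sketch of the hazard the first item above guards against, with an invented `UploadWeights` helper: if the host-side weight buffer is a temporary, the commit's position is that an async copy cannot be trusted to have read it before it is freed, so a synchronous cudaMemcpy is used instead.

```cpp
#include <cuda_runtime.h>
#include <vector>

void UploadWeights(float* device_dst, const std::vector<float>& host_weights) {
  // Synchronous copy: safe even if host_weights is destroyed right after this
  // function returns. With cudaMemcpyAsync, the caller would have to keep the
  // source buffer alive until the copy is known to have completed.
  cudaMemcpy(device_dst, host_weights.data(),
             host_weights.size() * sizeof(float), cudaMemcpyHostToDevice);
}
```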
* attention opts
* minor fix from prev PR
* minor fixes
  - use activation as a template param for the addBiasBatched kernel; just a 2-4 microsecond improvement, and only on GA100 (see the sketch below).
  - fix build break.
* fix tensor size calc
* fix incorrect input to promotion logits kernel
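An illustrative sketch of passing the activation as a compile-time template parameter so the per-element branch disappears; `AddBiasBatched`, `Activate`, and the dispatch below are invented names, not the backend's real kernel, and `if constexpr` assumes the build uses C++17.

```cpp
#include <cuda_runtime.h>

enum class Activation { kNone, kRelu, kMish };

template <Activation act>
__device__ __forceinline__ float Activate(float x) {
  if constexpr (act == Activation::kRelu) return x > 0.0f ? x : 0.0f;
  if constexpr (act == Activation::kMish) return x * tanhf(log1pf(expf(x)));
  return x;  // Activation::kNone
}

// Data layout assumed [n][c] with c contiguous; bias is per channel.
template <Activation act>
__global__ void AddBiasBatched(float* data, const float* bias, int n, int c) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n * c) return;
  data[idx] = Activate<act>(data[idx] + bias[idx % c]);
}

// Dispatch once on the host; the inner loop has no runtime activation branch.
void LaunchAddBias(float* data, const float* bias, int n, int c,
                   Activation act, cudaStream_t stream) {
  const int total = n * c;
  const int block = 256, grid = (total + block - 1) / block;
  switch (act) {
    case Activation::kRelu:
      AddBiasBatched<Activation::kRelu><<<grid, block, 0, stream>>>(data, bias, n, c);
      break;
    case Activation::kMish:
      AddBiasBatched<Activation::kMish><<<grid, block, 0, stream>>>(data, bias, n, c);
      break;
    default:
      AddBiasBatched<Activation::kNone><<<grid, block, 0, stream>>>(data, bias, n, c);
  }
}
```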
update branch