forked from pytorch/FBGEMM
Add RDNA support #62
Draft
avbokovoy wants to merge 165 commits into ROCm:main from avbokovoy:rdna-support
+2,940 −2,949
Conversation
* Hipify code
* Add correctness check
* Revert "Add correctness check". This reverts commit a7f169dcc862e5cc8102a39eb3b7882dfa888f1b.
* Fix setup.py
* Add run_all.sh
* Update Zipf index generation: generate unique indices in each bag and shuffle indices to avoid spatial locality. Code reference: https://github.com/pytorch/FBGEMM/blob/7588d9d804826b428fc0e4fd418e9cc3f7a72e52/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py#L98-L117
* Fix ROCm version check in fbgemm_gpu's setup.py
* Fix hipification errors. Some ops/kernels, including merge_pooled_embeddings, quantize_ops, and the embedding_forward_quantized_split ops, are currently disabled; they will be enabled in the future.
* Disable AVX512 for AMD CPUs. AMD CPUs do not support AVX512, so it has to be disabled in ROCm builds.
* Update run_all.sh
* Fix __launch_bounds__ with kWarpSize.
* Fix missing '#endif' in codegen/embedding_backward_code_generator.py
* Fix the dependencies import in setup.py
* Debug enum cudaMemoryAdvise
* Bypass the issue that both cudaMemoryAdvise and cudaMemAdvise are mapped to hipMemAdvise in cumem_utils.cu
* Build and import successfully, but with NaN values.
* NaN values are eliminated by bypassing res.vals[0] = hfma2(
* Remove debug lines in include/fbgemm_gpu/fbgemm_cuda_utils.cuh

Note: the fbgemm-gpu tests do not pass; they will be addressed in future commits.

Co-authored-by: Sarunya Pumma <[email protected]>
Co-authored-by: Li Li <[email protected]>
Co-authored-by: liligwu <[email protected]>
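The `__launch_bounds__`/kWarpSize fix above touches the core portability issue behind this PR: CUDA warps are 32 lanes wide, AMD CDNA wavefronts are 64, and RDNA (the target of this PR) runs 32-wide waves. Below is a minimal sketch of the pattern, with illustrative names and a simplified platform guard, not the exact FBGEMM code:

```cpp
// Illustrative sketch: pick the warp/wavefront width at compile time so that
// __launch_bounds__ and per-warp loops stay correct on both CUDA and ROCm.
// The guard macro and constants here are assumptions for the example.
#ifdef __HIP_PLATFORM_AMD__
constexpr int kWarpSize = 64;  // CDNA wavefront; RDNA wave32 needs 32 instead
#else
constexpr int kWarpSize = 32;  // CUDA warp
#endif

constexpr int kWarpsPerBlock = 4;

// Launch bounds must be a compile-time constant, so derive them from
// kWarpSize rather than hard-coding 32.
__global__ __launch_bounds__(kWarpSize * kWarpsPerBlock) void scale_kernel(
    float* data,
    int n,
    float alpha) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    data[idx] *= alpha;
  }
}
```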
ROCm 4.3/develop: use SHEFL_SYNC_MACRO to replace __shfl() and __shfl_sync().
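A sketch of what such a portability macro can look like; the actual SHEFL_SYNC_MACRO definition in this branch may differ, and the platform guard is illustrative. CUDA code calls `__shfl_sync(mask, ...)` while HIP only provides the maskless `__shfl(...)`, so a single macro lets shared kernel code compile on both:

```cpp
// Assumed names for illustration only; not the exact macro in this PR.
#ifdef __HIP_PLATFORM_AMD__
#define SHFL_SYNC(mask, val, srcLane) __shfl((val), (srcLane))
#else
#define SHFL_SYNC(mask, val, srcLane) __shfl_sync((mask), (val), (srcLane))
#endif

// Example: broadcast lane 0's value to every lane of the warp/wavefront.
__device__ float broadcast_lane0(float v) {
  return SHFL_SYNC(0xffffffffu, v, 0);
}
```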
* Change hipify dependency from torch.utils.torch_hipify to hipify_torch.
* Add third_party/hipify_torch to the git repo.
* Unify function signature of jagged_xD_to_dense (pytorch#813). Summary: Pull Request resolved: pytorch#813. As title. Reviewed By: jiaqizhai, jianyuh. Differential Revision: D33066551. fbshipit-source-id: 8e2fd3c21f3bde67c6b20045681c2549e3583bd3
* Daily `arc lint --take CLANGFORMAT`. Reviewed By: zertosh. Differential Revision: D33183467. fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de
* Assert Tensors allocated on GPU. (pytorch#819) Summary: Pull Request resolved: pytorch#819. Check inputs for correctness with respect to GPU allocation and device. Reviewed By: jspark1105, jianyuh. Differential Revision: D33167469. fbshipit-source-id: 04f638d13bde93373d64cff1428ef743300400a6
* Support batched benchmark execution and fix benchmark stats reporting (pytorch#818). Summary: Pull Request resolved: pytorch#818. As title, support multiple executions of benchmark scripts and report an aggregated metric. Further, require the `--bag-size` argument to conform to the input data file for proper metric accounting. Reviewed By: jianyuh. Differential Revision: D33182257. fbshipit-source-id: a6eeeb25646c00665b6d29df9389eddab7618d4e
* Direct Convolution JIT assembly for KH=2, KW=6. Summary: this diff has specialized codegen for the convolution case where KH=2 and KW=6.

  Performance results on a local devserver with AVX2 instructions:
  - 1, 16, 16, {2, 126}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false: FBGEMM baseline 3.8 GOPS, this diff 9.2 GOPS
  - 1, 64, 64, {2, 257}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0}, {1, 1}, {0, 0}, false: FBGEMM baseline 43.8 GOPS, this diff 61.2 GOPS

  How to invoke the direct convolution function:

  **At offline time:**
  1. Weights need to be transposed to (oc/8) - (kh) - (kw) - (ic/4) - 8 - 4
  2. Create the convolution function based on the problem size:
  ```
  CodeGenBase<uint8_t, int8_t, int32_t, int32_t> codeObj;
  CodeGenBase<uint8_t, int8_t, int32_t, int32_t>::jit_micro_kernel_fp fn;
  fn = codeObj.getOrCreateDirectConv<inst_set_t::avx2>(
      true,
      conv_p.OUT_DIM[1],
      conv_p.IN_DIM[1] * conv_p.IC,
      conv_p.stride[1] * conv_p.IC);
  ```
  3. Compute the *col_offsets* of the weight tensor
  4. Make sure you have allocated space for the output tensors (Cint32_fb, Cint8_fb) and some temporary space for the input rowsum (InSum: IN_DIM[0] x IN_DIM[1], rowSum: OUT_DIM[0] x OUT_DIM[1])

  **Online:** make sure we have conv_p (the problem info), Aint8 (the input tensor), bBuf_tr (the transposed weight tensor), Cint32_fb (the 32-bit results after accumulation), and Cint8_fb (the final quantized 8-bit output).
  ```
  // compute direct conv row sum
  directConvRowSum(conv_p, Aint8.data(), inSum, rowSum, row_offsets.data());

  // kernel for direct convolution
  for (int oc = 0; oc < conv_p.OC; oc += 8) {
    fn(Aint8.data(),
       bBuf_tr.data() + oc * kernel_dim * conv_p.IC,
       bBuf_tr.data(),
       Cint32_fb.data() + oc,
       conv_p.IC * conv_p.K[1],
       conv_p.OC);
  }

  requantizationParams_t<> reqObj = {
      Aint8_zero_point, // Aq_zero_point
      Bint8_zero_point.data(),
      C_zero_point,
      C_multiplier.data(),
      rowSum, // row_offsets
      // row_offsets.data(),
      col_offsets.data(), // col_offsets
      nullptr, // bias
      static_cast<std::uint32_t>(conv_p.OC), // ncols
      1, // groups
      nullptr};

  requantizeOutputProcessingAvx2<
      false,
      false,
      QuantizationGranularity::TENSOR,
      false,
      false>(
      Cint8_fb.data(),
      Cint32_ref.data(),
      {0, conv_p.OUT_DIM[1] * conv_p.OUT_DIM[0], 0, conv_p.OC},
      conv_p.OC,
      conv_p.OC,
      reqObj);
  ```

  For more details please refer to test_asmjit2.cc. Reviewed By: dskhudia. Differential Revision: D31775222. fbshipit-source-id: 294450613b0978277e75d171d6a560124c14ecda

* Suppress errors in `deeplearning/fbgemm/fbgemm_gpu`. Differential Revision: D33201593. fbshipit-source-id: 251f338e03dfde1dcc4a83c4ff9df1fe27840bdb
* Fix copyright header of batch_benchmark_run.py (pytorch#820). Summary: Pull Request resolved: pytorch#820. As title. Reviewed By: jianyuh. Differential Revision: D33213812. fbshipit-source-id: d901e87ff1047ff969c99a330aa05c8d26e1954e
* Assert Tensors allocated on GPU for generated code. (pytorch#821) Summary: Pull Request resolved: pytorch#821. Check inputs for correctness with respect to GPU allocation and device. Reviewed By: jspark1105. Differential Revision: D33189944. fbshipit-source-id: 36fb5eac677466e783ef5a754c28b6d838ea09b7
* Move all fbgemm_gpu provided Python ops from the fb namespace to the fbgemm namespace. (pytorch#823) Summary: Pull Request resolved: pytorch#823. Reviewed By: jianyuh. Differential Revision: D33147038. fbshipit-source-id: fdcb667dfb920b4f04b7d0b08082afabe7213cc1
* Implement generic HBC by feature. (pytorch#822) Summary: Pull Request resolved: pytorch#822. Implement a generic version of HBC by feature, which takes in bin_boundaries. Reviewed By: jianyuh. Differential Revision: D33232676. fbshipit-source-id: 99c77f6d081fdc89699948a6c9482b8806f598a3
* Benchmark for newly added generic HBC by feature. (pytorch#826) Summary: Pull Request resolved: pytorch#826. More benchmarking for the new op, and also add "double" as a benchmarking type. Reviewed By: jianyuh. Differential Revision: D33241845. fbshipit-source-id: 38f08f5453fd8d112ff55c046a6ac091c23bc3de
* Always set dontfork on managed Tensor + new uvm clone (pytorch#824). Summary: Pull Request resolved: pytorch#824. Workaround for S256045. UVM Tensors are unmapped from the process page table on fork (spawn). The UVM fault handler then slows down the UVM CPU<->CPU copy substantially while reestablishing those mappings. The workaround sets MADV_DONTFORK on the addresses (rounded down to page size) of UVM allocations; this prevents the removal of UVM pages from the original process page table. Additionally, this introduces a single-threaded UVM->CPU tensor copy to 1) avoid 8 trainers on a host concurrently occupying all threads with copy_ and 2) avoid high concurrency in the fault handler of the UVM kernel driver.
  Reviewed By: jianyuh. Differential Revision: D33192043. fbshipit-source-id: 094f3dcd302d455efbf4e912d58ed28756cb653f

* Use kWarpSize for warp size (pytorch#827). Summary: Pull Request resolved: pytorch#827. Reviewed By: rweyrauch. Differential Revision: D33271792. fbshipit-source-id: dc66b6950b37e5d92c10406a3891568a7500e26e
* Move fb.embedding_bag_rowwise_prune to fbgemm_gpu OSS. (pytorch#825) Summary: Pull Request resolved: pytorch#825. Move the fb.embedding_bag_rowwise_prune op from caffe2/fb/sparsenn to fbgemm_gpu. Reviewed By: jianyuh. Differential Revision: D33240318. fbshipit-source-id: 4db93a1ecd9666881779eeada1e3e493aa7525e4
* Allow optional Tensor args to be empty or on GPU. (pytorch#828) Summary: Pull Request resolved: pytorch#828. Reviewed By: jianyuh. Differential Revision: D33267641. fbshipit-source-id: b193ee5b7e9ea946a20672760c320f29b217b998
* Add output_dtype to the training TBE op for CPU (pytorch#829). Summary: Pull Request resolved: pytorch#829. This diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)). Reviewed By: jianyuh. Differential Revision: D32969921. fbshipit-source-id: 695e54434dc4f65f9f4c60782c60a550e38d97a7
* Fix copyright header of tensor_assert_test.cpp (pytorch#831). Summary: Pull Request resolved: pytorch#831. As title. Reviewed By: rweyrauch. Differential Revision: D33310866. fbshipit-source-id: 1cbdee1d7c00f0e900faac570bac330866887b1c
* Add permute_pooled_embedding_modules_test into RE (pytorch#830). Summary: Pull Request resolved: pytorch#830. As title. Reviewed By: rweyrauch. Differential Revision: D33303898. fbshipit-source-id: c94a14bc398ecb58b68ca15d7e79204233ac67d1
* Use an all-to-one op to do DtoD copies between remote and merge devices (pytorch#817). Summary: Pull Request resolved: pytorch#817. Previously we were simply calling `Tensor.to` to launch the DtoD copy. Since PyTorch does a two-way barrier for DtoD copies, all the DtoD copies are serialized even though they are launched from different devices. See the blue DtoD copies in the graph below. {F686842812} At first I went for merge_pooled_embedding directly, but I forgot that MRS models also have sequence embeddings; covering pooled embeddings is not enough in this case. This diff introduces a function that takes in a tuple of IValues, moves the underlying tensors to a given target device, and outputs a vector of IValues with the underlying tensors on that device. For each source device, we synchronize its current stream and launch all the copies for tensors on that device. Then we synchronize the current stream on the target device to wait on all the copies. Now the copies from different devices can run in parallel (a sketch of this pattern follows this commit list). {F686843333} Reviewed By: yinghai, jianyuh, houseroad. Differential Revision: D33065710. fbshipit-source-id: f479fa2ea20702e14419c8b87024a87d5bbb1a68
* Add MSFP option for ads hpc model numeric emulations (pytorch#832). Summary: Pull Request resolved: pytorch#832. Add fake conversions between MSFP and fp32 in both the forward and backward pass of hpc ads model training. TODO: add compute kernels that split the FC operator into gemms for column_blocks of activations and row_blocks of weights. Reviewed By: jspark1105. Differential Revision: D30942234. fbshipit-source-id: 601d671fd00622304a50651dedffd0de3ae01ae0
* Remove benchmark CMakeLists.txt (pytorch#835). Summary: Pull Request resolved: pytorch#835. As title.
  This file is no longer needed after the decision to support only the setup.py OSS build approach. Reviewed By: jspark1105, rweyrauch. Differential Revision: D33318121. fbshipit-source-id: 4f71b23f6e9e7e78d50fab20af53cdf9f63844ad

* Increase code reuse between FP32, FP16, INT8, INT4 embedding types for infer TBE (pytorch#833). Summary: Pull Request resolved: pytorch#833. We merge the implementations for {FP32, FP16, INT8, INT4} weights in inference TBE into one unified template and increase the code reuse between these implementations. This will pave the way for future enhancements (no need to change all 4 implementations for one new feature). Reviewed By: rweyrauch. Differential Revision: D33343450. fbshipit-source-id: 24e59c4a2df5ef3da353535eb879a2365293bc1f
* Minimize functions defined in headers (pytorch#836). Summary: Pull Request resolved: pytorch#836. We had a lot of stuff that didn't need to be in header files. Split long source files. Put experimental quantization functions into an experimental namespace. Reviewed By: rweyrauch. Differential Revision: D33358916. fbshipit-source-id: cffcec344cbe565045ee2c564ce1cef529de4cf8
* Add missing C10_CUDA_KERNEL_LAUNCH_CHECK (pytorch#837). Summary: Pull Request resolved: pytorch#837. As title. Reviewed By: rweyrauch. Differential Revision: D33359025. fbshipit-source-id: 162dd2897a5d56e7ac8ff3ba9ae5c8689961204b
* Add seq embedding kernel for infer TBE (pytorch#834). Summary: Pull Request resolved: pytorch#834. Add sequence embedding support in the infer TBE kernel. TODO: "mask" solution for the duplicated embedding row access. cc jspark1105. Reviewed By: jspark1105. Differential Revision: D33341863. fbshipit-source-id: 47babe921dbaf086e2df92f4693b4718c01bcec1
* Add missing new files to CMakeLists.txt (pytorch#838). Summary: Pull Request resolved: pytorch#838. This was missed in D33358916 (pytorch@38a6c35). Reviewed By: colin2328. Differential Revision: D33370387. fbshipit-source-id: 72007f51afd6757690a1898098e8b6207c3c487b
* Support int32_t indices/offsets for cache handling logic (pytorch#811). Summary: Pull Request resolved: pytorch#811. In training, we assume the indices/offsets are int64_t for embedding (TBE), but in inference, we assume the indices/offsets are int32_t. This diff enables both int32_t and int64_t support for the caching logic so that we can reuse the same functions for both training and inference, while reducing the extra overhead of converting the indices/offsets from int to long or vice versa. Reviewed By: jspark1105. Differential Revision: D33045589. fbshipit-source-id: 4e508a1095536a629bdab8e5577db74310032b23
* Add seq embedding benchmark. Summary: 5x ~ 10x speedup at the benchmark level. Reviewed By: jspark1105. Differential Revision: D33355933. fbshipit-source-id: 2c609ae9ec5fd4fda48dbafa13b5eb75900fdf5f
* Fix warning count check in test_bounds_check (pytorch#839). Summary: Pull Request resolved: pytorch#839. On GPU, multiple threads in a thread block can increase the warning count for the same bound errors in the offset array. Reviewed By: jianyuh. Differential Revision: D33379301. fbshipit-source-id: b00520cc613bb7e15c9f8cd4bdf0c61bd4dbd83b
* Fix typo in CMakeLists.txt (pytorch#840). Summary: Pull Request resolved: pytorch#840. Fixing a silly typo. Reviewed By: jianyuh. Differential Revision: D33380967. fbshipit-source-id: 8220cc87a2564107cb124d3f9c31b8d92cb7d1a4
* Slight perf optimization for infer TBE (pytorch#843). Summary: Pull Request resolved: pytorch#843. ~5% perf improvement for INT4 / INT8 inference TBE on A100 GPUs.
  Reviewed By: jspark1105. Differential Revision: D33388153. fbshipit-source-id: 63566e3dccd9ce4775abb3374251f9046512e131

* Extract embedding input transpose out of embedding_backward_split_template.cu (pytorch#841). Summary: Pull Request resolved: pytorch#841. Refactoring to prepare D33381126. Other minor changes: remove the unused sorted_linear_indices_run_lengths parameter from bwd kernels. Reviewed By: jianyuh. Differential Revision: D33380032. fbshipit-source-id: b880cc3745a6f6dd63319109e753a470d6c28c49
* Increase parallelism in batched unary embeddings backward (pytorch#842). Summary: Pull Request resolved: pytorch#842. Sort indices and have each thread handle indices with the same values (called a run in the code). Reviewed By: jianyuh. Differential Revision: D33381126. fbshipit-source-id: aec1c0be619b9072f5a1f9273b66c03e5106ca02
* Use the DISPATCH_TO_CUDA macro (pytorch#845). Summary: Pull Request resolved: pytorch#845. We should use the macro consistently or just drop it. Reviewed By: jianyuh. Differential Revision: D33392682. fbshipit-source-id: bd99286f55fe2d6e5bab231ec65dae02f16f35c2
* Follow-up comments (pytorch#844). Summary: Pull Request resolved: pytorch#844. Reviewed By: jspark1105. Differential Revision: D33393019. fbshipit-source-id: 1df7d8457a950a829f7ff2fe6f47595afdc9cc26
* HIP extension support for FBGEMM_GPU (pytorch#846). Summary: Pull Request resolved: pytorch#846. Reviewed By: jspark1105. Differential Revision: D33231489. fbshipit-source-id: 6bd46ddee45c767ad25c2d52b6c05030bba94082
* Correct the max_shared_bytes evaluation logic in embedding_backward_split_template.cu
* IFU from upstream commit c6df576 to main. fbgemm-gpu is built and imported. Tests do NOT pass.

Co-authored-by: Xing Liu <[email protected]>
Co-authored-by: CodemodService FBSourceClangFormatLinterBot <>
Co-authored-by: Rick Weyrauch <[email protected]>
Co-authored-by: Martin Schatz <[email protected]>
Co-authored-by: Jiyuan Zhang <[email protected]>
Co-authored-by: Jongsoo Park <[email protected]>
Co-authored-by: Jason Park <[email protected]>
Co-authored-by: Stephan Uphoff <[email protected]>
Co-authored-by: Jianyu Huang <[email protected]>
Co-authored-by: Shintaro Iwasaki <[email protected]>
Co-authored-by: Shiyan Deng <[email protected]>
Co-authored-by: Summer Deng <[email protected]>
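The all-to-one device-to-device copy described in pytorch#817 above boils down to issuing every copy asynchronously from its source device and only waiting at the end, so copies from different GPUs overlap instead of being serialized through `Tensor.to`. A simplified sketch of that pattern using the raw CUDA runtime (not the actual fbgemm_gpu operator, which works on IValues and the current PyTorch streams; destination buffers are assumed to be pre-allocated on the target device):

```cpp
#include <cuda_runtime.h>
#include <vector>

struct CopyTask {
  int src_device;
  const void* src;
  void* dst;     // pre-allocated on target_device
  size_t bytes;
};

void all_to_one_copy(const std::vector<CopyTask>& tasks, int target_device) {
  std::vector<cudaStream_t> streams(tasks.size());
  // Launch each device-to-device copy asynchronously from its source device,
  // each on its own non-blocking stream, so copies can overlap.
  for (size_t i = 0; i < tasks.size(); ++i) {
    cudaSetDevice(tasks[i].src_device);
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    cudaMemcpyPeerAsync(
        tasks[i].dst, target_device,
        tasks[i].src, tasks[i].src_device,
        tasks[i].bytes, streams[i]);
  }
  // Wait for all copies before the target device consumes the data.
  for (size_t i = 0; i < tasks.size(); ++i) {
    cudaSetDevice(tasks[i].src_device);
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
  }
}
```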
* Added skipIfRocm and TEST_WITH_ROCM in split_table_batched_embeddings_test.
* Added __any_sync_fbgemm that replaces __any_sync.
* 26 tests ran in split_table_batched_embeddings_test, 10 skipped.
* Renamed __any_sync_fbgemm to __any_sync and changed its implementation to a more generic one.
* Added a 'reason' message for skipIfRocm.
* Enabled use_array_for_index_remapping in test_nbit_forward_int and test_nbit_forward_fp.
* Enabled test_nbit_forward_pruning.
* Deleted 'assert(false)' statements related to the __any_sync function.
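For reference, `__any_sync` is a CUDA warp-vote intrinsic that HIP does not provide, which is why the commits above first introduced `__any_sync_fbgemm` and then a generic `__any_sync` replacement. A minimal sketch of one way such a shim can be written (the actual helper in this branch may differ; the platform guard is illustrative):

```cpp
// "Any lane in `mask` has a true predicate", portable across CUDA and ROCm.
#ifdef __HIP_PLATFORM_AMD__
// HIP has no __any_sync, but __ballot() returns a 64-bit lane mask of the
// wavefront, so the result can be derived by intersecting with `mask`.
__device__ inline bool any_sync(unsigned long long mask, int predicate) {
  return (__ballot(predicate) & mask) != 0ull;
}
#else
__device__ inline bool any_sync(unsigned mask, int predicate) {
  return __any_sync(mask, predicate) != 0;
}
#endif
```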
…y by disabling use_cpu.
…ove @skipIfRocm for TestFused8BitRowwiseQuantizationConversion and TestFusedNBitRowwiseQuantizationConversion
Enable use_cache
… have been deleted in upstream.
Removed post_hipify logic in setup.py
…torch_to_new_commit Pointing hipify_torch to the newer commit.
…PATH in setup.py. (ROCm#19)
* An attempt at matching the upstream setup.py.
* Move hipify() to CMakeLists.txt.
* Remove hipify from the Python script.
* Match upstream setup.py.
* Remove the unnecessary functions and statements in Hip.cmake; reformat some of the compilation option lists in CMakeLists.txt.
* Update hipify_torch (CMake API).
* Add automatic detection for CUDA and ROCm; remove the debug code in embedding_backward_code_generator.py; add 'gfx90a' to FBGEMM_ROCM_ARCH; minor changes to messages and indentation.
* Enable the merge_pooled_embeddings op in ROCm.
* Enabling the merge pool ops.

Co-authored-by: liligwu <[email protected]>
======================================================================
Two test failures:
======================================================================
ERROR: test_generic_histogram_binning_calibration_by_feature (__main__.SparseOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "sparse_ops_test.py", line 1500, in test_generic_histogram_binning_calibration_by_feature
    data_type=st.sampled_from([torch.half, torch.float32]),
  File "/opt/conda/lib/python3.7/site-packages/hypothesis/core.py", line 1220, in wrapped_test
    raise the_error_hypothesis_found
  File "sparse_ops_test.py", line 1543, in test_generic_histogram_binning_calibration_by_feature
    bin_ctr_weight_value=0.9995,
RuntimeError: expected scalar type Long but found Int
----------------------------------------------------------------------
FAIL: test_lxu_cache_lookup (__main__.SplitTableBatchedEmbeddingsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "split_table_batched_embeddings_test.py", line 3994, in test_lxu_cache_lookup
    dtype=torch.int,
AssertionError: False is not true
----------------------------------------------------------------------
Ran 35 tests in 759.368s
FAILED (failures=1)
…CM_ARCH. # Enable building on PyTorch 1.11.
Summary: Pull Request resolved: pytorch#2483 - Migrate backward templates in `codegen/` over to `codegen/training/backward` Reviewed By: spcyppt Differential Revision: D55825078 fbshipit-source-id: 5fdbce85b91cba2bb9b661154bbd52aa1d35a1fd
…`fbcode/deeplearning/fbgemm/fbgemm_gpu` (pytorch#2482) Summary: Pull Request resolved: pytorch#2482 Reviewed By: inseokhwang Differential Revision: D55817659 fbshipit-source-id: 5983e092d351d46b37deddc03f61a3fa3bb20633
Summary: Pull Request resolved: pytorch#2484 - Migrate other code in `codegen/` over to `codegen/training` Reviewed By: spcyppt Differential Revision: D55828101 fbshipit-source-id: 477acabc75687769374ecebd882ac659fa161100
Summary: Pull Request resolved: pytorch#2486 - Migrate headers in `codegen/` over to `include/` Reviewed By: spcyppt Differential Revision: D55896420 fbshipit-source-id: 5b374808bfc9df4b3bce6a30b96dcb47e20f8494
Summary: Pull Request resolved: pytorch#2488 SymInt-ify scheme of batch_index_select as it can be used via torchrec VB path with SymInt Reviewed By: ezyang Differential Revision: D55923134 fbshipit-source-id: b7b69d307c866cf39fad3e6e44a50e24295a0109
Summary: Changes in the Nova framework do not work with our current script. pytorch/test-infra#5071 (comment) Switch to using the release branch of the Nova workflow to guarantee that the workflow is stable throughout the release cycle. Pull Request resolved: pytorch#2489 Reviewed By: q10, huydhn Differential Revision: D55947486 fbshipit-source-id: e12552776bccb443cd1983fc037e8c08715c0749
Summary: Pull Request resolved: pytorch#2490 - Migrate headers in `codegen/` over to `include/` Reviewed By: spcyppt Differential Revision: D55946558 fbshipit-source-id: a714fd5d74b4761810e2afc828298def35c61868
Summary: Pull Request resolved: pytorch#2492 - Migrate `embedding_bounds_check_*` over to `codegen/utils` Reviewed By: spcyppt Differential Revision: D55950626 fbshipit-source-id: bd1931fd7af69c68651e76eac2f75dd231756951
Summary: Pull Request resolved: pytorch#2495 Create an abstract impl for fbgemm operator bounds_check_indices Reviewed By: IvanKobzarev Differential Revision: D56020789 fbshipit-source-id: e5291bb553614a57d405f4d6c4b901c1a03818b4
Summary: Pull Request resolved: pytorch#2493 A [fallback option](pytorch/test-infra@751efc1) has been added and merged for Nova @ main branch. We can safely revert D55947486 to use the main branch. D55947486 was mainly to unblock nightly releases and to be used for the OSS release cycle. Note that the change of [switching to the Nova release branch](pytorch#2489 (comment)) had already been cherry-picked into the release branch (v0.7.0-release). Reviewed By: q10 Differential Revision: D55993279 fbshipit-source-id: 56161076e6d3f52dccb0be3f4cddd0efea0f6668
Summary: - Update build scripts to enable building fbgemm_gpu under `python -m build` instead of `python setup.py bdist_wheel` Pull Request resolved: pytorch#2491 Reviewed By: spcyppt Differential Revision: D56004117 Pulled By: q10 fbshipit-source-id: 1f84a89d277c4781fc478514aedb678a6b24dab8
Summary: Previously this code conformed to clang-format 12. Reviewed By: igorsugak Differential Revision: D56065247 fbshipit-source-id: f5a985dd8f8b84f2f9e1818b3719b43c5a1b05b3
Summary: Pull Request resolved: pytorch#2499 Reviewed By: connernilsen Differential Revision: D56084668 fbshipit-source-id: 7d3a971735285d69e162f9e3c31aa7a6e0792d2e
Summary: Pull Request resolved: pytorch#2502 Fix bounds_check_indices opcheck test error:
```
torch/_subclasses/fake_tensor.py:844] TypeError: bounds_check_indices_abstract() takes from 5 to 6 positional arguments but 8 were given
```
Reviewed By: sryap Differential Revision: D56151143 fbshipit-source-id: a150581d14a458a1791bb24a2fa202c904c43c4e
Summary: Pull Request resolved: pytorch#2503 Suppress OSS linter error. OSS and Internal linter are conflicting regarding import format. Following internal lint suggestions caused OSS lint errors. Reviewed By: sarckk Differential Revision: D56169748 fbshipit-source-id: 34bdca0ecc7c0975b853d6fc7278639f0a79da2c
… pooled block_bucketize_sparse_features (pytorch#2500) Summary: Pull Request resolved: pytorch#2500 If shared memory size per thread block required for `_block_bucketize_pooled_sparse_features_cuda_kernel2` exceeds 48KB, this kernel launch will fail as this is the default shared memory size limit. To support larger bucket sizes (for large world size), we set the dynamic shared memory size as 2/3 of total available shared mem size (V100: 64 KB; A100: 96 KB; H100: 144 KB) Reviewed By: sryap Differential Revision: D56127859 fbshipit-source-id: 0f7fd201aa48010d58490038f2a1ebb01aae1a22
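The fix in pytorch#2500 relies on the standard CUDA mechanism for exceeding the 48 KB default: raising the kernel's maximum dynamic shared memory attribute before launch. A hedged sketch with a placeholder kernel (not the real `_block_bucketize_pooled_sparse_features_cuda_kernel2`; the launch configuration is illustrative):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel that uses dynamically sized shared memory.
__global__ void bucketize_kernel_stub(const int* in, int* out, int n) {
  extern __shared__ int smem[];  // sized by the third launch parameter
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  smem[threadIdx.x] = (i < n) ? in[i] : 0;
  __syncthreads();
  if (i < n) {
    out[i] = smem[threadIdx.x];
  }
}

void launch_with_large_smem(const int* in, int* out, int n, int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);

  // Use two thirds of the per-block opt-in shared memory limit, mirroring the
  // 2/3 heuristic described in the commit above.
  size_t smem_bytes = (prop.sharedMemPerBlockOptin * 2) / 3;

  // Opt in to more than the 48 KB default dynamic shared memory.
  cudaFuncSetAttribute(
      bucketize_kernel_stub,
      cudaFuncAttributeMaxDynamicSharedMemorySize,
      static_cast<int>(smem_bytes));

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  bucketize_kernel_stub<<<blocks, threads, smem_bytes>>>(in, out, n);
}
```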
Summary: Pull Request resolved: pytorch#2505 A few changes to TBE and its benchmark: * inplace fill_random_weights * expose uvm_host_mapped option to benchmark * fix requests gen error with reuse when L > 1 Reviewed By: sryap Differential Revision: D56081004 fbshipit-source-id: c5b75c330933cde7cc46409c6b1418be5cf431e4
Summary: Pull Request resolved: pytorch#2509 Previously, some TBE optimizer unit tests hung indefinitely causing the unit tests to timeout. We were able to reproduce this problem consistently by using the config in D50612178 (composed by ezyang). The main characteristics of this config are (1) the optimizer is PARTIAL_ROWWISE_ADAM, (2) the embedding dimension is less than 32, and (3) it contains long segments (i.e., some indices are repeated with extremely high counts). Upon our investigation, we identified that the value reduction in PARTIAL_ROWWISE_ADAM was implemented incorrectly. The optimizer intended to perform a value reduction within a sub-warp (i.e., a group of threads in a warp) instead of an entire warp. (Note that sub-warp reduction is done when the embedding dimension is smaller than the warp size). However, it did not pass a correct `shfl_sync` mask. The wrong mask expected an entire warp to perform the reduction. When the segment length is long (> 32), only one sub-warp would perform the reduction. Such the warp divergence caused the kernel execution to freeze. (Note that the reduction is a collective operation). This diff fixes the issue by passing a correct mask when invoking the reduction function. Reviewed By: shintaro-iwasaki Differential Revision: D56223375 fbshipit-source-id: 3ca62f1f02ba6254bb4a716a775974f35d91d78f
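To make the D56223375 fix concrete, the sketch below shows a sub-warp sum reduction where the `shfl_sync` mask covers only the participating group of lanes; passing a full-warp mask here is exactly the kind of mismatch the diff describes. This is illustrative code, not the fbgemm_gpu optimizer kernel, and it assumes `group_size` is a power of two no larger than 32:

```cpp
// Sum `v` across the caller's group of `group_size` consecutive lanes.
__device__ inline float subwarp_sum(float v, int group_size) {
  const unsigned lane = threadIdx.x % 32;
  const unsigned group_start = (lane / group_size) * group_size;
  // Mask covering only this thread's sub-warp; using 0xffffffff instead would
  // expect lanes that never reach this shuffle, which can hang the warp.
  const unsigned mask =
      ((group_size == 32) ? 0xffffffffu : ((1u << group_size) - 1u))
      << group_start;
  for (int offset = group_size / 2; offset > 0; offset /= 2) {
    v += __shfl_down_sync(mask, v, offset, group_size);
  }
  return v;
}
```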
Summary: - Refactor CMakeLists.txt to support building experimental targets - Add example folder to demonstrate building an experimental target Pull Request resolved: pytorch#2501 Reviewed By: spcyppt Differential Revision: D56265492 Pulled By: q10 fbshipit-source-id: c1b3778d78fda528bbe79b29ee2b44e7e1411a1c
…rch#2510) Summary: Pull Request resolved: pytorch#2510 Since the value of `num_dense_output_rows` has already been computed in forward, directly save this number (not optional) in the context and reuse it to avoid D2H syncs in backward. Reviewed By: sryap Differential Revision: D56212536 fbshipit-source-id: 3008745310f9056192dac417c775b553428f9bfc
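A minimal sketch of the pattern from pytorch#2510, assuming a toy operator rather than the real jagged-tensor op: the row count known at forward time is stored as a plain host integer in the autograd context, so backward never has to read it back from a device tensor (which would force a device-to-host sync). The operator name and shapes here are hypothetical:

```cpp
#include <torch/torch.h>

using torch::autograd::AutogradContext;
using torch::autograd::variable_list;

// Toy op: returns the first `num_dense_output_rows` rows of a 2-D input.
struct DenseRowsOp : public torch::autograd::Function<DenseRowsOp> {
  static torch::Tensor forward(
      AutogradContext* ctx,
      const torch::Tensor& input,
      int64_t num_dense_output_rows) {
    // Save host-side scalars directly (not optional, not a device tensor).
    ctx->saved_data["num_dense_output_rows"] = num_dense_output_rows;
    ctx->saved_data["num_input_rows"] = input.size(0);
    return input.narrow(0, 0, num_dense_output_rows).clone();
  }

  static variable_list backward(AutogradContext* ctx, variable_list grad_outputs) {
    const int64_t num_dense = ctx->saved_data["num_dense_output_rows"].toInt();
    const int64_t num_rows = ctx->saved_data["num_input_rows"].toInt();
    // Backward sizes its buffers from the saved scalars with no D2H copy.
    auto grad_input = torch::zeros(
        {num_rows, grad_outputs[0].size(1)}, grad_outputs[0].options());
    grad_input.narrow(0, 0, num_dense).copy_(grad_outputs[0]);
    // One gradient per forward input; the integer argument gets none.
    return {grad_input, torch::Tensor()};
  }
};
```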
Summary: Pull Request resolved: pytorch#2504 This diff open sources `gqa_attn_splitk`, which is a grouped-query attention operator for LLM inference for the decoding phase. The operator supports the following: - BF16 input query and output types - BF16/INT4 KV cache types - 16,384 max context length - Fixed head dimension of 128 - Arbitrary number of query heads - Fixed number of KV head of 1 The INT4 KV `gqa_attn_splitk` is ~1.7x faster than the BF16 KV counterpart. Reviewed By: jianyuh, amylittleyang Differential Revision: D56110657 fbshipit-source-id: 26055b2a37fbd71851a6ba206fe8ad1b00737328
IFU 2024-04-19
Account for the 32-wide warp size on RDNA. Currently some of the backward unit tests still fail.