FBGEMM Ifu 2023 10 26 #47
Merged
Summary: Pull Request resolved: pytorch#1977 As titled. This is helpful for code debugging. Reviewed By: q10 Differential Revision: D48727799 fbshipit-source-id: 1820b9dc18909600a1586cd2ab5789ab53a7d6d2
Summary: Pull Request resolved: pytorch#1973 - Remove debug_synchronous from CUB call sites in FBGEMM ops Reviewed By: sryap Differential Revision: D48722495 fbshipit-source-id: 47a92dc82e9fe271d719913f8a842f7fa2c8f36f
Summary: Pull Request resolved: pytorch#1990 As titled. Reviewed By: jspark1105 Differential Revision: D48927429 fbshipit-source-id: dc5c11471ba2bb8bbc330d25b49c839deb14ac5c
Summary: Pull Request resolved: pytorch#1992 Preparing for reducing cache size work. Reviewed By: jspark1105 Differential Revision: D48940544 fbshipit-source-id: e0047c784918d886a935f2ec059a05fbc5064e6f
Summary: Pull Request resolved: pytorch#1951 D40518654 introduced `uvm_cache_stats` to provide cache metrics for the FBGEMM 32-way cache. This diff expands its usage to also provide cache metrics for the direct mapped cache. Reviewed By: sryap, doehyun Differential Revision: D48023956 fbshipit-source-id: d183cf39ee848e9cb4e27686b7f45ca0a162370b
…orch#1999) Summary: Pull Request resolved: pytorch#1999 Hypothesis version 6.83.2 onwards introduces `HealthCheck.differing_executors`, which causes tests in `permute_pooled_embedding_test.py` to fail with the error: `The method PooledEmbeddingModulesTest.setUp was called from multiple different executors. This may lead to flaky tests and nonreproducible errors when replaying from database`. Currently, we're using the latest version of hypothesis on CI: https://github.com/pytorch/FBGEMM/actions/runs/6084855480/job/16515052387 The current hypothesis on FBCode is 6.70.1, which does not have `HealthCheck.differing_executors`. Reviewed By: shintaro-iwasaki Differential Revision: D49020046 fbshipit-source-id: 8ab1350411260c771baf05efe607f91c12df2385
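As a hedged illustration (not the exact patch), the standard way to silence this health check per test is hypothesis' `settings` decorator; since `HealthCheck.differing_executors` only exists from 6.83.2 onward, a `getattr` guard keeps the suppression list valid on older versions such as 6.70.1:

```python
from hypothesis import HealthCheck, given, settings
from hypothesis import strategies as st

# HealthCheck.differing_executors exists only in hypothesis >= 6.83.2;
# getattr keeps this import-safe on older versions such as 6.70.1.
_suppressed = [
    h for h in [getattr(HealthCheck, "differing_executors", None)]
    if h is not None
]

@settings(suppress_health_check=_suppressed)
@given(st.integers())
def test_setup_not_flagged(x: int) -> None:
    assert isinstance(x, int)
```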
Summary: Pull Request resolved: pytorch#1998 - Assemble build script for flexible release testing Reviewed By: spcyppt Differential Revision: D49011206 fbshipit-source-id: bb08e76803ac6f9e6d644fd04db9f24a25e0fce1
Summary: …workflows - Include the option to disable publishing to PyPI when running the Release and Nightly OSS workflows Pull Request resolved: pytorch#2003 Reviewed By: spcyppt Differential Revision: D49040359 Pulled By: q10 fbshipit-source-id: 8cede1906e3b827a22522305ce31e9505383c4fd
Summary: Pull Request resolved: pytorch#1994 For the MTIA device type, we have to rely on the CPU fallback, so we let FBGEMM choose the CPU path when the device type == MTIA. Reviewed By: jackm321 Differential Revision: D48809630 fbshipit-source-id: 15bec60be6efe4c8b1ad4f9d46da39ce58e36a40
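The routing described is simple enough to sketch; a minimal illustration under the stated assumption (the helper name is hypothetical, not FBGEMM's actual dispatch code):

```python
import torch

def fallback_device(device: torch.device) -> torch.device:
    # MTIA has no native kernels for these ops, so route them through
    # the CPU implementation; other device types dispatch as usual.
    return torch.device("cpu") if device.type == "mtia" else device
```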
Summary: …workflows, pt. 2 - Fix the logic around the Publish to PyPI toggle to allow for publishing in more scenarios Pull Request resolved: pytorch#2004 Reviewed By: spcyppt Differential Revision: D49042306 Pulled By: q10 fbshipit-source-id: 757fd3db27ed1dab332f20c417a26600a0beef20
Summary: Pull Request resolved: pytorch#1989 - Remove unsupported type dispatch from FBGEMM ops, pt. 1 Reviewed By: sryap Differential Revision: D48895311 fbshipit-source-id: 05a0ee7e3db10bbe720b01457a08b29de6f8afdc
Summary: Pull Request resolved: pytorch#1952 - Implement the Python frontend for the previous diff (split for backward compatibility) - Revise the existing benchmark to use uvm_cache_stats for stats instead of cache_miss_counter. - Implement a unit test of uvm_cache_stats for the direct mapped cache. Reviewed By: doehyun Differential Revision: D48439568 fbshipit-source-id: cc3b36402e8038b44c83d1e701004b68129563d4
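For orientation, a hedged sketch of how such counters translate into a hit rate; the stats layout in the comment is an assumption for illustration, not a documented contract:

```python
def cache_hit_rate(uvm_cache_stats) -> float:
    # Assumed layout (illustrative only): [n_calls, n_requested_indices,
    # n_unique_indices, n_unique_misses, n_conflict_unique_misses,
    # n_conflict_misses]. Hit rate = 1 - misses / requested.
    requested = float(uvm_cache_stats[1])
    misses = float(uvm_cache_stats[3])
    return 1.0 - misses / requested if requested else 1.0
```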
Summary: Pull Request resolved: pytorch#1939 This diff builds on top of the previous diff and adds support for permute_duplicate_pooled_embeddings for CPU. # Background Currently permute_pooled_embs_gpu does not support duplicates in a permutation, which poses a problem when passing the same embeddings to multiple modules. This doc proposes a solution to allow duplicate subsets in the resultant permutation. # Details The required implementation of permute_duplicate_pooled_embs_gpu should support a subset being repeated. This is represented by having duplicates in the permute list. This also results in the output list size being greater than the input list. Input: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] Offset_dims: [0, 2, 5, 6, 10] Permute: [3, 0, 2, 1, 3] Output: [6, 7, 8, 9, 0, 1, 5, 2, 3, 4, 6, 7, 8, 9] Reviewed By: sryap Differential Revision: D48305145 fbshipit-source-id: e308338f105eb95f8066f554b8143dfb4524c7e6
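The worked example above can be checked with a small pure-Python reference (the helper is hypothetical; the real op operates on tensors, with offsets in feature dimensions):

```python
def permute_duplicate_pooled(values, offset_dims, permute):
    """Reference semantics: segment i is values[offset_dims[i]:offset_dims[i+1]];
    segments may appear more than once in `permute`, so the output can be
    longer than the input."""
    out = []
    for seg in permute:
        out.extend(values[offset_dims[seg]:offset_dims[seg + 1]])
    return out

values = list(range(10))              # [0, 1, ..., 9]
offset_dims = [0, 2, 5, 6, 10]
permute = [3, 0, 2, 1, 3]
assert permute_duplicate_pooled(values, offset_dims, permute) == \
    [6, 7, 8, 9, 0, 1, 5, 2, 3, 4, 6, 7, 8, 9]
```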
Summary: Pull Request resolved: pytorch#1993 Fix an invalid CUDA configuration error for empty input KJTs. When the input KJT is an empty tensor, the total number of indices is zero. In this case, the jagged_unique_indices op launches 0 thread blocks for the `delinearize_unique_index` kernel (div_round_up(total_indices, kMaxThreads) is 0), which results in CUDA errors. Reviewed By: sryap Differential Revision: D48978631 fbshipit-source-id: f21da7005a7b66427a9e8a607be244d8c2ac5583
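The standard fix for this class of bug is a host-side early return before the launch configuration is computed; a hedged sketch of the pattern (names hypothetical):

```python
K_MAX_THREADS = 1024  # stand-in for kMaxThreads

def div_round_up(a: int, b: int) -> int:
    return (a + b - 1) // b

def launch_delinearize_unique_index(total_indices: int) -> None:
    # A grid of 0 blocks is an invalid CUDA launch configuration, so
    # short-circuit on empty input before computing grid dimensions.
    if total_indices == 0:
        return
    num_blocks = div_round_up(total_indices, K_MAX_THREADS)
    assert num_blocks >= 1
    # ... launch the kernel over `num_blocks` blocks ...
```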
…rate indices and offsets (pytorch#1882) Summary: Pull Request resolved: pytorch#1882 Reviewed By: archishman Differential Revision: D47448915 fbshipit-source-id: 2e98a3ebb4e3a42c3e41be53142fcbac3734dada
Summary: Pull Request resolved: pytorch#2007 Fix the nightly publishing logic. Currently it attempts to publish all versions for the scheduled run. https://github.com/pytorch/FBGEMM/actions/runs/6121960197/job/16617350408 Reviewed By: sryap Differential Revision: D49106347 fbshipit-source-id: 29ecd25f1f2a1835941c59123cf328e48ef806ba
Summary: - Add CUDA version selection on artifact publishing - Add new workflow file for testing installations of FBGEMM_GPU through pip Pull Request resolved: pytorch#2005 Reviewed By: shintaro-iwasaki Differential Revision: D49084679 Pulled By: q10 fbshipit-source-id: 2406734aebef9d998e06c733bdb7377ea30bcd61
Summary: Pull Request resolved: pytorch#2008 Fix the nightly publishing logic. Currently it attempts to publish all versions for the scheduled run. https://github.com/pytorch/FBGEMM/actions/runs/6121960197/job/16617350408 Reviewed By: q10 Differential Revision: D49117790 fbshipit-source-id: 78a10703cba494d2691b43e3223e94ba3ac99989
Summary: - Auto-generate the version file in OSS, so that the `__version__` symbol is available in the package Pull Request resolved: pytorch#2011 Reviewed By: shintaro-iwasaki Differential Revision: D49174509 Pulled By: q10 fbshipit-source-id: d0dfd1ab0a2912016ad6e003bf88feeab696fa4a
Summary: - Add network connectivity checks to tell the user to re-run the command with `with-proxy` if needed Pull Request resolved: pytorch#2014 Reviewed By: sryap Differential Revision: D49256971 Pulled By: q10 fbshipit-source-id: 6d14d348d389d9a244b50de66edc39030fbb7beb
…ion (pytorch#2015) Summary: Pull Request resolved: pytorch#2015 The reference implementation of FP8 quantization is in Python, but the actual implementation is in C++/CUDA. Per summerdengfb's investigation, Python has a known floating point representation issue (https://www.geeksforgeeks.org/floating-point-error-in-python/). This could cause quantization result discrepancies. To work around this issue, we allow a 1-bit difference in the FP8 quantization result (LSB of mantissa) in `TestFused8BitRowwiseQuantizationConversion`. Reviewed By: q10, shintaro-iwasaki Differential Revision: D49255499 fbshipit-source-id: b28294f8076bda61589e10699119375f03b091a8
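Comparing raw quantized codes rather than dequantized floats makes the one-LSB tolerance easy to state; a minimal NumPy sketch (the helper name is hypothetical):

```python
import numpy as np

def assert_fp8_close(reference: np.ndarray, actual: np.ndarray) -> None:
    # Both arrays hold raw uint8 FP8 codes. For same-sign values the code
    # space is monotonic in value, so a difference of 1 corresponds to one
    # LSB of mantissa -- the discrepancy the test now tolerates.
    diff = np.abs(reference.astype(np.int16) - actual.astype(np.int16))
    assert np.all(diff <= 1), "FP8 codes differ by more than 1 LSB"
```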
…ytorch#2016) Summary: Pull Request resolved: pytorch#2016 To alleviate CUDA version mismatch issues, we aim to publish fbgemm-gpu-nightly with different CUDA versions. This diff uses the Nova workflow and will host the published wheels at the PyTorch site instead. pytorch#1947 Reviewed By: q10 Differential Revision: D49258503 fbshipit-source-id: a06d095b0c03df62d8cea8fb8db1b5018c9a9dd7
Summary: Pull Request resolved: pytorch#2000 We split the diff after adding a needed lazy CUDA init call in the enable-p2p-access function. Diff 1: D48939723 [PyTorch] Add the lazy init call for the p2p access function *Prior context* cudaEnablePeerAccess only enables cross-device access for memory allocated with cudaMalloc. When using other CUDA APIs such as cuMemMap, peer access is managed differently. expandable_segments:True in PyTorch uses cuMemMap, so code that just calls cudaEnablePeerAccess is not sufficient to enable cross-device copies. This patch switches the p2p access enabling functions to use PyTorch's `get_p2p_access`, which lets its allocator figure out how to correctly enable p2p access for that memory. In the normal case (expandable_segments:False), this code performs exactly the same CUDA calls as before. Reviewed By: zdevito Differential Revision: D49021817 fbshipit-source-id: 7ffb4b477b1d1cddccc891dd9fc8f9a2a986585e
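From Python, only the hardware-level peer capability is publicly queryable; a hedged sketch using the public API (the allocator-aware `get_p2p_access` path is internal to PyTorch):

```python
import torch

def p2p_capable(dev_a: int, dev_b: int) -> bool:
    # Hardware-level peer capability only. With expandable_segments:True,
    # memory comes from cuMemMap rather than cudaMalloc, so the caching
    # allocator must additionally map those pages for the peer device --
    # which is what routing through PyTorch's internal get_p2p_access does.
    if torch.cuda.device_count() <= max(dev_a, dev_b):
        return False
    return torch.cuda.can_device_access_peer(dev_a, dev_b)
```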
Summary: Pull Request resolved: pytorch#2020 Update the Nova workflow to be triggered on pushes to the main and release branches and tags. Update the wheel version name to match the PyTorch convention (e.g., +cpu, +cu118). Reviewed By: q10 Differential Revision: D49296399 fbshipit-source-id: e4ed56c12dfb2f68d5b131b29c8ce0294e2a4522
Summary: Pull Request resolved: pytorch#2019 As titled. Reviewed By: jianyuh Differential Revision: D49296564 fbshipit-source-id: 442c13567cb7aa8de8c208c2ee1fb2ae550a8969
Summary: Pull Request resolved: pytorch#2018 As titled. Reviewed By: jianyuh, henrylhtsang, edqwerty10 Differential Revision: D49295738 fbshipit-source-id: 45524d8e220ba6b686a99d201e24c6a3d839aed7
Summary: Pull Request resolved: pytorch#2023 PyPI does not accept versions specified with `+` (e.g., `+cpu` or `+cu118`). Reviewed By: sryap Differential Revision: D49329679 fbshipit-source-id: 1bc0c81ae15fdf157b54b2ca9226d67dacaade8e
…#2022) Summary: Pull Request resolved: pytorch#2022 Pull Request resolved: pytorch#2021 Nova creates conda environments with `--prefix`, where the environments are outside of the default `envs` folder. The current script only works with `--name`. This diff changes the scripts to accommodate a conda prefix. https://github.com/pytorch/FBGEMM/actions/runs/6189731547 Reviewed By: spcyppt Differential Revision: D49306286 fbshipit-source-id: 2f2ca00645526639369de3f555dbab30da56739e
Summary: Pull Request resolved: pytorch#2024 Add arm builds through Nova to be hosted on pytorch.org Reviewed By: osalpekar Differential Revision: D49333436 fbshipit-source-id: 5d8c192afe10db5ed9b5c6bb35f823d66794e81a
…ch#2009) Summary: Pull Request resolved: pytorch#2009 For the case of loading FP8 tables with 128B * 8 < dimension <= 128B * 16, the numbers for (MinNum128BRows, MaxNum128BRows) should be (8, 16) and not (4, 8). Because of this bug, FP8 tables with dimensions in that range don't get properly loaded upon codegen_forward. I wrote a little test to show this deterministically; I will remove the test before landing if it is too specific. Reviewed By: jianyuh Differential Revision: D49139649 fbshipit-source-id: 300e844885e7a0f8e6e8193f03f135ceb76def28
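The bucketing rule can be sketched as follows; the boundary list is illustrative, assumed from the (4, 8) and (8, 16) pairs named above rather than taken from the codegen:

```python
import math

# Illustrative (MinNum128BRows, MaxNum128BRows) buckets: a row spanning
# N 128-byte blocks belongs to the bucket with min < N <= max.
BUCKETS = [(0, 1), (1, 2), (2, 4), (4, 8), (8, 16)]

def pick_bucket(row_bytes: int) -> tuple:
    n = math.ceil(row_bytes / 128)
    for lo, hi in BUCKETS:
        if lo < n <= hi:
            return (lo, hi)
    raise ValueError(f"row of {row_bytes} bytes exceeds the supported buckets")

# A row between 8*128 and 16*128 bytes must map to (8, 16), not (4, 8).
assert pick_bucket(128 * 12) == (8, 16)
```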
Summary: Pull Request resolved: pytorch#2064 Reviewed By: spcyppt Differential Revision: D49978436 Pulled By: q10 fbshipit-source-id: 74ec5bcc121b79774e20e14480a624e35f28f1e7
…t/sparse_ops_test.py (pytorch#2050) Summary: Pull Request resolved: pytorch#2050 Reviewed By: zou3519 Differential Revision: D49736391 fbshipit-source-id: c0f7f66cef89365325726805a30cdfe232721af6
Summary: Pull Request resolved: pytorch#2046 **Replace `memset` with vector store for the L=0 case (L = pooling factor)** `memset` causes extra byte writes, which results in lower performance. The compiler seems not to be aware that the memory is aligned with the cache line when using `memset`. **Add a short-circuit path for writing zeros when every bag in a feature has L=0** TBE v2 does an expensive explicit register-to-shared-memory spill before performing the lookup. For L=0, these registers can be bypassed because there is no lookup to be performed. We bypass the register spill if all Ls are zeros. Reviewed By: jasonjk-park Differential Revision: D49605180 fbshipit-source-id: 66d5886a28bb8c75aa446f08ac1b165d6c298de6
Summary: Pull Request resolved: pytorch#2051 Disable the subwarp optimization for weighted TBE - The cause of the NE issue has to be investigated. - This does not affect the overall E2E performance much, since the majority of TBEs in the model are unweighted. Reviewed By: jasonjk-park Differential Revision: D49159119 fbshipit-source-id: 805639d94d4ce7b3be8f275db4dfd0ecc95a539a
…ytorch#2061) Summary: Pull Request resolved: pytorch#2061 Before this diff, the `keyed_jagged_index_select_dim1` kernels took raw pointers as arguments, which requires the input tensors to be contiguous. However, the `keyed_jagged_index_select_dim1` operator did not ensure that the tensors were contiguous before extracting and passing the raw pointers to the kernels, causing a correctness issue. This diff replaces the raw pointer arguments with PyTorch's `PackedTensorAccessor`, which handles non-contiguous tensor accesses automatically. For the tensors whose raw pointers are still used, the operator now makes them contiguous before use. Reviewed By: jasonjk-park, venkatrsrinivas Differential Revision: D49939713 fbshipit-source-id: e941379eeda65fc998be6f506b0e467fe74a48a5
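The underlying hazard is easy to reproduce from Python: a view shares storage with its base, but a linear walk over its raw pointer sees the base layout, which is why `.contiguous()` must precede any `data_ptr()` hand-off. A minimal illustration:

```python
import torch

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)
y = x.t()                      # non-contiguous view: strides (1, 3)
assert not y.is_contiguous()

# A kernel walking y's raw pointer linearly would see x's layout, i.e.
# [0, 1, 2, 3, 4, 5] instead of y's logical order [0, 3, 1, 4, 2, 5].
flat_by_pointer = x.flatten()
flat_logical = y.contiguous().flatten()
assert not torch.equal(flat_by_pointer, flat_logical)
```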
Summary: Pull Request resolved: pytorch#1881 - Split up split_embeddings_cache_cuda.cu Reviewed By: sryap, spcyppt Differential Revision: D47491160 fbshipit-source-id: 9c025f06dbc003fe8734d64ae5a102b1a37fd5a5
Summary: - Add retries to Python tests with delays in between retries Pull Request resolved: pytorch#2071 Reviewed By: spcyppt Differential Revision: D50207135 Pulled By: q10 fbshipit-source-id: 254a541f3007890ea6e11890552d52b6e6e44932
…ch#2069) Summary: Pull Request resolved: pytorch#2069 Add a meta function for the fbgemm::merge_pooled_embeddings operator to enable PT2 export of models containing this operator. Reviewed By: khabinov Differential Revision: D49709130 fbshipit-source-id: 7c9bee0a82332f6ba78938d50c8bfd138051649f
…#2070) Summary: Pull Request resolved: pytorch#2070 This diff adds autogenerated opcheck tests to the jagged tensor operator tests. We skip some torch.compile and gradcheck tests because those take an unreasonable amount of time to run. Reviewed By: ezyang Differential Revision: D50129561 fbshipit-source-id: e03595488a7162c9cdff4a70ff06bc3d5ebad521
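The autogenerated harness referenced here is internal, but newer PyTorch releases expose the same battery of checks as `torch.library.opcheck`; a hedged sketch against a stock operator:

```python
import torch

# opcheck validates an operator's registrations (schema, autograd,
# fake/meta kernels) against a concrete sample invocation.
sample_args = (torch.randn(8),)
torch.library.opcheck(torch.ops.aten.sin.default, sample_args)
```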
…s_test.py Differential Revision: D50129561 Original commit changeset: e03595488a71 Original Phabricator Diff: D50129561 fbshipit-source-id: 1b7ccc2c8e781cfdc635eeb7d83129335f071ce1
Summary: Pull Request resolved: pytorch#2072 Same as title. Follow the same pattern as `asynchronous_complete_cumsum` in https://www.internalfb.com/diff/D47834850. Reviewed By: ezyang Differential Revision: D50195064 fbshipit-source-id: c2506688ef01c7e56392f05bc1d0d7ee9a856cab
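A meta ("fake") kernel for such an op only needs to produce an empty tensor of the right shape and dtype; a hedged sketch of the registration pattern for the 1-D case (it presupposes `fbgemm_gpu` is imported so the `fbgemm` namespace and operator exist, and that no Meta kernel is registered yet):

```python
import torch
from torch.library import Library

# IMPL-only library entry targeting the Meta dispatch key.
_meta_lib = Library("fbgemm", "IMPL", "Meta")

def asynchronous_complete_cumsum_meta(t_in: torch.Tensor) -> torch.Tensor:
    # A complete cumsum of a length-N vector has N + 1 entries (a leading
    # zero plus the running totals); only shape/dtype matter under Meta.
    return t_in.new_empty(t_in.numel() + 1)

_meta_lib.impl("asynchronous_complete_cumsum", asynchronous_complete_cumsum_meta)
```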
Summary: Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in a `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`). Factor out common code from `RadixSortTest.cc` into `test_template` and add a regression test for the buffer overflow, which fails as follows before the fix:
```
[ RUN      ] cpuKernelTest.raidx_sort_heap_overflow
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure
Expected equality of these values:
  expected_keys
    Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 }
  keys
    Which is: { -1, -1, -1, -1, -1, -1, -1, -1 }
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure
Expected equality of these values:
  expected_values
    Which is: { 1, 4, 6, 7, 8, 2, 3, 5 }
  values
    Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 }
[  FAILED  ] cpuKernelTest.raidx_sort_heap_overflow (0 ms)
```
Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version. Pull Request resolved: pytorch#2075 Reviewed By: kit1980, jianyuh Differential Revision: D50256504 Pulled By: malfet fbshipit-source-id: f805607595e324999cea07dcacdee8317a008221
…#2073) Summary: Pull Request resolved: pytorch#2073 This PR turns on automatically generated opcheck tests for jagged_tensor_ops_test.py. I reverted the original one because the fbgemm OSS CI failures were real. Reviewed By: ezyang Differential Revision: D50241452 fbshipit-source-id: 98192cbbb7d0e4e72c3cafd3c85ab6ad3a166da1
Summary: Pull Request resolved: pytorch#2078 Check https://fb.workplace.com/groups/957614988799577/permalink/995501075010968/ Similar to D29056288. Didn't know `split_embedding_weights_with_scale_bias` was used in the forward path in TorchScript. Reviewed By: q10, mengyingdu Differential Revision: D50474951 fbshipit-source-id: d043bcad27a381eef90ddde5a58eb890a983a475
Summary: Pull Request resolved: pytorch#2079 This diff shuffles the code to allow `batch_index_select_dim0` to get total D from `D_offsets` when it is still in the host memory. Reviewed By: q10, bigning Differential Revision: D50477287 fbshipit-source-id: 715de7711e052baf27e417a93f9a4267697bcdbb
Summary: Pull Request resolved: pytorch#2083 - Split up lfu_cache.cu for easier maintenance Reviewed By: spcyppt Differential Revision: D50522858 fbshipit-source-id: caa99dc973cabdb4fd61e85e23e3de1db568d788
…permute_1D_sparse_data (pytorch#2084) Summary: This also fixes a minor bug in GPU permute_1D_sparse_data where we need to clone the zero-size tensors to correctly set up the (lack of) aliasing. Pull Request resolved: pytorch#2084 Reviewed By: sryap Differential Revision: D50563192 fbshipit-source-id: 1dc31580c54d8a0dfd3aadaf9b440636fd1a8550
Summary: Pull Request resolved: pytorch#2082 See previous diff. Reviewed By: zou3519 Differential Revision: D50509467 fbshipit-source-id: 7b4aab969ad4736e203c114cd13f379d05284501
Summary: Pull Request resolved: pytorch#2088 ROCm 6.0 introduces backwards-incompatible changes such as removing the long-deprecated `__HIP_PLATFORM_HCC__`. It is better to use the `USE_ROCM` macro, which is already defined and indicates a ROCm build. This PR also defines `__HIP_PLATFORM_AMD__`, the new symbol name; it is still required when compiling with HIP headers but not using hip-clang. Pull Request resolved: pytorch#2086 Reviewed By: sryap Differential Revision: D50580075 Pulled By: q10 fbshipit-source-id: 6ceaa8bb36d64f2665001f9fb8afe2d4e431acb7
…window Reviewed By: zou3519 Differential Revision: D50610665 fbshipit-source-id: 53bff35c79ae11634380d967e737bdf00ab2015e
Summary: Pull Request resolved: pytorch#2085 - Migrate split embeddings cache ops registrations into the split_embeddings_cache directory Reviewed By: spcyppt Differential Revision: D50569476 fbshipit-source-id: 838efdf376abfd6f07104a05d59343e122038183
Summary: Pull Request resolved: pytorch#2087 Reviewed By: zou3519 Differential Revision: D50584541 fbshipit-source-id: 6adcb833e779ba35464bdb3e4ec0f13fbe514f8a
Summary: Pull Request resolved: pytorch#2090 Reviewed By: zou3519 Differential Revision: D50586828 fbshipit-source-id: 2f92a717877dfaf7b56fbea56df67acc272fd8f5
Summary: Pull Request resolved: pytorch#2089 I'm not entirely sure about some of these (e.g., bin_ctr_in_use_after) but it should be harmless enough to accept symbolic SymInt at all these sites. Reviewed By: zou3519 Differential Revision: D50587015 fbshipit-source-id: 65484b2535753cd2732c80e16b7cc590ace9e723
Summary: Pull Request resolved: pytorch#2092 Reviewed By: zou3519 Differential Revision: D50613728 fbshipit-source-id: 5b3a8d91166a491147c728e5c024de9e118088ef
Summary: Pull Request resolved: pytorch#2095 - Register fake CPU dispatch for split embeddings cache ops Reviewed By: sryap Differential Revision: D50665599 fbshipit-source-id: aca450e2e501e648ad07fb0ed8738ef5ec7e6bec
Summary: Pull Request resolved: pytorch#2097 Reviewed By: zou3519 Differential Revision: D50693694 fbshipit-source-id: 70b719d512a1de6f0ec49aca0fe4e794edb667d7
A new environment variable is required to pass the tests: `export PYTORCH_OPCHECK_ACCEPT=1`
https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1056#issue-1971302275