FBGEMM Ifu 2023 10 26 #47

Merged: 95 commits merged into main on Nov 1, 2023

Conversation

@liligwu (Collaborator) commented Oct 31, 2023

A new environment variable is required to pass the tests: `export PYTORCH_OPCHECK_ACCEPT=1` (see the sketch below).
https://github.com/ROCmSoftwarePlatform/DeepLearningModels/pull/1056#issue-1971302275
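Below is a minimal, hypothetical sketch of running the Python test suite with that variable set; the test path and pytest invocation are assumptions, not this repo's actual CI command.

```python
# Hypothetical runner: enable opcheck "accept" mode, then run the Python tests.
import os

import pytest

os.environ["PYTORCH_OPCHECK_ACCEPT"] = "1"  # equivalent to the exported variable above
raise SystemExit(pytest.main(["fbgemm_gpu/test", "-q"]))
```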

sryap and others added 30 commits August 31, 2023 12:17
Summary:
Pull Request resolved: pytorch#1977

As titled

This is helpful for code debugging.

Reviewed By: q10

Differential Revision: D48727799

fbshipit-source-id: 1820b9dc18909600a1586cd2ab5789ab53a7d6d2

Summary:
Pull Request resolved: pytorch#1973

- Remove debug_synchronous from CUB call sites in FBGEMM ops

Reviewed By: sryap

Differential Revision: D48722495

fbshipit-source-id: 47a92dc82e9fe271d719913f8a842f7fa2c8f36f
Summary:
Pull Request resolved: pytorch#1990

As titled.

Reviewed By: jspark1105

Differential Revision: D48927429

fbshipit-source-id: dc5c11471ba2bb8bbc330d25b49c839deb14ac5c
Summary:
Pull Request resolved: pytorch#1992

Preparing for reducing cache size work.

Reviewed By: jspark1105

Differential Revision: D48940544

fbshipit-source-id: e0047c784918d886a935f2ec059a05fbc5064e6f
Summary:
Pull Request resolved: pytorch#1951

D40518654 introduced `uvm_cache_stats` to provide cache metrics for FBGEMM 32way cache.
This diff expands its usage to also provide cache metrics for direct mapped cache.

Reviewed By: sryap, doehyun

Differential Revision: D48023956

fbshipit-source-id: d183cf39ee848e9cb4e27686b7f45ca0a162370b
…orch#1999)

Summary:
Pull Request resolved: pytorch#1999

Hypothesis version 6.83.2 onwards introduces `HealthCheck.differing_executors`, which causes tests in `permute_pooled_embedding_test.py` to fail with the error:

`The method PooledEmbeddingModulesTest.setUp was called from multiple different executors. This may lead to flaky tests and nonreproducible errors when replaying from database`.

Currently, we're using the latest version of hypothesis on CI:

https://github.com/pytorch/FBGEMM/actions/runs/6084855480/job/16515052387

The current Hypothesis version in fbcode is 6.70.1, which does not have `HealthCheck.differing_executors`.
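As a hedged illustration (this may not be the exact change in the diff), the health check can be suppressed per test via Hypothesis settings once a version that defines it is installed:

```python
# Sketch only: suppress the differing_executors health check on a
# unittest-style Hypothesis test (requires Hypothesis >= 6.83.2).
import unittest

from hypothesis import HealthCheck, given, settings, strategies as st


class PooledEmbeddingModulesTest(unittest.TestCase):
    @given(dims=st.lists(st.integers(1, 8), min_size=1, max_size=4))
    @settings(suppress_health_check=[HealthCheck.differing_executors], deadline=None)
    def test_permute_shapes(self, dims) -> None:
        # Placeholder body; the real tests exercise the permute_pooled_embs ops.
        self.assertEqual(len(dims), len(list(dims)))
```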

Reviewed By: shintaro-iwasaki

Differential Revision: D49020046

fbshipit-source-id: 8ab1350411260c771baf05efe607f91c12df2385
Summary:
Pull Request resolved: pytorch#1998

- Assemble build script for flexible release testing

Reviewed By: spcyppt

Differential Revision: D49011206

fbshipit-source-id: bb08e76803ac6f9e6d644fd04db9f24a25e0fce1
Summary:
…workflows

- Include the option to disable publishing to PyPI when running the Release and Nightly OSS workflows

Pull Request resolved: pytorch#2003

Reviewed By: spcyppt

Differential Revision: D49040359

Pulled By: q10

fbshipit-source-id: 8cede1906e3b827a22522305ce31e9505383c4fd
Summary:
Pull Request resolved: pytorch#1994

For the MTIA device type, we have to rely on the CPU fallback, so we let FBGEMM choose the CPU path when the device type == MTIA.

Reviewed By: jackm321

Differential Revision: D48809630

fbshipit-source-id: 15bec60be6efe4c8b1ad4f9d46da39ce58e36a40
Summary:
…workflows, pt. 2

- Fix the logic around the Publish to PyPI toggle to allow for publishing in more scenarios

Pull Request resolved: pytorch#2004

Reviewed By: spcyppt

Differential Revision: D49042306

Pulled By: q10

fbshipit-source-id: 757fd3db27ed1dab332f20c417a26600a0beef20
Summary:
Pull Request resolved: pytorch#1989

- Remove unsupported type dispatch from FBGEMM ops, pt. 1

Reviewed By: sryap

Differential Revision: D48895311

fbshipit-source-id: 05a0ee7e3db10bbe720b01457a08b29de6f8afdc
Summary:
Pull Request resolved: pytorch#1952

- Implement python frontend for the previous diff (split for backward compatibility)
- Revise the existing benchmark to use the uvm_cache_stats for stats instead of cache_miss_counter.
- Implement unit test for uvm_cache_stats for direct mapped.

Reviewed By: doehyun

Differential Revision: D48439568

fbshipit-source-id: cc3b36402e8038b44c83d1e701004b68129563d4
Summary:
Pull Request resolved: pytorch#1939

This diff builds on top of the previous diff and adds support for permute_duplicate_pooled_embeddings for CPU.

# Background
Currently, permute_pooled_embs_gpu does not support duplicates in a permutation, which poses a problem when passing the same embeddings to multiple modules. This doc proposes a solution that allows duplicate subsets in the resultant permutation.

# Details
The implementation of permute_duplicate_pooled_embs_gpu should support a subset being repeated, which is represented by duplicates in the permute list. This also results in the output list being larger than the input.

Input: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Offset_dims: [0,  2,  5,  6, 10]
Permute: [3, 0, 2, 1, 3]

Output:  [6, 7, 8, 9, 0, 1, 5, 2, 3, 4, 6, 7, 8, 9]
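A plain-Python reference (a sketch, not the CUDA/CPU kernel) that reproduces the sample above by gathering each permuted segment, duplicates included:

```python
# Reference permutation with duplicates: for each entry p in permute, copy the
# segment values[offset_dims[p]:offset_dims[p + 1]] into the output.
def permute_duplicate_pooled_embs_ref(values, offset_dims, permute):
    out = []
    for p in permute:
        out.extend(values[offset_dims[p]:offset_dims[p + 1]])
    return out

values = list(range(10))        # [0, 1, ..., 9]
offset_dims = [0, 2, 5, 6, 10]
permute = [3, 0, 2, 1, 3]
assert permute_duplicate_pooled_embs_ref(values, offset_dims, permute) == [
    6, 7, 8, 9, 0, 1, 5, 2, 3, 4, 6, 7, 8, 9]
```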

Reviewed By: sryap

Differential Revision: D48305145

fbshipit-source-id: e308338f105eb95f8066f554b8143dfb4524c7e6
Summary:
Pull Request resolved: pytorch#1993

Fix invalid CUDA configuration error for the empty input KJT. When the input KJT is an empty Tensor, the total number of indices is zero. In this case, the jagged_unique_indices op launches 0 thread blocks for the `delinearize_unique_index` (div_round_up(total_indices, kMaxThreads) is 0), which results in CUDA errors.
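A minimal sketch of the failure mode (helper names are illustrative, not FBGEMM's actual launcher code):

```python
# With an empty KJT there are zero indices, so the computed grid size is zero;
# launching a kernel with a zero-sized grid is an invalid CUDA configuration,
# which is why the op must early-return for empty inputs instead.
def div_round_up(a: int, b: int) -> int:
    return (a + b - 1) // b

K_MAX_THREADS = 1024
total_indices = 0                                   # empty input KJT
grid = div_round_up(total_indices, K_MAX_THREADS)   # == 0 -> invalid launch config
assert grid == 0
```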

Reviewed By: sryap

Differential Revision: D48978631

fbshipit-source-id: f21da7005a7b66427a9e8a607be244d8c2ac5583
…rate indices and offsets (pytorch#1882)

Summary: Pull Request resolved: pytorch#1882

Reviewed By: archishman

Differential Revision: D47448915

fbshipit-source-id: 2e98a3ebb4e3a42c3e41be53142fcbac3734dada
Summary:
Pull Request resolved: pytorch#2007

Fix nightly publishing logic. Currently, it attempts to publish all versions for the scheduled run.

https://github.com/pytorch/FBGEMM/actions/runs/6121960197/job/16617350408

Reviewed By: sryap

Differential Revision: D49106347

fbshipit-source-id: 29ecd25f1f2a1835941c59123cf328e48ef806ba
Summary:
- Add CUDA version selection on artifact publishing
- Add new workflow file for testing installations of FBGEMM_GPU through pip

Pull Request resolved: pytorch#2005

Reviewed By: shintaro-iwasaki

Differential Revision: D49084679

Pulled By: q10

fbshipit-source-id: 2406734aebef9d998e06c733bdb7377ea30bcd61
Summary:
Pull Request resolved: pytorch#2008

Fix nightly publishing logic. Currently, it attempts to publish all versions for the scheduled run.
https://github.com/pytorch/FBGEMM/actions/runs/6121960197/job/16617350408

Reviewed By: q10

Differential Revision: D49117790

fbshipit-source-id: 78a10703cba494d2691b43e3223e94ba3ac99989
Summary:
- Auto-generate the version file in OSS, so that the `__version__` symbol is available in the package

Pull Request resolved: pytorch#2011

Reviewed By: shintaro-iwasaki

Differential Revision: D49174509

Pulled By: q10

fbshipit-source-id: d0dfd1ab0a2912016ad6e003bf88feeab696fa4a
Summary:
- Add network connectivity checks to tell the user to re-run the command with `with-proxy` if needed

Pull Request resolved: pytorch#2014

Reviewed By: sryap

Differential Revision: D49256971

Pulled By: q10

fbshipit-source-id: 6d14d348d389d9a244b50de66edc39030fbb7beb
…ion (pytorch#2015)

Summary:
Pull Request resolved: pytorch#2015

The reference implementation of FP8 quantization is in Python, but the actual implementation is in C++/CUDA. Per summerdengfb's investigation, Python has a known floating-point representation issue (https://www.geeksforgeeks.org/floating-point-error-in-python/), which can cause discrepancies in the quantization results. To work around this issue, we allow a 1-bit difference (the LSB of the mantissa) in the FP8 quantization result in `TestFused8BitRowwiseQuantizationConversion`.
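A minimal sketch of such a relaxed comparison, assuming both results are returned as uint8-encoded FP8 rows (this mirrors the idea of the check, not its exact code):

```python
import torch

def assert_close_within_one_lsb(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # Compare the raw quantized bytes and allow a difference of at most 1,
    # i.e. the least significant bit of the FP8 mantissa.
    diff = (actual.to(torch.int16) - expected.to(torch.int16)).abs()
    max_diff = int(diff.max()) if diff.numel() else 0
    assert max_diff <= 1, f"max byte difference {max_diff} exceeds 1"
```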

Reviewed By: q10, shintaro-iwasaki

Differential Revision: D49255499

fbshipit-source-id: b28294f8076bda61589e10699119375f03b091a8
…ytorch#2016)

Summary:
Pull Request resolved: pytorch#2016

To alleviate CUDA version mismatch issues, we aim to publish fbgemm-gpu-nightly with different CUDA versions. This diff uses Nova workflow and will host the published wheels at PyTorch site instead.
pytorch#1947

Reviewed By: q10

Differential Revision: D49258503

fbshipit-source-id: a06d095b0c03df62d8cea8fb8db1b5018c9a9dd7
Summary:
Pull Request resolved: pytorch#2000

We split the diff after adding the needed lazy CUDA init call in the enable-p2p-access function.

Diff 1: D48939723
[PyTorch] Add the lazy init call for p2p access function

*Prior context*
cudaEnablePeerAccess only enables cross-device access for memory allocated with cudaMalloc. When using other CUDA APIs such as cuMemMap, peer access is managed differently.
expandable_segments:True in PyTorch uses cuMemMap, so code that just calls cudaEnablePeerAccess is not sufficient to enable cross-device copies. This patch switches the p2p access enabling functions to use PyTorch's `get_p2p_access`, which lets its allocator figure out how to correctly enable p2p access for that memory.

In the normal case (expandable_segments:False), this code performs exactly the same cuda calls as before.

Reviewed By: zdevito

Differential Revision: D49021817

fbshipit-source-id: 7ffb4b477b1d1cddccc891dd9fc8f9a2a986585e
Summary:
Pull Request resolved: pytorch#2020

Update the Nova workflow to be triggered on pushes to the main and release branches and tags. Update the wheel version name to match the PyTorch convention (e.g., +cpu, +cu118).

Reviewed By: q10

Differential Revision: D49296399

fbshipit-source-id: e4ed56c12dfb2f68d5b131b29c8ce0294e2a4522
Summary:
Pull Request resolved: pytorch#2019

As titled

Reviewed By: jianyuh

Differential Revision: D49296564

fbshipit-source-id: 442c13567cb7aa8de8c208c2ee1fb2ae550a8969
Summary:
Pull Request resolved: pytorch#2018

As titled

Reviewed By: jianyuh, henrylhtsang, edqwerty10

Differential Revision: D49295738

fbshipit-source-id: 45524d8e220ba6b686a99d201e24c6a3d839aed7
Summary:
Pull Request resolved: pytorch#2023

PyPI does not accept versions specified with `+` (e.g., `+cpu` or `+cu118`).

Reviewed By: sryap

Differential Revision: D49329679

fbshipit-source-id: 1bc0c81ae15fdf157b54b2ca9226d67dacaade8e
…#2022)

Summary:
Pull Request resolved: pytorch#2022

Pull Request resolved: pytorch#2021

Nova creates conda environments with `--prefix`, where the environments live outside of the default `envs` folder. The current script only works with `--name`. This diff changes the scripts to accommodate a conda prefix.

https://github.com/pytorch/FBGEMM/actions/runs/6189731547

Reviewed By: spcyppt

Differential Revision: D49306286

fbshipit-source-id: 2f2ca00645526639369de3f555dbab30da56739e
Summary:
Pull Request resolved: pytorch#2024

Add arm builds through Nova to be hosted on pytorch.org

Reviewed By: osalpekar

Differential Revision: D49333436

fbshipit-source-id: 5d8c192afe10db5ed9b5c6bb35f823d66794e81a
…ch#2009)

Summary:
Pull Request resolved: pytorch#2009

For the case of loading FP8 tables with 128B * 8 < dimension <= 128B * 16, the numbers for (MinNum128BRows, MaxNum128BRows) should be (8, 16) and not (4, 8). Because of this bug, FP8 tables with dimensions in that range are not properly loaded by codegen_forward. I wrote a small test to demonstrate this deterministically; I will remove the test before landing if it is too specific.
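A worked example of the bucketing arithmetic, under the assumption that FP8 stores one byte per element (row-size bookkeeping such as scale/bias bytes is ignored here):

```python
# A dimension D with 128 * 8 < D <= 128 * 16 spans 9..16 aligned 128-byte
# chunks, so it belongs in the (MinNum128BRows, MaxNum128BRows) = (8, 16)
# bucket rather than (4, 8).
D = 1536                          # example dimension in the affected range
num_128b_rows = -(-D // 128)      # ceil(D bytes / 128 bytes) == 12
assert 128 * 8 < D <= 128 * 16
assert 8 < num_128b_rows <= 16
```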

Reviewed By: jianyuh

Differential Revision: D49139649

fbshipit-source-id: 300e844885e7a0f8e6e8193f03f135ceb76def28
cyyever and others added 28 commits October 5, 2023 22:39
Summary: Pull Request resolved: pytorch#2064

Reviewed By: spcyppt

Differential Revision: D49978436

Pulled By: q10

fbshipit-source-id: 74ec5bcc121b79774e20e14480a624e35f28f1e7
…t/sparse_ops_test.py (pytorch#2050)

Summary: Pull Request resolved: pytorch#2050

Reviewed By: zou3519

Differential Revision: D49736391

fbshipit-source-id: c0f7f66cef89365325726805a30cdfe232721af6
Summary:
Pull Request resolved: pytorch#2046

**Replace `memset` with vector store for the L=0 case (L = pooling factor)**

`memset` causes extra byte-wise writes, which lowers performance. The compiler does not seem to be aware that the memory is cache-line aligned when `memset` is used.

**Add a short circuit path for writing zeros when every bag in a feature
has L=0**

TBE v2 does an expensive explicit register-to-shared-memory spill before performing the lookup. For L=0, these registers can be bypassed because there is no lookup to perform. We bypass the register spill if all Ls are zero.

Reviewed By: jasonjk-park

Differential Revision: D49605180

fbshipit-source-id: 66d5886a28bb8c75aa446f08ac1b165d6c298de6
Summary:
Pull Request resolved: pytorch#2051

Disable the subwarp optimization for weighted TBE

- The cause of the NE issue has to be investigated.
- This does not affect the overall E2E performance much since the majority of TBEs in the model are unweighted

Reviewed By: jasonjk-park

Differential Revision: D49159119

fbshipit-source-id: 805639d94d4ce7b3be8f275db4dfd0ecc95a539a
…ytorch#2061)

Summary:
Pull Request resolved: pytorch#2061

Before this diff, the `keyed_jagged_index_select_dim1` kernels took raw pointers as arguments, which requires the input tensors to be contiguous. However, the `keyed_jagged_index_select_dim1` operator did not ensure that the tensors were contiguous before extracting and passing the raw pointers to the kernels, causing a correctness issue. This diff replaces the raw pointer arguments with PyTorch's `PackedTensorAccessor`, which handles non-contiguous tensor accesses automatically. For the tensors whose raw pointers are still used, the operator now makes sure that they are contiguous before using them.
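An illustrative PyTorch sketch of the underlying hazard (not the operator's code): a kernel walking a raw data pointer assumes the storage order matches the tensor's logical row-major layout, which a transposed view violates; making the tensor contiguous first (or using a packed accessor on the C++ side, as this diff does) restores the expected order.

```python
import torch

base = torch.arange(12, dtype=torch.float32).reshape(3, 4)  # contiguous buffer
x = base.t()                                                 # non-contiguous view, shape (4, 3)
assert not x.is_contiguous()

logical = x.contiguous().view(-1)  # element order a correct kernel should read
storage = base.view(-1)            # order a raw pointer walk over x's storage actually sees
assert not torch.equal(logical, storage)
```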

Reviewed By: jasonjk-park, venkatrsrinivas

Differential Revision: D49939713

fbshipit-source-id: e941379eeda65fc998be6f506b0e467fe74a48a5
Summary:
Pull Request resolved: pytorch#1881

- Split up split_embeddings_cache_cuda.cu

Reviewed By: sryap, spcyppt

Differential Revision: D47491160

fbshipit-source-id: 9c025f06dbc003fe8734d64ae5a102b1a37fd5a5
Summary:
- Add retries to Python tests with delays in between retries

Pull Request resolved: pytorch#2071

Reviewed By: spcyppt

Differential Revision: D50207135

Pulled By: q10

fbshipit-source-id: 254a541f3007890ea6e11890552d52b6e6e44932
…ch#2069)

Summary:
Pull Request resolved: pytorch#2069

Add a meta function for the fbgemm::merge_pooled_embeddings operator to enable PT2 export of models containing this operator.
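For illustration, a hypothetical Python-side sketch of what a meta (shape-only) kernel registration looks like; the actual FBGEMM registration is done in C++, and the operator signature and shape logic below are assumptions rather than the real schema:

```python
import torch
# Assumes fbgemm_gpu is installed and imported so the operator schema exists.

_lib = torch.library.Library("fbgemm", "IMPL")

def merge_pooled_embeddings_meta(pooled_embeddings, uncat_dim_size, target_device, cat_dim=1):
    # On the Meta backend only shapes/dtypes are computed; no data is produced.
    cat_size = sum(t.size(cat_dim) for t in pooled_embeddings)
    shape = [uncat_dim_size, cat_size] if cat_dim == 1 else [cat_size, uncat_dim_size]
    return pooled_embeddings[0].new_empty(shape)

_lib.impl("merge_pooled_embeddings", merge_pooled_embeddings_meta, "Meta")
```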

Reviewed By: khabinov

Differential Revision: D49709130

fbshipit-source-id: 7c9bee0a82332f6ba78938d50c8bfd138051649f
…#2070)

Summary:
Pull Request resolved: pytorch#2070

This diff adds autogenerated opcheck tests to the jagged tensor operator tests.
We skip some torch.compile and gradcheck tests because those take an
unreasonable amount of time to run.

Reviewed By: ezyang

Differential Revision: D50129561

fbshipit-source-id: e03595488a7162c9cdff4a70ff06bc3d5ebad521
…s_test.py

Differential Revision: D50129561

Original commit changeset: e03595488a71

Original Phabricator Diff: D50129561

fbshipit-source-id: 1b7ccc2c8e781cfdc635eeb7d83129335f071ce1
Summary:
Pull Request resolved: pytorch#2072

Same as title.

Follow the same pattern as `asynchronous_complete_cumsum` in https://www.internalfb.com/diff/D47834850.

Reviewed By: ezyang

Differential Revision: D50195064

fbshipit-source-id: c2506688ef01c7e56392f05bc1d0d7ee9a856cab
Summary:
Setting `histogram_ps[RDX_HIST_SIZE * (nthreads - 1) + 127] = offset;` in `combine_prefix_sum_for_msb` is guaranteed to result in a `heap-buffer-overflow` if the bucket is not empty during the scatter stage (as all values of `histogram_ps` should be strictly less than `element_count`).

Factor out common code from `RadixSortTest.cc` into a test template and add a regression test for the buffer overflow, which before the fix fails as follows:
```
[ RUN      ] cpuKernelTest.raidx_sort_heap_overflow
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:36: Failure
Expected equality of these values:
  expected_keys
    Which is: { 2, 3, 5, -1, -1, 2147483647, 2147483647, 2147483647 }
  keys
    Which is: { -1, -1, -1, -1, -1, -1, -1, -1 }
/home/nshulga/git/pytorch/FBGEMM/test/RadixSortTest.cc:37: Failure
Expected equality of these values:
  expected_values
    Which is: { 1, 4, 6, 7, 8, 2, 3, 5 }
  values
    Which is: { 2147483647, 4, 6, 7, 8, 6, 7, 8 }
[  FAILED  ] cpuKernelTest.raidx_sort_heap_overflow (0 ms)
```

Will fix pytorch/pytorch#111189 once FBGEMM is updated to the correct version

Pull Request resolved: pytorch#2075

Reviewed By: kit1980, jianyuh

Differential Revision: D50256504

Pulled By: malfet

fbshipit-source-id: f805607595e324999cea07dcacdee8317a008221
…#2073)

Summary:
Pull Request resolved: pytorch#2073

This PR turns on automatically generated opcheck tests for
jagged_tensor_ops_test.py. I reverted the original one because the fbgemm OSS
CI failures were real.

Reviewed By: ezyang

Differential Revision: D50241452

fbshipit-source-id: 98192cbbb7d0e4e72c3cafd3c85ab6ad3a166da1
Summary:
Pull Request resolved: pytorch#2078

Check https://fb.workplace.com/groups/957614988799577/permalink/995501075010968/

Similar to D29056288.

Didn't know `split_embedding_weights_with_scale_bias` is used in the forward path in TorchScript?

Reviewed By: q10, mengyingdu

Differential Revision: D50474951

fbshipit-source-id: d043bcad27a381eef90ddde5a58eb890a983a475
Summary:
Pull Request resolved: pytorch#2079

This diff shuffles the code to allow `batch_index_select_dim0` to get the total D from `D_offsets` while it is still in host memory.

Reviewed By: q10, bigning

Differential Revision: D50477287

fbshipit-source-id: 715de7711e052baf27e417a93f9a4267697bcdbb
Summary:
Pull Request resolved: pytorch#2083

- Split up lfu_cache.cu for easier maintenance

Reviewed By: spcyppt

Differential Revision: D50522858

fbshipit-source-id: caa99dc973cabdb4fd61e85e23e3de1db568d788
…permute_1D_sparse_data (pytorch#2084)

Summary:
This also fixes a minor bug in GPU permute_1D_sparse_data where we need to clone the zero-size tensors to correctly set up the (lack of) aliasing.

Pull Request resolved: pytorch#2084

Reviewed By: sryap

Differential Revision: D50563192

fbshipit-source-id: 1dc31580c54d8a0dfd3aadaf9b440636fd1a8550
Summary:
Pull Request resolved: pytorch#2082

See previous diff.

Reviewed By: zou3519

Differential Revision: D50509467

fbshipit-source-id: 7b4aab969ad4736e203c114cd13f379d05284501
Summary:
Pull Request resolved: pytorch#2088

ROCm 6.0 introduces backwards-incompatible changes such as removing the long-deprecated `__HIP_PLATFORM_HCC__`. It is better to use the USE_ROCM macro, which is already defined and indicates a ROCm build. This PR also defines `__HIP_PLATFORM_AMD__`, which is the new symbol name. This symbol is still required when compiling with HIP headers without hip-clang.

Pull Request resolved: pytorch#2086

Reviewed By: sryap

Differential Revision: D50580075

Pulled By: q10

fbshipit-source-id: 6ceaa8bb36d64f2665001f9fb8afe2d4e431acb7
…window

Reviewed By: zou3519

Differential Revision: D50610665

fbshipit-source-id: 53bff35c79ae11634380d967e737bdf00ab2015e
Summary:
Pull Request resolved: pytorch#2085

- Migrate split embeddings cache ops registrations into the split_embeddings_cache directory

Reviewed By: spcyppt

Differential Revision: D50569476

fbshipit-source-id: 838efdf376abfd6f07104a05d59343e122038183
Summary: Pull Request resolved: pytorch#2087

Reviewed By: zou3519

Differential Revision: D50584541

fbshipit-source-id: 6adcb833e779ba35464bdb3e4ec0f13fbe514f8a
Summary: Pull Request resolved: pytorch#2090

Reviewed By: zou3519

Differential Revision: D50586828

fbshipit-source-id: 2f92a717877dfaf7b56fbea56df67acc272fd8f5
Summary:
Pull Request resolved: pytorch#2089

I'm not entirely sure about some of these (e.g., bin_ctr_in_use_after) but it should
be harmless enough to accept symbolic SymInt at all these sites.

Reviewed By: zou3519

Differential Revision: D50587015

fbshipit-source-id: 65484b2535753cd2732c80e16b7cc590ace9e723
Summary: Pull Request resolved: pytorch#2092

Reviewed By: zou3519

Differential Revision: D50613728

fbshipit-source-id: 5b3a8d91166a491147c728e5c024de9e118088ef
Summary:
Pull Request resolved: pytorch#2095

- Register fake CPU dispatch for split embeddings cache ops

Reviewed By: sryap

Differential Revision: D50665599

fbshipit-source-id: aca450e2e501e648ad07fb0ed8738ef5ec7e6bec
Summary: Pull Request resolved: pytorch#2097

Reviewed By: zou3519

Differential Revision: D50693694

fbshipit-source-id: 70b719d512a1de6f0ec49aca0fe4e794edb667d7
@liligwu liligwu self-assigned this Oct 31, 2023
@liligwu liligwu merged commit 83f4183 into main Nov 1, 2023
39 of 73 checks passed