Sparse fused gemm integration #12

LucasWilkinson · 2024-02-14T02:14:44Z

Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need to ensure that we compress the weight matrix only once and never decompress it, as decompression is currently unsupported.

Before this change, using SparseParameter(SparseTensor) meant that in MergedColumnParallelLinear and QKVParallelLinear, every time the weight_loader loaded a new shard (e.g., the "q" portion of QKVParallelLinear), we would decompress the tensor in order to use narrow to update the appropriate section of the weight tensor. With this change, SparseParameter(SparseTensor) is replaced with LazyCompressedParameter, which lets us operate on uncompressed_data until we explicitly compress it. At that point, uncompressed_data is compressed into compressed_data and then freed. Currently, the detection of when to call compress is somewhat hacky. For QKVParallelLinear, we compress only after the "q", "k", and "v" shard ids have all been inserted. For MergedColumnParallelLinear, we compress once we have inserted as many shards as there are outputs (determined by len(output_sizes)), which implicitly assumes one shard per output.
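To make the "compress once, after the last shard" idea concrete, here is a minimal toy sketch in plain Python. It is not the code from this PR: the class name LazyCompressedBuffer, the load_shard method, and the (index, value) "compressed" layout are all hypothetical stand-ins, and lists replace tensors so the example runs without any dependencies.

```python
# Toy sketch only: names and the "compressed" format are hypothetical,
# not the actual LazyCompressedParameter implementation.
class LazyCompressedBuffer:
    """Keeps dense data until every expected shard is loaded, then compresses once."""

    def __init__(self, size, expected_shard_ids):
        self.uncompressed_data = [0.0] * size
        self.compressed_data = None
        self._expected = set(expected_shard_ids)
        self._loaded = set()

    def load_shard(self, shard_id, offset, values):
        if self.uncompressed_data is None:
            # Mirrors the constraint that decompression is unsupported.
            raise RuntimeError("already compressed; cannot load more shards")
        # Write the shard into its section of the dense buffer.
        self.uncompressed_data[offset:offset + len(values)] = values
        self._loaded.add(shard_id)
        # Compress only once all expected shards have arrived.
        if self._loaded == self._expected:
            self.compress()

    def compress(self):
        # Keep only nonzero entries as (index, value) pairs, then free the dense copy.
        self.compressed_data = [
            (i, v) for i, v in enumerate(self.uncompressed_data) if v != 0.0
        ]
        self.uncompressed_data = None


# QKV-style usage: compression happens only after "q", "k", and "v" are all loaded.
qkv = LazyCompressedBuffer(size=6, expected_shard_ids=("q", "k", "v"))
qkv.load_shard("q", 0, [1.0, 0.0])
qkv.load_shard("k", 2, [0.0, 2.0])
assert qkv.compressed_data is None  # still waiting for "v"
qkv.load_shard("v", 4, [0.0, 3.0])
```

For the merged-column case, the same sketch would take expected_shard_ids=range(len(output_sizes)), which makes the one-shard-per-output assumption explicit.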

Moving away from SparseParameter(SparseTensor) means that SparseTensor no longer handles dispatching to the custom ops; instead, this is handled by SparseW16A16LinearMethod. I believe this is a positive change overall. SparseTensor was an unnecessary extra layer of abstraction/indirection originally designed for the SLoRA work, not vLLM.
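The shift in responsibility can be illustrated with a small sketch in which the layer asks a "linear method" object to apply the weights, rather than relying on the weight's tensor type to dispatch. Everything here is an assumption made for illustration: DenseMethod, SparseMethod, Linear, and apply_weights are hypothetical names, and the list-based weight layouts are not the real kernel formats.

```python
# Hypothetical sketch: class names and data layouts are illustrative only.
class DenseMethod:
    def apply_weights(self, weight, x):
        # weight is a list of dense rows; compute a plain matrix-vector product.
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


class SparseMethod:
    def apply_weights(self, weight, x):
        # weight stores, per row, only the nonzero (column, value) pairs.
        return [sum(v * x[col] for col, v in row) for row in weight]


class Linear:
    """The layer delegates to its method object; the weight carries no dispatch logic."""

    def __init__(self, weight, method):
        self.weight = weight
        self.method = method

    def forward(self, x):
        return self.method.apply_weights(self.weight, x)


x = [1.0, 2.0, 3.0]
dense = Linear([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]], DenseMethod())
sparse = Linear([[(0, 1.0), (2, 2.0)], [(1, 3.0)]], SparseMethod())
```

Both layers produce the same result for the same logical matrix; only the method object decides which code path runs.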

This did result in the 2:4 sparse implementation breaking. However, it turns out it was already broken (i.e., it was decompressing and running dense within SparseTensor), so we "disable" it for now ("disable" meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.


afeldman-nm · 2024-02-14T06:31:17Z

Thanks for the heads-up re: 2:4, @LucasWilkinson. I agree that the torch.Tensor dispatch mechanism is unwieldy for our purposes (a bit too powerful for what we want to do, given that in vLLM we are almost exclusively dispatching to the linear operator), and that moving past it is a good choice. For example, the failure of the compressed 2:4 kernel to be invoked by vLLM is an issue with how I integrated 2:4 into SparseTensor's dispatch.

In a separate PR I will address the issue with the 2:4 integration. For now deactivating 2:4 is a good choice.

vllm/model_executor/layers/linear.py

Co-authored-by: Andrew Feldman <[email protected]>

afeldman-nm and others added 19 commits February 1, 2024 23:41

.gitignore magic_wand dir

b8810c7

added 2:4 example (not actually using 2:4 yet!)

d56b4c4

use only cuda:0

1a8bc1c

wip semi_structured_sparse_w16a16

2c6ff26

restructuring sparsity

2856b91

difficulty creating sparse parameter class

708fe1b

first successful run with 2:4 sparse model; compat with magic_wand branch safe_expose_semi_structured_sparse_tensor

40a8afb

Merge branch 'main' into semi_structured

017a296

woops uncommenting assert statement

a344b60

fixes

7a2a7ed

bfloat16

0711a74

hopefully removed magic_wand submodule

fc85cac

wip bench

d7b2f41

initial integration

ef64711

disable the semi-sparse stuff temporarily

202e655

Merge branch 'main' into lwilkinson/sparse-fused-gemm-integration

7f67d62

format fix

131a0a5

remove sparse benchmark

5c6a55e

small format fix

ae57f2c

LucasWilkinson marked this pull request as ready for review February 14, 2024 03:55

LucasWilkinson requested review from mgoin, afeldman-nm, tlrmchlsmth, robertgshaw2-neuralmagic and alexm-neuralmagic February 14, 2024 03:56

LucasWilkinson added 3 commits February 13, 2024 23:19

remove useless comments

fb95394

cleanup spacing

b5ffb39

revert

9b69f56

missed pack

1fbc82f

mgoin reviewed Feb 14, 2024

vllm/model_executor/layers/linear.py

mgoin approved these changes Feb 14, 2024


LucasWilkinson merged commit 4f225b4 into main Feb 14, 2024
2 checks passed

LucasWilkinson deleted the lwilkinson/sparse-fused-gemm-integration branch February 14, 2024 22:53

andy-neuma mentioned this pull request Feb 23, 2024

andy/bump main to v0.3.2 #49

Closed
