Marlin downstream PR #13

alexm-neuralmagic · 2024-02-14T17:52:48Z

No description provided.

LucasWilkinson · 2024-02-14T18:34:03Z

csrc/ops.h

-  int block_size,
-  int max_context_len,
-  const c10::optional<torch::Tensor>& alibi_slopes);
+void paged_attention_v1(torch::Tensor &out, torch::Tensor &query,


We should probably avoid reformatting this file; it will cause headaches later when syncing with the main vLLM repo.

LucasWilkinson · 2024-02-14T18:34:38Z

vllm/model_executor/layers/parameters/sparsity.py

@@ -1,29 +1,35 @@
 import torch

-from magic_wand import SparseTensor, SparseBitmaskStorageFormat
+from typing import Type


Why is this file changing? It seems unrelated to Marlin.

LucasWilkinson · 2024-02-14T18:35:33Z

csrc/pybind.cpp

 #endif
  ops.def("gptq_gemm", &gptq_gemm, "Quantized GEMM for GPTQ");
  ops.def("gptq_shuffle", &gptq_shuffle, "Post processing for GPTQ");
  ops.def("squeezellm_gemm", &squeezellm_gemm, "Quantized GEMM for SqueezeLLM");
-
+  


nit: remove unnecessary format change

mgoin · 2024-02-14T19:34:29Z

vllm/config.py

@@ -148,9 +148,9 @@ def _verify_tokenizer_mode(self) -> None:
        self.tokenizer_mode = tokenizer_mode

    def _verify_sparsity(self) -> None:
-        supported_sparsity = ["sparse_w16a16"]
+        supported_sparsity = ["sparse_w16a16", "semi_structured_sparse_w16a16"]


Please rebase onto (or merge with) our main properly; this diff seems to have picked up some recent changes that don't belong to it.
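
For context, the check touched by this hunk can be sketched roughly as follows. This is a minimal, standalone illustration and not the repository's code: the diff shows only the `supported_sparsity` list, so the function name `verify_sparsity`, the handling of `None`, and the error message are all assumptions.

```python
from typing import Optional, Sequence

# The two values that appear on the "+" line of the hunk above.
SUPPORTED_SPARSITY = ("sparse_w16a16", "semi_structured_sparse_w16a16")


def verify_sparsity(sparsity: Optional[str],
                    supported: Sequence[str] = SUPPORTED_SPARSITY) -> None:
    """Hypothetical stand-in for the validation; the real method body is not shown in the diff."""
    if sparsity is None:
        # Assumption: no sparsity method configured means there is nothing to validate.
        return
    if sparsity not in supported:
        raise ValueError(
            f"Unknown sparsity method: {sparsity!r}. "
            f"Must be one of {list(supported)}.")
```

Under these assumptions, `verify_sparsity("semi_structured_sparse_w16a16")` returns silently, while an unlisted value raises `ValueError`.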

robertgshaw2-neuralmagic · 2024-02-18T20:58:52Z

Closing in favor of #26

afeldman-nm and others added 30 commits February 1, 2024 23:41

.gitignore magic_wand dir

b8810c7

added 2:4 example (not actually using 2:4 yet!)

d56b4c4

use only cuda:0

1a8bc1c

wip semi_structured_sparse_w16a16

2c6ff26

restructuring sparsity

2856b91

difficulty creating sparse parameter class

708fe1b

first successful run with 2:4 sparse model; compat with magic_wand branch safe_expose_semi_structured_sparse_tensor

40a8afb

Merge branch 'main' into semi_structured

017a296

woops uncommenting assert statement

a344b60

fixes

7a2a7ed

bfloat16

0711a74

hopefully removed magic_wand submodule

fc85cac

refactoring

ced7222

small cleanup

ded2c5b

small formatting fix

32fa245

Apply suggestions from code review

95303b3

lint/format

51ebca3

Merge pull request #4 from neuralmagic/semi_structured

6075c74

Semi-structured 2:4 sparsity via SparseSemiStructuredTensor

marlin

61b3f41

added marlin

06799db

trying to load packed weights turning out to be tricky

5b0311e

trying to load packed weights turning out to be tricky due to qkv

2018a52

integrated marlin for single gpu

6d72f3d

Update llama.py

0880ce3

Fixes to Marlin quantization to allow execution via CUDA graphs capture (eager_force=False)

4b877a5

Integrate @efrantar's changes for CUDA graphs

bd50dfb

review comments based on zhyncs

bc8d8bb

(1) Integrate the latest changes from Elias that improve large batch size by running multiple parallel problems of size 64. (2) Refactor the workspace to be dynamic per layer

3c3f35a

add bug fix

6830042

refactored some of alex's work to be consistent with the gptq config

bf3a19b

robertgshaw2-neuralmagic and others added 12 commits February 14, 2024 12:48

updated to load model based on hf_config from AutoGPTQ

ffc19df

Reduce Marlin's kernel limitation of thread_n from 256 to 64 (to avoid issues with tensor parallel runs)

88a8ea2

Update checks related to MarlinConfig

c81b1df

formatting

9c29e08

Update ops.h

1f04049

cleanup to undo autoformatting

Update ops.h

54720de

cleanup formatting

readded marlin

715acc1

Bug fix for determination of the scales size in marlin layer

d15045e

Ensure marlin only compiles for GPU compute capability >= 8.0

60694b0

fix marlin compilation again

4303b89

merge fix

3527690

sync

18a23f2

alexm-neuralmagic self-assigned this Feb 14, 2024

alexm-neuralmagic requested review from robertgshaw2-neuralmagic, tlrmchlsmth, mgoin and LucasWilkinson February 14, 2024 18:26

LucasWilkinson reviewed Feb 14, 2024

mgoin reviewed Feb 14, 2024

robertgshaw2-neuralmagic closed this Feb 18, 2024
