
Refactoring for maintainability #4

Conversation

DhruvaBansal00

Refactoring the Marlin MoE implementation for maintainability and mirroring the AWQ codepath


github-actions bot commented Aug 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

from .fused_moe import fused_topk, moe_align_block_size, try_get_optimal_moe_config


def fused_moe_gptq(
Collaborator:

can you call this fused_moe_marlin? We want to separate the naming of the kernel from the algorithm
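
For illustration, the naming separation could look like the sketch below; the signature and docstring are assumptions for this example, not the PR's actual code.

def fused_moe_marlin(hidden_states, w13_qweight, w2_qweight, gating_output,
                     topk, renormalize=True):
    """Fused MoE forward pass backed by the Marlin kernel.

    The name refers to the kernel (Marlin) rather than the quantization
    algorithm (GPTQ or AWQ), so several quantization methods can share
    the same entry point.
    """
    ...  # kernel dispatch elided in this sketch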

hidden_size: int, intermediate_size: int,
params_dtype: torch.dtype, **extra_weight_attrs):

def create_weights(
Collaborator:

nit: don't touch unchanged lines

@@ -386,7 +162,7 @@ def forward_tpu(
class FusedMoE(torch.nn.Module):
"""FusedMoE layer for MoE models.

This layer contains both MergedColumnParallel weights (gate_up_proj /
This layer contains both MergedColumnParallel weights (gate_up_proj /
Collaborator:

nit: don't touch unchanged lines

@@ -377,6 +152,7 @@ def forward_tpu(
topk_group: Optional[int],
) -> torch.Tensor:
from vllm.model_executor.layers.fused_moe.moe_pallas import fused_moe

Collaborator:

nit: don't touch unchanged lines

@@ -491,8 +267,8 @@ def weight_loader(self,
else:
# Input scales can be loaded directly and should be equal.
if "input_scale" in weight_name:
if param_data[expert_id] != 1 and (param_data[expert_id] -
loaded_weight).abs() > 1e-5:
if (param_data[expert_id] != 1 and
Collaborator:

nit: don't touch unchanged lines

if param_data[expert_id] != 1 and (param_data[expert_id] -
loaded_weight).abs() > 1e-5:
if (param_data[expert_id] != 1 and
(param_data[expert_id] - loaded_weight).abs() > 1e-5):
Collaborator:

nit: don't touch unchanged lines

@@ -546,7 +322,8 @@ def forward(self, hidden_states: torch.Tensor,
renormalize=self.renormalize,
use_grouped_topk=self.use_grouped_topk,
num_expert_group=self.num_expert_group,
topk_group=self.topk_group)
topk_group=self.topk_group,
Collaborator:

nit: don't touch unchanged lines

ckpt_up_proj_name: str,
num_experts: int) -> List[Tuple[str, str, int, int]]:

cls,
Collaborator:

nit: don't touch unchanged lines

num_expert_group: Optional[int] = None,
topk_group: Optional[int] = None,
) -> torch.Tensor:
if layer.marlin_state == GPTQMarlinState.REPACK:
Collaborator:

Instead of this REPACK, please do the repacking in process_weights_after_loading

you can look at gptq_marlin.py for an example
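
For illustration, a minimal sketch of that pattern follows; the attribute names (w13_qweight, w2_qweight) and the repack helper are assumptions for this example and are not taken from gptq_marlin.py or this PR.

import torch


class MarlinMoEMethodSketch:
    """Sketch: repack the quantized expert weights once after loading,
    instead of branching on a REPACK state inside the forward path."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Convert the GPTQ-packed expert weights into the Marlin layout a
        # single time, right after the checkpoint weights are loaded.
        layer.w13_qweight = torch.nn.Parameter(
            self.marlin_repack(layer.w13_qweight.data), requires_grad=False)
        layer.w2_qweight = torch.nn.Parameter(
            self.marlin_repack(layer.w2_qweight.data), requires_grad=False)

    def marlin_repack(self, qweight: torch.Tensor) -> torch.Tensor:
        # Placeholder for the actual Marlin repack op.
        raise NotImplementedError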

Author:

Addressed this comment. Also deduplicated code with marlin_utils

@robertgshaw2-neuralmagic (Collaborator)

Thanks for the PR! This looks really good.

Other than spurious change nits, the key feedback is:

  • the repacking should happen in process_weights_after_loading. You can look at gptq_marlin.py for an example

One other thing: I wonder if there is a better way to do the make_expert_params_mapping, but this could be a follow-up.
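
For context, the mapping in question pairs each per-expert checkpoint weight with the fused parameter it loads into. The sketch below shows the rough shape of such a mapping; the string prefixes are assumptions based only on the signature visible in this diff.

from typing import List, Tuple


def make_expert_params_mapping_sketch(
        ckpt_gate_proj_name: str, ckpt_down_proj_name: str,
        ckpt_up_proj_name: str,
        num_experts: int) -> List[Tuple[str, str, int, int]]:
    # Each tuple: (fused param prefix, checkpoint weight prefix,
    #              expert id, shard id).
    projections = [ckpt_gate_proj_name, ckpt_down_proj_name, ckpt_up_proj_name]
    mapping: List[Tuple[str, str, int, int]] = []
    for expert_id in range(num_experts):
        for shard_id, proj_name in enumerate(projections):
            # Gate/up projections load into the merged w13 weight, the down
            # projection into w2 (illustrative names).
            fused_prefix = ("experts.w2_" if proj_name == ckpt_down_proj_name
                            else "experts.w13_")
            mapping.append((fused_prefix,
                            f"experts.{expert_id}.{proj_name}.",
                            expert_id, shard_id))
    return mapping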

@DhruvaBansal00 (Author)

I think the change nits happened because of the formatter I am using on save. Will fix it right now.
The other two comments should be addressed.

@DhruvaBansal00 (Author)

@robertgshaw2-neuralmagic resolved all formatting changes. Let me know if this is good to go!

@DhruvaBansal00 (Author)

/ready

github-actions bot added the ready label Aug 15, 2024
Comment on lines 13 to 14
from vllm.model_executor.layers.fused_moe import fused_moe, single_marlin_moe
from vllm.model_executor.layers.fused_moe.fused_moe_marlin import fused_moe_marlin


nit: I think it would be good to keep single_marlin_moe in the same place as fused_moe_marlin, even if the former is only used for testing
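
Illustratively, once both helpers live in the same module, the test import could collapse to something like the following (hypothetical layout, assuming single_marlin_moe is moved next to fused_moe_marlin):

from vllm.model_executor.layers.fused_moe.fused_moe_marlin import (
    fused_moe_marlin, single_marlin_moe)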

@@ -22,33 +22,49 @@
# limitations under the License.
"""Inference-only Mixtral model."""
from typing import Iterable, List, Optional, Tuple

import re


nit: can you make sure to remove all unused imports?

@DhruvaBansal00 (Author)

@ElizaWszola thank you for the feedback! I have made the requested changes and also ran tests again. Hope things look good to merge now!

Would love to help expedite work on supporting 8-bit quantized models as well (these are returning incorrect outputs on my end). Happy to chat sometime!

@ElizaWszola

This looks good overall!

Just two small remaining things:

  • can you make sure that offline_inference.py runs to completion and produces sane output for llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ", revision="gptq-4bit-128g-actorder_True")? (a minimal smoke-test sketch follows after this list)
  • can you double check that the code conforms to the output of format.sh?
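
As mentioned in the first item above, a minimal smoke test along these lines could be used; the prompts and sampling settings are illustrative, while the model and revision are the ones named above.

from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ",
          revision="gptq-4bit-128g-actorder_True")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Eyeball the completions: garbage text here would indicate a broken
    # quantized MoE path even if the run finishes without errors.
    print(output.prompt, "->", output.outputs[0].text)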

@robertgshaw2-neuralmagic (Collaborator)

@ElizaWszola - I think we can merge this + take it from here

@ElizaWszola

I'm merging this now. Thanks @DhruvaBansal00!

ElizaWszola merged commit 34bb5b0 into neuralmagic:marlin-moe-integration Aug 22, 2024
LucasWilkinson pushed a commit that referenced this pull request Sep 3, 2024
magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.