[Disco] Switch to build-time sharding and enable FT quantization #55

Merged: 7 commits, Nov 7, 2023

@masahi commented Nov 7, 2023

Building on @Lunderberg's work in mlc-ai#1096, this PR switches mlc_serve to build-time sharding. Runtime sharding is no longer supported, so you must pass --use-presharded-weights when building a model. You also need the latest contrib-vllm.

Build-time sharding also lets us support FT quantization with Disco. Weight preprocessing is now applied after sharding. This has been confirmed to work on 7B and 13B models, with both q4f16_ft and q8f16_ft, using --num-shards=2. Other configurations will be tested later.
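To make the "preprocess after sharding" ordering concrete, here is a minimal toy sketch in plain Python. It is not the code from this PR: `shard_columns`, `toy_preprocess`, and both pipeline functions are hypothetical names, and the transpose is only a stand-in for a layout-changing preprocessing step (the real FT preprocessing is different). It only illustrates that, for a layout-changing step, sharding first and preprocessing each shard is not the same as preprocessing the full weight and then splitting it.

```python
from typing import List

Matrix = List[List[float]]


def shard_columns(weight: Matrix, num_shards: int) -> List[Matrix]:
    """Split a weight matrix into equal column blocks, one block per shard."""
    cols = len(weight[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [
        [row[i * width:(i + 1) * width] for row in weight]
        for i in range(num_shards)
    ]


def toy_preprocess(weight: Matrix) -> Matrix:
    """Hypothetical stand-in for a layout-changing preprocessing step (a transpose)."""
    return [list(col) for col in zip(*weight)]


def shard_then_preprocess(weight: Matrix, num_shards: int) -> List[Matrix]:
    """Ordering described in the PR: shard first, then preprocess each shard."""
    return [toy_preprocess(s) for s in shard_columns(weight, num_shards)]


def preprocess_then_shard(weight: Matrix, num_shards: int) -> List[Matrix]:
    """Opposite ordering, shown only for comparison."""
    return shard_columns(toy_preprocess(weight), num_shards)


if __name__ == "__main__":
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(shard_then_preprocess(w, 2))
    print(preprocess_then_shard(w, 2))
```

In this toy case the two orderings produce different per-shard tensors, which is why the order in which the steps run has to be fixed at build time.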

@masahi masahi merged commit 9c006fd into octoml:batch-serving Nov 7, 2023
@Lunderberg

As an addendum, the build.py script must be run in two steps: once with --convert-weight-only and once with --build-model-only. This is enforced by a check linked from the original comment, and is due to the same parameter size handling described in a second link there.
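As a rough sketch of the two-step build, the helper below assembles the two command lines without running them. The flags --use-presharded-weights, --num-shards, --convert-weight-only, and --build-model-only come from this thread; the `--model` and `--quantization` flag names, the `python build.py` invocation, and the model name are assumptions for illustration, so check them against your checkout before use. The step order follows the order given in the comment above.

```python
import shlex
from typing import List


def two_step_build_commands(
    model: str, quantization: str, num_shards: int
) -> List[List[str]]:
    """Return argv lists for the weight-conversion step and the model-build step."""
    common = [
        "python", "build.py",            # assumed way of invoking the script
        "--model", model,                # assumed flag name
        "--quantization", quantization,  # assumed flag name
        f"--num-shards={num_shards}",
        "--use-presharded-weights",
    ]
    return [
        common + ["--convert-weight-only"],
        common + ["--build-model-only"],
    ]


if __name__ == "__main__":
    # "hypothetical-model-7b" is a placeholder, not a model named in this PR.
    for cmd in two_step_build_commands("hypothetical-model-7b", "q4f16_ft", 2):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the flag names have been verified.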
