[Disco] Switch to build-time sharding and enable FT quantization #55
Building on @Lunderberg's work in mlc-ai#1096, we are now switching to build-time sharding in `mlc_serve`. Runtime sharding is no longer supported, and when building a model you must add `--use-presharded-weights`. You also need the latest `contrib-vllm`.

Build-time sharding also lets us support FT quantization with Disco: weight preprocessing is now applied after sharding. It has been confirmed to work on 7B and 13B models with both `q4f16_ft` and `q8f16_ft`, using `--num-shards=2`. Other configs will be tested later.
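
To illustrate why the order "shard first, then preprocess" matters, here is a minimal toy sketch in plain Python. It is not the code in this PR: `shard_rows`, `toy_preprocess`, and `build_presharded` are hypothetical names, and the row interleave is only a stand-in for a layout-dependent preprocessing step, not the actual FT weight preprocessing.

```python
from typing import Callable, List

Matrix = List[List[int]]


def shard_rows(weight: Matrix, num_shards: int) -> List[Matrix]:
    """Split a weight matrix into equal, contiguous row blocks (one per shard)."""
    rows = len(weight)
    if rows % num_shards != 0:
        raise ValueError("row count must be divisible by num_shards")
    step = rows // num_shards
    return [weight[i * step:(i + 1) * step] for i in range(num_shards)]


def toy_preprocess(weight: Matrix) -> Matrix:
    """Hypothetical layout transform: interleave the top and bottom row halves.

    Assumption: this only mimics the fact that the result depends on the shape
    of the tensor it is applied to; it is not the real FT preprocessing.
    """
    half = len(weight) // 2
    out: Matrix = []
    for top, bottom in zip(weight[:half], weight[half:]):
        out.append(top)
        out.append(bottom)
    return out


def build_presharded(
    weight: Matrix, num_shards: int, preprocess: Callable[[Matrix], Matrix]
) -> List[Matrix]:
    """Build-time flow: shard the weight first, then preprocess each shard."""
    return [preprocess(shard) for shard in shard_rows(weight, num_shards)]


# An 8x2 toy weight whose entries encode their original (row, column) position.
weight = [[r * 10 + c for c in range(2)] for r in range(8)]

# Shard first, then preprocess each shard independently.
shards = build_presharded(weight, 2, toy_preprocess)

# Reverse order: preprocess the full weight, then shard. Each shard now holds
# rows that belong to the other shard, so the two orders are not equivalent.
wrong = shard_rows(toy_preprocess(weight), 2)
```

In this toy setup each entry of `shards` contains only rows from its own contiguous block, while `wrong` mixes rows across blocks. That is the general reason a layout-dependent preprocessing step has to run per shard once the weights are sharded at build time.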