[Disco] Switch to build-time sharding and enable FT quantization #55

masahi · 2023-11-07T20:27:35Z

Building on @Lunderberg's work in mlc-ai#1096, we are now switching to build-time sharding in mlc_serve. Runtime sharding is no longer supported, and when building a model you must add --use-presharded-weights. You also need the latest contrib-vllm.

Build-time sharding also lets us support FT quantization with Disco, since weight preprocessing is now applied after sharding. This has been confirmed to work on the 7B and 13B models with both q4f16_ft and q8f16_ft, using --num-shards=2. Other configurations will be tested later.
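To illustrate why the order of sharding and preprocessing can matter, here is a toy sketch in plain Python. It is not the actual FT preprocessing: `toy_preprocess` is a hypothetical stand-in that merely reverses each row, chosen only because it rearranges the weight layout. `shard_columns` is likewise a hypothetical helper, and splitting along columns is an assumption made for the example.

```python
def shard_columns(weight, num_shards):
    """Split a 2-D weight (a list of rows) into equal-width column shards."""
    cols = len(weight[0])
    assert cols % num_shards == 0, "column count must divide evenly"
    width = cols // num_shards
    return [
        [row[i * width:(i + 1) * width] for row in weight]
        for i in range(num_shards)
    ]


def toy_preprocess(weight):
    """Hypothetical layout-changing step: reverse every row.

    This only stands in for a preprocessing pass that depends on the
    layout of the tensor it is given; it is NOT the real FT routine.
    """
    return [list(reversed(row)) for row in weight]


weight = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
]

# Order used with build-time sharding: shard first, preprocess each shard.
shard_then_preprocess = [toy_preprocess(s) for s in shard_columns(weight, 2)]

# The other order: preprocess the full weight, then shard the result.
preprocess_then_shard = shard_columns(toy_preprocess(weight), 2)

print(shard_then_preprocess)
print(preprocess_then_shard)
```

In this toy case the two orders give different per-shard contents, which is the kind of mismatch that preprocessing each shard on its own avoids.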

Lunderberg · 2023-11-07T21:45:12Z

As an addendum, the build.py script must be run in two steps, once with --convert-weight-only and once with --build-model-only. This is implemented in this check, and is due to the same parameter size handling described here.
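The two-step invocation can be sketched as a small Python helper that assembles both command lines without running them. Only `--convert-weight-only`, `--build-model-only`, `--use-presharded-weights` and `--num-shards` come from this thread. The `python3 build.py` entry point, the `--model` and `--quantization` arguments, the placeholder model name, and running the weight conversion first are assumptions made for illustration.

```python
import shlex


def two_step_build_commands(model, quantization, num_shards):
    """Return the two build.py command lines as argv lists.

    Assumptions (not confirmed in this thread): the entry point is
    invoked as "python3 build.py", and "--model" / "--quantization"
    are how the model and quantization mode are selected.
    """
    common = [
        "python3", "build.py",
        "--model", model,                 # assumed flag
        "--quantization", quantization,   # assumed flag
        f"--num-shards={num_shards}",
        "--use-presharded-weights",
    ]
    return [
        common + ["--convert-weight-only"],  # step 1: convert weights only
        common + ["--build-model-only"],     # step 2: build the model only
    ]


# "my-7b-model" is a placeholder, not a real model name.
commands = two_step_build_commands("my-7b-model", "q4f16_ft", 2)
for argv in commands:
    print(shlex.join(argv))
```

Building the argv lists instead of shell strings keeps quoting explicit, so each list could be passed to `subprocess.run` once the flags have been checked against the actual script.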

masahi added 7 commits November 7, 2023 01:29

wip

09533b0

works with ThreadedSession

6285f1a

wip

89dfe4d

wip

8149ea2

automatically set use-presharded-weights for FT

1330ac3

conditionally apply preprocessing

7ed660e

fix

1c65fa6

masahi merged commit 9c006fd into octoml:batch-serving Nov 7, 2023
