Merge upstream nov7 (#52)
* [API] Add GenerationConfig (mlc-ai#1024)

* Fix two bugs in kv-cache backtrack loop (mlc-ai#856)

Fix two bugs in kv-cache pop loop

Bug 1: old code would stop early because output_ids was shortened in-place during the loop

Bug 2: off-by-one in backoff size due to break
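
The two bugs above follow a common pattern, which can be sketched in Python. This is only an illustration under assumptions, not the actual fix (the real loop lives in the C++ runtime): `backtrack` and `keep` are hypothetical names, and `keep` stands in for whatever check decides that a prefix of the output is acceptable.

```python
from typing import Callable, List, Tuple


def backtrack(
    output_ids: List[int], keep: Callable[[List[int]], bool]
) -> Tuple[List[int], int]:
    """Drop trailing tokens until keep(prefix) holds; return (prefix, backoff)."""
    # Snapshot the length up front: the loop bound must not depend on a
    # list that is being shortened while the loop runs (Bug 1).
    original_len = len(output_ids)
    kept_len = original_len
    while kept_len > 0 and not keep(output_ids[:kept_len]):
        kept_len -= 1
    # Derive the backoff from the two lengths instead of a counter that an
    # early break could leave one short (Bug 2).
    backoff = original_len - kept_len
    return output_ids[:kept_len], backoff
```

For example, `backtrack([1, 2, 3, 4, 5], lambda p: 4 not in p)` keeps `[1, 2, 3]` and reports a backoff of 2, which is the number of kv-cache entries a caller would then pop.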

* [Build] Added --pdb flag to build.py, drop into pdb on error (mlc-ai#1017)

This commit adds an optional `--pdb` flag to the `build.py` script. If
passed, any exception raised that would otherwise terminate the script
will first enter a pdb post-mortem, allowing the error to be
inspected.
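
A minimal sketch of how such a flag might be wired up, assuming an `argparse`-based entry point. The helper name `run_with_optional_pdb` and its `post_mortem` parameter are hypothetical (the latter exists only so the behavior can be exercised without an interactive debugger); the actual `build.py` code may differ.

```python
import argparse
import pdb
import sys
import traceback


def run_with_optional_pdb(main, argv, post_mortem=pdb.post_mortem):
    """Run main(remaining_args); with --pdb, open a post-mortem on failure."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--pdb", action="store_true")
    args, rest = parser.parse_known_args(argv)
    try:
        return main(rest)
    except Exception:
        if not args.pdb:
            raise  # default behavior: let the error terminate the script
        traceback.print_exc()  # show the error before the debugger prompt
        post_mortem(sys.exc_info()[2])  # inspect the failing frame
```

A script would typically call this as `run_with_optional_pdb(main, sys.argv[1:])`.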

* [Android] Use `AlertDialog` instead of `Toast` (mlc-ai#1039)

* Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (mlc-ai#1040)

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model

* [Android] Add Llama2 q4f16_0 (mlc-ai#1041)

Llama2 q4f16_0

* [Docs] Model prebuilts tracking page revamp (mlc-ai#1000)

* Update compile_models.rst (mlc-ai#1038)

fix permission issue

* Support for the Stable LM 3B model (mlc-ai#1008)

Support for the stablelm-3b-4e1t model

* [Docs] Iterate model prebuilts docs (mlc-ai#1043)

* Iterate model prebuilts docs

* small fix

* Update README.md

* [CPP] Separate common utils out from llm_chat.cc (mlc-ai#1044)

This PR separates out the tokenizer creation function, the
random number generator out from `llm_chat.cc` as a preparation
step for batching inference support, since these functions/modules
are also used in the same way in batching inference.

* Update README.md (mlc-ai#1045)

Update README.md

* add verbose stats to mlc-chat REST API (mlc-ai#1049)

* add verbose stats to mlc-chat REST API

* update docs

* [Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)

* [Transform] Apply split_rotary optimization on prefill

Prior to this commit, the `transform.fuse_split_rotary_embedding`
function was only applicable to the `decode` function of a Llama-type
model.  This was due to the sequence length being restricted to one,
both in the pattern-match rule and in the `split_rotary` function, and
the function being restricted to operate only on the `decode`
function.

This commit updates the `transform.fuse_split_rotary_embedding` pass
to be a `tvm.ir.transform.Pass`, operating on all applicable matches
in the `IRModule`.  The `split_rotary` function is now produced as a
fully-generic function, with static parameters substituted in
afterwards.  At this stage, the sequence length is retained as a
dynamic parameter, such that it can be used by the `prefill` function.

* Avoid multiple kernel launches for split_rotary

* [Docs] Add `mlc.ai/package` to `DEPENDENCY INSTALLATION` group (mlc-ai#1055)

Co-authored-by: Junru Shao <[email protected]>

* Revert "[Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)" (mlc-ai#1058)

This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment)

* [BugFix] Set the right `max_sequence_length` for both Llama-1 and Llama-2 families (mlc-ai#1032)

* fix

* reflect feedback

---------

Co-authored-by: “Sunghyun <[email protected]>

* [Doc] Use -U instead of --force-reinstall (mlc-ai#1062)

`--force-reinstall` will reinstall all dependencies to a python package,
which is unnecessary. `-U` is a better choice in this case.

* [Model] Initial batching support for Llama (mlc-ai#1048)

This PR introduces the initial batched input support for llama
models. To make the code manageable, we keep both the single-sequence
handling flow and the batching handling flow in the Llama modeling.

Now, with `--enable-batching` as a build argument, we build Llama
for the batched version.

NOTE: The paged attention kernel/TIR func are not included in this PR,
so currently the built library with batching enabled is not runnable.
We will follow up with the attention kernel in the future.

This PR guarantees that the existing single-sequence inference (Python
API, CLI, etc.) is not broken.

P.S. The batching flow is subject to bug fixes as we integrate with
the attention function and run the e2e flow in the future.

* Fix Stable LM 3B build (mlc-ai#1061)

* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size"

* Add get_num_key_value_heads method to StableLM3bConfig

* [Core] Remove duplication in MODEL.get_model calls (mlc-ai#1054)

This commit removes the `if`/`elif` chain in `core.py`, where the body
of each conditional assigns the same `mod, param_manager, params,
model_config`, and is identical except for the choice of model being
built.
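
The refactoring described above amounts to replacing a conditional chain with a lookup table. The sketch below shows the general idea only; `MODEL_BUILDERS`, `build_llama`, `build_gpt_neox`, and the placeholder return values are hypothetical and do not reflect the real contents of `core.py`.

```python
from typing import Any, Callable, Dict, Tuple

BuildResult = Tuple[Any, Any, Any, Dict[str, Any]]


def build_llama(args: Any) -> BuildResult:
    # Placeholder builder returning (mod, param_manager, params, model_config).
    return ("llama-mod", "llama-param-manager", None, {"arch": "llama"})


def build_gpt_neox(args: Any) -> BuildResult:
    return ("neox-mod", "neox-param-manager", None, {"arch": "gpt_neox"})


# One table entry per model replaces one if/elif branch per model.
MODEL_BUILDERS: Dict[str, Callable[[Any], BuildResult]] = {
    "llama": build_llama,
    "gpt_neox": build_gpt_neox,
}


def get_model(model_category: str, args: Any) -> BuildResult:
    try:
        builder = MODEL_BUILDERS[model_category]
    except KeyError:
        raise ValueError(f"Model {model_category!r} not supported") from None
    # The shared assignment now appears exactly once.
    mod, param_manager, params, model_config = builder(args)
    return mod, param_manager, params, model_config
```

Adding a model then means adding one entry to the table rather than another branch.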

* [ParamManager] Cleanup creation of quantization IRModule (mlc-ai#1053)

This commit replaces the single-parameter
`relax_model.param_manager.create_quantize_func` function with a
method on the `ParamManager`, `create_parameter_transformation`.  This
avoids potential typos between `param_manager` as the imported Python
module `mlc_llm.relax_model.param_manager` and an instance of the
`ParamManager` class named `param_manager`, and makes the
functionality easier to find.

This function also takes an optional `optimize_parameter_order` flag,
defaulting to `True`, which applies the `ReorderTransformFunc` pass.
Since the `ReorderTransformFunc` is intended to be used with several
configuration objects owned by `ParamManager`, this simplifies the
common path of producing an optimally-ordered parameter transformation
module.

* Minor typo fix (mlc-ai#1064)

* Add links to Python API Reference (mlc-ai#1068)

* [Fix] ChatModule incorrect temperature buffer shape (mlc-ai#1070)

PR mlc-ai#1048 updated the signature of softmax in the built model library
and changed the temperature buffer shape in ChatModule. This left
some existing demos unable to run, since we did not do a round of model
library updates.

This PR reverts the ChatModule change, and adds back the softmax
function in non-batching case. With this PR, the regression should
be fixed.

* [ParamManager] Added progress bar for get_item/set_item (mlc-ai#1063)

* [Python] Extract common device str parse function in ChatModule (mlc-ai#1074)

This PR lifts the device string parsing (just a few lines)
to a standalone function, so that the serving side
can make use of this function as well.

Tested Python API and it does not seem to incur regression.

* [Bugfix] Compilation Error in q4f32_1 (mlc-ai#1078)

The pass `fuse-split-rotary` assumes the compute dtype is fp16, which
it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the
compute is based on fp32 instead. This PR strengthens the check guard.

* Establish `mlc_chat.compiler` (mlc-ai#1082)

This PR establishes the compiler components in MLC-Chat Python API,
which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definition of an LLM, which, as the
very first stab, contains only `LlamaForCasualLM`. It is decomposed into
three files:
- `llama_config.py`: common configurations for Llama, where we define
  relevant configurations of its architecture, as well as include
  standard config file for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like
  `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and
  PyTorch parameters.

The parameters component contains the basic functionality of parameter mapping,
and the loaders that effectively convert parameters from PyTorch to MLC
according to the mapping specified. Currently, only `HFTorchLoader` is
implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite
straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading
time transformation on MLC parameters, while pre-quantized parameter
loading is effectively parameter loading after MLC's `nn.Module` is
quantized.

Two unit tests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module`
using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load
parameters from HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which
now contains two utils:
- `config.py` which supports reading configurations into dataclasses
from JSON file or Python dict. On top of Python dataclass, it throws
irrelevant fields into `cls.kwargs`, which is helpful when loading
HuggingFace configuration file;
- `tqdm.py` which contains tqdm-related utilities, primarily redirecting
logging and printing to work nicely with tqdm.
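
The `config.py` behavior described above (unknown fields collected into `kwargs`) can be sketched roughly as follows. This is an assumption-laden illustration, not the code from `mlc_chat.support`: `ExampleConfig`, its fields, and `from_dict` are hypothetical names.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ExampleConfig:
    hidden_size: int
    num_layers: int
    # Fields the dataclass does not declare end up here instead of raising.
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ExampleConfig":
        known = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        fields = {k: v for k, v in source.items() if k in known}
        extra = {k: v for k, v in source.items() if k not in known}
        return cls(**fields, kwargs=extra)
```

Loading a HuggingFace-style dict such as `{"hidden_size": 4096, "num_layers": 32, "torch_dtype": "float16"}` would then succeed, with `torch_dtype` preserved in `kwargs` rather than rejected.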

* Update README.md for Multi-GPU (mlc-ai#1090)

* Support lib_path override in C++. Improvements on docs and error messages (mlc-ai#1086)

* Support lib_path option in C++ CLI. Disable ChatConfig.model_lib override in Python API. Improvements on helper messages and error messages

* Update docs

* Rename lib_path -> model_lib_path

* StreamIterator (mlc-ai#1057)

Co-authored-by: Varshith <[email protected]>

* Update `benchmark.py` according to mlc-ai#1086 (mlc-ai#1091)

Update `benchmark.py`

* Disable Disco for q4f16_ft and q8f16_ft quantization (mlc-ai#1094)

* [Format] Apply isort and black for `python/` (mlc-ai#1097)

[Format] Apply isort and black on `python/`

The commands I am using are:

```
isort --profile black python/
black python/
```

It is always recommended to format the code before submission, given we
don't have a linter CI yet.

* More formatting (mlc-ai#1099)

* Enable Python Linter (mlc-ai#1098)

This PR enables two Python formatters, "black" and "isort", on the following directories:
- `./python/`
- `./tests/python/`

Enabling pylint and mypy is left for future work.

* Add Basic Pylint and Mypy Tooling (mlc-ai#1100)

Add pylint/mypy tooling into pyproject.toml

This PR establishes the initial Python tooling infra with Pylint and
Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and
`mlc_chat.compiler` are covered, and we expect to cover the entire
package, as tracked in mlc-ai#1101.

* [CI] Add clang-format (mlc-ai#1103)

* [Slim-LM] Smart path finding for config and weight (mlc-ai#1088)

* [Transform] Provide IRModule transform for rewrite_attention (mlc-ai#1052)

Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a
single function.  This commit modifies it to instead be a transform
operating on any pattern matches within an `IRModule`.

* [ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)

* [ParamManager] Use BundleModelParams for transform_quantize

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on a
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

* Correct type annotation

* [Slim-LM] Introduce HFLoad for loading Pytorch and SafeTensor weights (mlc-ai#1113)

* [WINDOWS] reduce noise in windows build (mlc-ai#1115)

* Add CLI commands for compilation (mlc-ai#1109)

* Auto updated submodule references

* fix mismatched argument name (mlc-ai#1117)

fix error introduced by recent code changes

fixes mlc-ai#1116

* [Docs] Add doc for max and mean gen len, shift factor; and buildArgs (mlc-ai#1119)

* Add doc for max and mean gen len, shift factor

* Update python docs for BuildArgs

* Revert "[ParamManager] Use BundleModelParams for transform_dequantize" (mlc-ai#1120)

Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)"

This reverts commit e5927ce.

This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment)

* Remove inaccurate warning message (mlc-ai#1121)

This PR removes an inaccurate warning from mlc-ai#1086, which warns about
`model_lib` overriding regardless of whether or not it's actually
overridden. With this commit, we only warn if its value is not None.

* [REST] OpenAI compatible Rest API (mlc-ai#1107)

* add presence and frequency penalty

* Added support for passing conversation history in /v1/chat/completions endpoint

* Added support for RestAPI parameters max_gen_len, n, and stop_str

* * add presence and frequency penalty to generation config
* refactor generation config

* Added documentation for parameters

* replace lib_path with model_lib_path in rest.py

* fixed black isort issues

* fix lib_path

* Add --opt flag parsing to CLI (mlc-ai#1123)

* [ParamManager][Redo] Use BundleModelParams for transform_dequantize (mlc-ai#1127)

Prior to this commit, `ParamManager.transform_quantize` function took
as input functions with separate parameters for each weight tensor,
and produced output functions with a tuple parameter for all weights.
Because `LiftTransformParams` had the same convention, neither could
be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce
outputs with separate tensor parameters, using the `BundleModelParams`
transform to later combine them into a single tuple parameter.  The
analogous change was also performed for `LiftTransformParams` as part
of apache/tvm#15657.

In addition, prior to this commit, the
`ParamManager.transform_dequantize` function operated directly on a
`IRModule` object.  As a result, any debug instrumentation
(e.g. before/after printouts for each pass, before/after verification
with `relax.analysis.well_formed`, etc.) did not apply to this
`transform_dequantize`.  This commit updates
`ParamManager.transform_dequantize` to return a `ir.transform.Pass`.

This commit is a repeat of the reverted PR
mlc-ai#1056.  This PR resolves the bug
in the earlier implementation by removing the call to
`.without_attr("num_input")` in `ParamReplacer.rewrite_func`.  This
follows an analogous update in `LiftTransformParams`, preserving the
`"num_input"` attribute for use in `BundleModelParams`.

* added details to windows installation (mlc-ai#1133)

The 32-bit version of the zstd.dll library was causing issues, so the doc is now more specific and directs users to download the 64-bit version.

* Grammatical and Typographical improvements (mlc-ai#1139)

* Update faq.rst

* Update guideline.rst

* Update compile_models.rst

* Update distribute_compiled_models.rst

* Update get-vicuna-weight.rst

* Update python.rst

* Update android.rst

* Update cli.rst

* Update ios.rst

* Update javascript.rst

* Update python.rst

* Update rest.rst

* Minor enhancements to `ChatModule` (mlc-ai#1132)

Some minor enhancements to `ChatModule`: device parsing is now handled solely in `_parse_device_str`, instead of in both the member function and `__init__`, to avoid redundancy; some type annotations are also fixed.

* Updating tvm install docs (mlc-ai#1143)

Updating the tvm install docs to assist a user in finding and copying zstd.dll to the correct folder.

* Make the help info consistent with program name (mlc-ai#1137)

When a user runs the command `mlc_chat_cli --help`, the output will be
something like

Usage: mlc_chat [--help] ...

That's because the program name specified in `cli_main.cc` is "mlc_chat".
It will be less confusing if the output of help info shows

Usage: mlc_chat_cli [--help] ...

* Support parameter packing (mlc-ai#1146)

* [Slim-LM] Enable Group Quant (mlc-ai#1129)

* Enable group quant via new interface.

* Minor fix.

* Linting.

* Fix isort.

* Fix mypy.

* TE compute working.

* Skip embed.

* Support cpu+gpu quantization.

* Add target option to tests.

* Linting.

* Enable Mypy and Pylint in mlc_chat Python Package (mlc-ai#1149)

* Migrate Compiler Passes (mlc-ai#1150)

* Compile Model Preset without External `config.json` (mlc-ai#1151)

This PR adds support for compiling a preset of models without
having to provide a `config.json` on disk using the commands below:

```diff
python -m mlc_chat.cli.compile \
       --quantization q4f16_1 -o /tmp/1.so \
-       --config /models/Llama-2-7b-chat-hf
+       --config llama2_7b
```

This allows easier testing and binary distribution without having to
depend on external model directory.

* Update attention layer (mlc-ai#1153)

Existing dlight optimization only works for NT matmul, but not NN. As a
result, the new `nn.Module`-based implementation, which uses NN matmul,
fails compilation at HEAD for now. This PR fixes this issue by tweaking
`k` to the preferred layout.

The following commands now work with the new compilation pipeline:

```bash
python -m mlc_chat.cli.compile --config llama2_7b  --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_13b --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so
```

Note that the quantization algorithm per se, `q4f16_1`, has not been
implemented yet, meaning this code path is not ready for use yet.

* Add batched Llama model definition using vLLM paged attention (mlc-ai#1134)

* Add batched Llama model with vllm paged attention

* update core.py

* doc

* minor

* add e2e test

* mv file

* clean

* Check if TVM has been built with USE_VLLM

* update BuildArgs docstring

* [Transform][Redo] Apply split_rotary optimization on prefill (mlc-ai#1125)

Prior to this commit, the `transform.fuse_split_rotary_embedding`
function was only applicable to the `decode` function of a Llama-type
model.  This was due to the sequence length being restricted to one,
both in the pattern-match rule and in the `split_rotary` function, and
the function being restricted to operate only on the `decode`
function.

This commit updates the `transform.fuse_split_rotary_embedding` pass
to be a `tvm.ir.transform.Pass`, operating on all applicable matches
in the `IRModule`.  The `split_rotary` function is now produced as a
fully-generic function, with static parameters substituted in
afterwards.  At this stage, the sequence length is retained as a
dynamic parameter, such that it can be used by the `prefill` function.

This commit reapplies the reverted commit
mlc-ai#1033.  The error in the
previous implementation was in the definition of
`rotary_embedding_offset`, which provided the `query_sequence_length`
instead of `kv_sequence_length`.  This was able to pass the validity
tests described
[here](mlc-ai#1058 (comment)),
as these two sequence lengths are identical for the first call.

* Apply rewrite for normal attention and MQA (mlc-ai#1138)

Fixes a bug introduced in mlc-ai#1052,
where use of the `--use-flash-attn-mqa` flag on a model that doesn't
use MQA would prevent the use of CUTLASS attention at all.

* [Rest] Fix emoji handling in Rest API. (mlc-ai#1142)

* [Utility] Check for isinstance(exc, Exception) before entering pdb (mlc-ai#1095)

This is a follow-up to mlc-ai#1017, which added a `--pdb` flag to enter a
debugger on exit.  This commit checks the type of the raised
exception, and only enters the debugger if it is a subclass of
`Exception`.  This ensures that implementation details, such as a
thrown `SystemExit` or `KeyboardInterrupt`, do not cause an erroneous
entry to pdb.
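
The type check matters because `SystemExit` and `KeyboardInterrupt` derive from `BaseException`, not `Exception`. A minimal sketch of the idea, assuming a context-manager style hook; `DebugOnError` and its `post_mortem` parameter are hypothetical and not the project's actual utility:

```python
import pdb


class DebugOnError:
    """Enter a post-mortem debugger only for genuine errors."""

    def __init__(self, enabled, post_mortem=pdb.post_mortem):
        self.enabled = enabled
        self.post_mortem = post_mortem

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # SystemExit / KeyboardInterrupt are not Exception subclasses,
        # so a normal sys.exit() or Ctrl-C does not open the debugger.
        if self.enabled and isinstance(exc, Exception):
            self.post_mortem(tb)
        return False  # never swallow the exception
```

Wrapping the script body in `with DebugOnError(args.pdb): ...` would then debug a `ValueError` but let `sys.exit(0)` pass through untouched.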

* [Utils] Remove conversion to numpy array in utils.save_params (mlc-ai#1083)

Prior to this commit, each parameter was converted to a numpy-owned
array as part of a total size computation.  This commit computes the
size directly, removing the conversion.
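
Computing the size directly needs only the shape and the element width. The sketch below illustrates that arithmetic under the assumption that each element occupies a whole number of bytes; `param_nbytes` is a hypothetical helper, not the function used in `utils.save_params`.

```python
import math
from typing import Sequence


def param_nbytes(shape: Sequence[int], dtype_bits: int) -> int:
    """Bytes occupied by a dense tensor, without materializing an array."""
    bytes_per_element = (dtype_bits + 7) // 8  # round up to whole bytes
    return math.prod(shape) * bytes_per_element


def total_nbytes(params) -> int:
    """Sum the sizes of (shape, dtype_bits) pairs."""
    return sum(param_nbytes(shape, bits) for shape, bits in params)
```

For instance, a 4096 x 4096 float16 weight occupies `param_nbytes((4096, 4096), 16)` = 33,554,432 bytes.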

* [Fix][REST] Use lower-cased "app" (mlc-ai#1159)

* [Rest] Document emoji handling (mlc-ai#1160)

Followup PR of mlc-ai#1142 to document the emoji handling.

* Enable group quant transform with nn.Module (mlc-ai#1154)

* Enable group quant transform with nn.Module

This PR completes the group quantization support for `nn.Module` based model.

* remove deprecated tests

* Update

* wip

* remove deprecated test

* fix lint

* fix lint

* fix lint

---------

Co-authored-by: Junru Shao <[email protected]>

* Misc Cleanups of Compilation Pipeline (mlc-ai#1165)

* Support CUDA Multi-Arch Compilation (mlc-ai#1166)

* [Bugfix] Cannot find global function `mlc.llm_chat_create` (mlc-ai#1167)

* Fix RWKV Support (mlc-ai#1136)

I successfully ran the rwkv-world-3b fp16 model on my Xiaomi phone. This PR fixes a bug on the main branch where the rwkv model outputs only one word and then stops.

![image](https://github.com/mlc-ai/mlc-llm/assets/35585791/6514d6ef-c93c-4ad2-8e76-8ffa0663080f)

* Auto updated submodule references

* Fix Android app Permission denied error on Android 10  (mlc-ai#1175)

Use scoped storage instead of Downloads directory

Co-authored-by: Animesh Bohara <[email protected]>

* [SLM] Fix group quantization (mlc-ai#1172)

This PR fixes the group quantization and adds related unit tests.

* [Fix] TIR block name of dequantization (mlc-ai#1177)

* [SLM][AutoLLM] Enable Command Line Weight Conversion (mlc-ai#1170)

This PR enables weight conversion in command line.
Sample command: `python3 -m mlc_chat.cli.convert_weight --config dist/models/llama-2-13b-chat-hf/ --quantization "q4f16_1" --output dist/test/`

* [Fix][SLM] Update q4f16 quantization with the new mutator name rule (mlc-ai#1178)

[Fix] Update q4f16 quantization with the new mutator name rule

* [Model Support][SWA] Add support for sliding window attention for Mistral (mlc-ai#1087)

* mistral base

* Add sliding window mask making and its tests

* Small changes for sliding window mask

* Clean up mask making

* Remove kv_seq_len

* Add prefill chunking, handle max window size in SWA

* Add interleave kv

* Temporary fix for kv seq len

* Pass in more shapes to SWA prefill and decode in runtime

* mistral var fix

* Small changes regarding shape passing

* Small fix on chunk size

* Add build args, fix mlc chat config dump

* mistral system prompt
---------

Co-authored-by: David Pissarra <[email protected]>
Co-authored-by: David Pissarra <[email protected]>

* Add Python API for Weight Conversion (mlc-ai#1182)

This PR primarily does a major refactoring to introduce a Python API that
is consistent with the CLI API. Besides, it includes the following
fixes and enhancements:

- More info provided to `isort` for better formatting in `pyproject.toml`;
- Print out the default value of all arguments in argparse command line;
- Ensure `--device` is always available locally when doing weight
  conversion;
- Add argument echoing in weight conversion to be consistent with its
  counterpart in compilation;
- Add a consistency checker to make sure the shapes/dtypes of all
  tensors from weight conversion are consistent with compilation;
- Echo the total size of parameters;
- Better logging of each parameter's shape and dtype, and whether or not
  it is quantized;
- More structure robustification, renaming `parameter/` to `loader/` to
  be more explicit about its intention;
- Inline and remove `ParamQuantizer` into the loader to improve logging
  and the logic flow;
- Always add instructions "Use `--xxx` to override" for any options that
  are auto detected to be more informative to end users;
- Fix wrong shape calculation when quantizing `nn.Embedding`;
- Fix wrong dtype calculation in group quantization when the input dtype
  is different from model dtype (e.g. "float32" in torch, but the model
  dtype in quantization is fp16 in `q4f16_1`);
- Fix inconsistent param names in layers such as `GroupQuantizeLinear`;
- Fix dtype inconsistency when a parameter is not quantized;
- Fix existing unittests.

* Merge `llama_config.CONFIG` into `MODEL_PRESETS` (mlc-ai#1188)

* Merge llama_config.py into llama_model.py (mlc-ai#1189)

* Add CodeLlama as part of model presets (mlc-ai#1190)

* [Docs] Clarify zstd installation on Windows (mlc-ai#1191)

* [Docs] Clarify zstd installation on Windows (mlc-ai#1196)

Update zstd installation

* Support overriding `--max-sequence-length` in command line (mlc-ai#1197)

* [RestAPI] Added docs (mlc-ai#1193)

Add docs for RestAPI

Co-authored-by: Animesh Bohara <[email protected]>

* [API] `llm-vscode` extension support (mlc-ai#1198)

This PR enables `llm-vscode` extension API support for copilot-like code completion, following [HF's LSP](https://github.com/huggingface/llm-ls). Fully compatible with `CodeLlama` and `starcoder` on mlc-llm.

- huggingface/llm-vscode#103 enhances extension user experience when used with mlc-llm rest api.

Thanks @pacman100, who came up with this on his latest blogpost: https://huggingface.co/blog/personal-copilot

* [Fix] Use `fabs` as floating point abs function in C++ (mlc-ai#1202)

* Integrating MLC runtime with the new compilation workflow (mlc-ai#1203)

* [Fix] Remove Redundant Warnings (mlc-ai#1204)

PR mlc-ai#1203 introduces some unnecessary and redundant logging messages.
This PR gets them removed.

* Try fix macOS build with picojson (mlc-ai#1206)

The error message below

```
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h: In member function 'std::string picojson::value::to_str() const':
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:494:37: error: expected ')' before 'PRId64'
  494 |       SNPRINTF(buf, sizeof(buf), "%" PRId64, u_.int64_);
      |               ~                     ^~~~~~~
      |                                     )
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:81:1: note: 'PRId64' is defined in header '<cinttypes>'; did you forget to '#include <cinttypes>'?
   80 | #include <errno.h>
  +++ |+#include <cinttypes>
   81 | #include <inttypes.h>

```

indicates that the `__STDC_FORMAT_MACROS` flag is not turned on for some
reason.

* Try fix macOS build with picojson again (mlc-ai#1207)

Try fix macOS build with picojson

* Auto updated submodule references

* [Fix] Keep up-to-date with upstream API change (mlc-ai#1209)

* Detect `mtriple` via LLVM (mlc-ai#1211)

* Fix Python3.8 compatibility breakage (mlc-ai#1210)

The breakage resulted from newer syntax being used for type
annotations, as part of mlc-ai#592.
So long as `mlc_chat.interface.openai_api` wasn't imported, the
breaking changes were not encountered.  In
mlc-ai#1107, the addition of `from
.interface.openai_api import ChatMessage` caused this module to be
imported, breaking compatibility of `mlc_chat.ChatModule` with
Python3.8.

This commit updates the type annotations to the supported syntax.
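
The commit message does not list the exact annotations that were changed. As a general illustration, assuming the usual culprits, subscripted builtins such as `list[str]` need Python 3.9 and the `X | None` union syntax needs Python 3.10 when annotations are evaluated at runtime, while the `typing` equivalents work on 3.8. The function below is a hypothetical example, not code from `mlc_chat.interface.openai_api`.

```python
from typing import Dict, List, Optional

# Python 3.8-compatible spelling:
#   List[Dict[str, str]]  instead of  list[dict[str, str]]
#   Optional[str]         instead of  str | None


def summarize(messages: List[Dict[str, str]], name: Optional[str] = None) -> str:
    """Join message contents, optionally prefixed with a speaker name."""
    body = " ".join(m["content"] for m in messages)
    return f"{name}: {body}" if name is not None else body
```

Libraries that inspect annotations at import time make this kind of breakage appear as soon as the module is imported, which matches the symptom described above.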

* [Slim-LM] Enable loading from AWQ pre-quantized weight. (mlc-ai#1114)

* [SLM] Enable loading from AWQ pre-quantized weight.

* remove awq_loader.py

* Update to the latest commit

* Delete llama_parameter.py

* update unittest

* fix lint

* upd

* add Llama-2-7B-AWQ

* fix

* rm

---------

Co-authored-by: David Pissarra <[email protected]>
Co-authored-by: Roee Shenberg <[email protected]>
Co-authored-by: Eric Lunderberg <[email protected]>
Co-authored-by: Yaxing Cai <[email protected]>
Co-authored-by: Charlie Ruan <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: yongjer <[email protected]>
Co-authored-by: Jeethu Rao <[email protected]>
Co-authored-by: Junru Shao <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Co-authored-by: Denise Kutnick <[email protected]>
Co-authored-by: Lesheng Jin <[email protected]>
Co-authored-by: Junru Shao <[email protected]>
Co-authored-by: Sunghyun Park <[email protected]>
Co-authored-by: “Sunghyun <[email protected]>
Co-authored-by: Rick Zhou <[email protected]>
Co-authored-by: Varshith Bathini <[email protected]>
Co-authored-by: Varshith <[email protected]>
Co-authored-by: Tianqi Chen <[email protected]>
Co-authored-by: Git bot <[email protected]>
Co-authored-by: SingLi <[email protected]>
Co-authored-by: Kartik Khandelwal <[email protected]>
Co-authored-by: Goutham Tamilselvan <[email protected]>
Co-authored-by: S A G A R <[email protected]>
Co-authored-by: Yuchen Jin <[email protected]>
Co-authored-by: DavidSharma <[email protected]>
Co-authored-by: fennecJ <[email protected]>
Co-authored-by: Xiyou Zhou <[email protected]>
Co-authored-by: Xiaoyu Zhang <[email protected]>
Co-authored-by: Animesh Bohara <[email protected]>
Co-authored-by: Animesh Bohara <[email protected]>
Co-authored-by: David Pissarra <[email protected]>
1 parent 84297df commit f369d7f
Showing 59 changed files with 5,157 additions and 637 deletions.
@@ -171,7 +171,7 @@ class AppViewModel(application: Application) : AndroidViewModel(application) {
val url = URL("${modelUrl}${ModelUrlSuffix}${ModelConfigFilename}")
val tempId = UUID.randomUUID().toString()
val tempFile = File(
-Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS),
+application.getExternalFilesDir(Environment.DIRECTORY_DOWNLOADS),
tempId
)
url.openStream().use {
4 changes: 4 additions & 0 deletions cpp/conv_templates.cc
@@ -49,6 +49,10 @@ Conversation Llama2() {
Conversation MistralDefault() {
Conversation conv;
conv.name = "mistral_default";
+conv.system =
+("[INST] Always assist with care, respect, and truth. Respond with utmost utility yet "
+"securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies "
+"promote fairness and positivity.");
conv.roles = {"[INST]", "[/INST]"};
conv.messages = {};
conv.offset = 0;
6 changes: 3 additions & 3 deletions cpp/image_embed.cc
@@ -9,10 +9,10 @@
#include "image_embed.h"

#include <picojson.h>
+#include <tvm/runtime/memory/memory_manager.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/registry.h>
-#include <tvm/runtime/relax_vm/memory_manager.h>

#include <cctype>
#include <chrono>
@@ -59,9 +59,9 @@ class LLMImage {
ICHECK(fload_exec.defined()) << "TVM runtime cannot find vm_load_executable";
vm_ = fload_exec();
vm_->GetFunction("vm_initialization")(static_cast<int>(device_.device_type), device_.device_id,
-static_cast<int>(relax_vm::AllocatorType::kPooled),
+static_cast<int>(memory::AllocatorType::kPooled),
static_cast<int>(kDLCPU), 0,
-static_cast<int>(relax_vm::AllocatorType::kPooled));
+static_cast<int>(memory::AllocatorType::kPooled));

embed_func_ = vm_->GetFunction("embed");

61 changes: 61 additions & 0 deletions cpp/json_parser.h
@@ -0,0 +1,61 @@
#ifndef MLC_LLM_CPP_JSON_PARSER_H_
#define MLC_LLM_CPP_JSON_PARSER_H_

#define PICOJSON_USE_INT64
#define __STDC_FORMAT_MACROS

#include <picojson.h>
#include <tvm/runtime/container/shape_tuple.h>
#include <tvm/runtime/data_type.h>
#include <tvm/runtime/logging.h>

namespace mlc {
namespace llm {
namespace json {

template <typename ValueType>
inline ValueType Lookup(const picojson::object& json, const std::string& key) {
auto it = json.find(key);
CHECK(it != json.end()) << "ValueError: key `" << key << "` not found in the JSON object";
CHECK(it->second.is<ValueType>()) << "ValueError: key `" << key << "` has unexpected type";
return it->second.get<ValueType>();
}

template <>
inline tvm::runtime::DataType Lookup(const picojson::object& json, const std::string& key) {
return tvm::runtime::DataType(tvm::runtime::String2DLDataType(Lookup<std::string>(json, key)));
}

template <>
inline tvm::runtime::ShapeTuple Lookup(const picojson::object& json, const std::string& key) {
picojson::array shape = Lookup<picojson::array>(json, key);
std::vector<int64_t> result;
result.reserve(shape.size());
for (const picojson::value& dim : shape) {
CHECK(dim.is<int64_t>()) << "ValueError: key `" << key << "` has unexpected type";
result.push_back(dim.get<int64_t>());
}
return tvm::runtime::ShapeTuple(std::move(result));
}

inline picojson::object ParseObject(const std::string& json_str) {
picojson::value result;
std::string err = picojson::parse(result, json_str);
if (!err.empty()) {
LOG(FATAL) << "Failed to parse JSON: " << err << ". The JSON string is: " << json_str;
}
CHECK(result.is<picojson::object>())
<< "ValueError: The given string is not a JSON object: " << json_str;
return result.get<picojson::object>();
}

inline picojson::object AsJSONObject(const picojson::value& json) {
CHECK(json.is<picojson::object>()) << "ValueError: The given value is not a JSON object";
return json.get<picojson::object>();
}

} // namespace json
} // namespace llm
} // namespace mlc

#endif // MLC_LLM_CPP_JSON_PARSER_H_
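
The `Lookup` template in the new header checks that a key exists and that its value has the requested type before returning it. The block below is a minimal, dependency-free sketch of that pattern, not the header's actual code: `Value` and `Object` are hypothetical stand-ins for `picojson::value` and `picojson::object`, built on `std::variant` (C++17), and exceptions replace the `CHECK` macros.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical stand-in for picojson::value: holds one of a few scalar types.
using Value = std::variant<bool, int64_t, double, std::string>;
// Hypothetical stand-in for picojson::object.
using Object = std::map<std::string, Value>;

// Returns the value stored under `key`. Throws if the key is missing or if the
// stored value is not of type ValueType (the header uses CHECK for both cases).
template <typename ValueType>
ValueType Lookup(const Object& json, const std::string& key) {
  auto it = json.find(key);
  if (it == json.end()) {
    throw std::out_of_range("ValueError: key `" + key + "` not found in the JSON object");
  }
  const ValueType* value = std::get_if<ValueType>(&it->second);
  if (value == nullptr) {
    throw std::invalid_argument("ValueError: key `" + key + "` has unexpected type");
  }
  return *value;
}
```

Calling `Lookup<int64_t>(config, "sliding_window")` on an object that stores that key as an integer returns the integer; asking for `double` under the same key throws instead of silently converting.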
132 changes: 110 additions & 22 deletions cpp/llm_chat.cc
@@ -12,11 +12,11 @@
#include <tokenizers_cpp.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/disco/session.h>
#include <tvm/runtime/memory/memory_manager.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <tvm/runtime/relax_vm/memory_manager.h>

#include <cctype>
#include <chrono>
@@ -32,6 +32,7 @@
#include <vector>

#include "conversation.h"
#include "model_metadata.h"
#include "random.h"
#include "support.h"
#include "tokenizers.h"
@@ -158,16 +159,18 @@ struct FunctionTable {
this->local_vm = fload_exec();
this->local_vm->GetFunction("vm_initialization")(
static_cast<int>(device.device_type), device.device_id,
static_cast<int>(relax_vm::AllocatorType::kPooled), static_cast<int>(kDLCPU), 0,
static_cast<int>(relax_vm::AllocatorType::kPooled));
static_cast<int>(memory::AllocatorType::kPooled), static_cast<int>(kDLCPU), 0,
static_cast<int>(memory::AllocatorType::kPooled));
this->mod_get_func = [this](const std::string& name) -> PackedFunc {
return this->local_vm->GetFunction(name, false);
PackedFunc func = this->local_vm->GetFunction(name, false);
return func;
};
this->get_global_func = [](const std::string& name) -> PackedFunc {
const auto* f = tvm::runtime::Registry::Get(name);
CHECK(f != nullptr) << "ValueError: Cannot find function " << name;
return *f;
};
this->model_metadata_ = ModelMetadata::FromModule(this->local_vm);
this->_InitFunctions();
}
}
@@ -188,10 +191,23 @@ struct FunctionTable {
const PackedFunc* fload_cache = tvm::runtime::Registry::Get("vm.builtin.ndarray_cache.load");
ICHECK(fload_cache) << "TVM runtime cannot find vm.builtin.ndarray_cache.load";
(*fload_cache)(model_path, static_cast<int32_t>(device.device_type), device.device_id);
const PackedFunc* fload_params =
tvm::runtime::Registry::Get("vm.builtin.param_array_from_cache");
ICHECK(fload_params) << "Cannot find env function vm.builtin.param_array_from_cache";
Array<NDArray> params = (*fload_params)("param", -1);
Array<NDArray> params;
if (this->model_metadata_.params.empty()) {
constexpr const char* name_loader = "vm.builtin.param_array_from_cache";
const PackedFunc* fload_params = tvm::runtime::Registry::Get(name_loader);
ICHECK(fload_params) << "Cannot find env function: " << name_loader;
params = (*fload_params)("param", -1);
} else {
constexpr const char* name_loader = "vm.builtin.param_array_from_cache_by_name";
const PackedFunc* fload_params = tvm::runtime::Registry::Get(name_loader);
ICHECK(fload_params) << "Cannot find env function: " << name_loader;
Array<String> param_names;
param_names.reserve(this->model_metadata_.params.size());
for (const auto& param : this->model_metadata_.params) {
param_names.push_back(param.name);
}
params = (*fload_params)(param_names);
}
// after we get params, it is safe to simply clear the cached version
// as these params are referenced by params_
const PackedFunc* fclear_ndarray_cache =
@@ -210,6 +226,9 @@ struct FunctionTable {
this->softmax_func_ = mod_get_func("softmax_with_temperature");
this->encoding_without_cache_func_ = mod_get_func("encoding_without_cache");
this->create_kv_cache_func_ = mod_get_func("create_kv_cache");
if (this->create_kv_cache_func_ == nullptr) {
this->create_kv_cache_func_ = mod_get_func("_initialize_effect");
}
this->reset_kv_cache_func_ = mod_get_func("reset_kv_cache");
if (this->reset_kv_cache_func_ == nullptr) {
this->reset_kv_cache_func_ = get_global_func("vm.builtin.attention_kv_cache_array_clear");
@@ -260,6 +279,7 @@ struct FunctionTable {
PackedFunc reset_kv_cache_func_;
bool support_backtracking_kv_;
PackedFunc fkvcache_array_popn_;
ModelMetadata model_metadata_;
};

} // namespace
@@ -295,6 +315,9 @@ class LLMChat {
if (ft_.use_disco) {
return false;
}
if (this->sliding_window_ != -1) {
return false;
}
PackedFunc fget_metadata = ft_.mod_get_func("get_metadata");
if (fget_metadata == nullptr) {
return false;
@@ -369,6 +392,16 @@ class LLMChat {
this->max_window_size_ =
std::min(this->max_window_size_, config["max_window_size"].get<int64_t>());
}
if (config.count("sliding_window")) {
CHECK(config["sliding_window"].is<int64_t>());
CHECK(!config.count("max_window_size"))
<< "Cannot specify both sliding_window and max_window_size.";
this->sliding_window_ = config["sliding_window"].get<int64_t>();
}
if (config.count("sliding_window_chunk_size")) {
CHECK(config["sliding_window_chunk_size"].is<int64_t>());
this->sliding_window_chunk_size_ = config["sliding_window_chunk_size"].get<int64_t>();
}
if (config.count("model_name")) {
CHECK(config["model_name"].is<std::string>());
this->model_name_ = config["model_name"].get<std::string>();
@@ -462,9 +495,11 @@ class LLMChat {
// so there is no explicit abi dependency on these extra
// classes other than basic tvm runtime.
this->ft_.Init(reload_lib, device_, this->num_shards_);
UpdateMaxWindowSizeFromMetadata();
CHECK(max_window_size_ != std::numeric_limits<int64_t>::max())
<< "Key \"max_window_size\" not found.";
if (this->sliding_window_ == -1) {
UpdateMaxWindowSizeFromMetadata();
CHECK(max_window_size_ != std::numeric_limits<int64_t>::max())
<< "Key \"max_window_size\" not found.";
}
// Step 4. Initialize sample functions.
auto fsample_topp_from_prob_ptr =
tvm::runtime::Registry::Get("vm.builtin.sample_top_p_from_prob");
@@ -562,7 +597,8 @@ class LLMChat {
std::string all_prompt = GetConcatPrompt(prompts, 0, 0);
std::vector<int32_t> encoded = this->tokenizer_->Encode(all_prompt);
tokens.insert(tokens.end(), encoded.begin(), encoded.end());
if (this->total_seq_len_ + tokens.size() + gen_mean_gen_len < this->max_window_size_) {
if (this->sliding_window_ != -1 || // There is no max window size if we use sliding window
this->total_seq_len_ + tokens.size() + gen_mean_gen_len < this->max_window_size_) {
return tokens;
}
// need shift window and re-encode
@@ -753,6 +789,10 @@ class LLMChat {
if (ft_.use_disco) {
LOG(FATAL) << "NotImplementedError: Distributed inference is not supported for this model";
}
if (this->sliding_window_ != -1) {
LOG(FATAL)
<< "NotImplementedError: Sliding window attention does not support separate embedding";
}
NDArray embedding = Downcast<NDArray>(
EmbedStep(inp, append_conversation, place_in_prompt, generation_config_str));
PrefillWithEmbedStep(embedding, decode_next_token, generation_config_str);
@@ -772,8 +812,28 @@ class LLMChat {
}
auto tstart = std::chrono::high_resolution_clock::now();

int32_t new_seq_len = total_seq_len_ + token_len;
NDArray logits_on_device = this->ForwardTokens(prompt_tokens, new_seq_len);
int32_t new_seq_len = total_seq_len_;
NDArray logits_on_device;
if (this->sliding_window_ != -1) {
// Use chunking if we use sliding window attention (see Mistral paper figure 3).
int64_t sliding_window_chunk_size = this->sliding_window_chunk_size_;
if (this->sliding_window_chunk_size_ == -1) {
// One chunk if chunk size not specified
sliding_window_chunk_size = token_len;
}
for (int64_t begin = 0; begin < token_len; begin += sliding_window_chunk_size) {
int64_t end = std::min(token_len, begin + sliding_window_chunk_size);
std::vector<int32_t> chunk =
std::vector<int32_t>(prompt_tokens.begin() + begin, prompt_tokens.begin() + end);
new_seq_len += static_cast<int64_t>(chunk.size());
logits_on_device = this->ForwardTokens(chunk, new_seq_len);
}
ICHECK_EQ(new_seq_len, total_seq_len_ + token_len) << "Expected chunking to process all tokens";
} else {
// Otherwise, prefill entire prompt at once.
new_seq_len += token_len;
logits_on_device = this->ForwardTokens(prompt_tokens, new_seq_len);
}
total_seq_len_ = new_seq_len;

if (!decode_next_token) {
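
The hunk above prefills the prompt in fixed-size chunks when sliding window attention is enabled, advancing the running sequence length after each chunk. The block below is a minimal standalone sketch of that bookkeeping, assuming nothing beyond the loop shown in the diff: `ChunkedSeqLens` is a hypothetical helper, and the point where the real code calls `ForwardTokens` is marked with a comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the running sequence length after each chunk.
// A chunk_size of -1 means "one chunk containing the whole prompt".
std::vector<int64_t> ChunkedSeqLens(const std::vector<int32_t>& prompt_tokens,
                                    int64_t total_seq_len, int64_t chunk_size) {
  const int64_t token_len = static_cast<int64_t>(prompt_tokens.size());
  std::vector<int64_t> seq_lens;
  if (token_len == 0) return seq_lens;  // nothing to prefill
  if (chunk_size == -1) chunk_size = token_len;
  assert(chunk_size > 0);  // a non-positive step would never terminate

  int64_t new_seq_len = total_seq_len;
  for (int64_t begin = 0; begin < token_len; begin += chunk_size) {
    const int64_t end = std::min(token_len, begin + chunk_size);
    std::vector<int32_t> chunk(prompt_tokens.begin() + begin, prompt_tokens.begin() + end);
    new_seq_len += static_cast<int64_t>(chunk.size());
    seq_lens.push_back(new_seq_len);  // the diff calls ForwardTokens(chunk, new_seq_len) here
  }
  assert(new_seq_len == total_seq_len + token_len);  // every token was consumed exactly once
  return seq_lens;
}
```

With a 10-token prompt, 5 tokens already in the sequence, and a chunk size of 4, the chunks hold 4, 4, and 2 tokens, so the running lengths are 9, 13, and 15.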
@@ -957,15 +1017,15 @@ class LLMChat {
}
if (generation_config.count("presence_penalty")) {
CHECK(generation_config["presence_penalty"].is<double>());
CHECK(abs(generation_config["presence_penalty"].get<double>()) <= 2)
CHECK(fabs(generation_config["presence_penalty"].get<double>()) <= 2)
<< "Presence penalty must be in the range -2 to 2!";
*gen_presence_penalty = generation_config["presence_penalty"].get<double>();
} else {
*gen_presence_penalty = this->presence_penalty_;
}
if (generation_config.count("frequency_penalty")) {
CHECK(generation_config["frequency_penalty"].is<double>());
CHECK(abs(generation_config["frequency_penalty"].get<double>()) <= 2)
CHECK(fabs(generation_config["frequency_penalty"].get<double>()) <= 2)
<< "Frequency penalty must be in the range -2 to 2!";
*gen_frequency_penalty = generation_config["frequency_penalty"].get<double>();
} else {
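
The hunk above replaces `abs` with `fabs` in both penalty range checks. The commit does not state the reason; one plausible motivation is that an unqualified `abs` on a `double` can, depending on which headers are visible, resolve to the C function `int abs(int)`, which truncates the argument before the comparison. The sketch below uses two hypothetical helpers to show the difference; the second one only simulates the truncation and is not code from the commit.

```cpp
#include <cmath>

// Hypothetical helper mirroring the fixed check: true when penalty is in [-2, 2].
bool PenaltyInRange(double penalty) { return std::fabs(penalty) <= 2.0; }

// Simulates an integer-only abs: the fractional part is dropped before the
// comparison, so out-of-range values such as 2.5 slip through.
bool PenaltyInRangeTruncating(double penalty) {
  const int truncated = static_cast<int>(penalty);  // what int abs(int) would receive
  const int magnitude = truncated < 0 ? -truncated : truncated;
  return magnitude <= 2;
}
```

For a penalty of 2.5, `PenaltyInRange` rejects the value while `PenaltyInRangeTruncating` accepts it, which is the kind of silent pass a floating-point absolute value avoids.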
@@ -1108,7 +1168,12 @@ class LLMChat {

if (static_cast<int64_t>(output_ids_.size()) >= gen_max_gen_len) {
stop_triggered_ = true;
} else if (total_seq_len_ >= max_window_size_) {
}
// max_window_size_ != -1 to handle
// https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/rwkv.py#L588-L589
// sliding_window_ == -1 to make sure we do not stop when using sliding window
else if (max_window_size_ != -1 && sliding_window_ == -1 &&
total_seq_len_ >= max_window_size_) {
stop_triggered_ = true;
}
if (stop_triggered_) {
@@ -1122,7 +1187,18 @@ class LLMChat {
if (input_tokens.size() > 1 && ft_.prefill_func_.defined()) {
ObjectRef input_data = ft_.CopyToWorker0(this->GetInputTokenNDArray(input_tokens));
ShapeTuple cur_pos_shape = ShapeTuple({cur_pos});
ret = ft_.prefill_func_(input_data, cur_pos_shape, kv_cache_, params_);
if (sliding_window_ == -1) {
ret = ft_.prefill_func_(input_data, cur_pos_shape, kv_cache_, params_);
} else {
// Sliding window attention needs extra shape parameters
int64_t seq_len = static_cast<int64_t>(input_tokens.size());
// Number of elements in the cache
int64_t cache_len = std::min(this->sliding_window_, cur_pos - seq_len);
ShapeTuple cache_len_shape = ShapeTuple({cache_len});
ShapeTuple kv_seq_len_shape = ShapeTuple({cache_len + seq_len});
ret = ft_.prefill_func_(input_data, cur_pos_shape, cache_len_shape, kv_seq_len_shape,
kv_cache_, params_);
}
} else {
// running decode function when prefill is not available
for (int i = 0; i < input_tokens.size(); ++i) {
@@ -1135,8 +1211,19 @@ class LLMChat {
input_data = ft_.CopyToWorker0(this->GetInputTokenNDArray({input_tokens[i]}));
}
int64_t pos = cur_pos + i + 1 - input_tokens.size();
ShapeTuple pos_shape = ShapeTuple({cur_pos});
ret = ft_.decode_func_(input_data, pos_shape, kv_cache_, params_);
ShapeTuple pos_shape = ShapeTuple({pos});
if (sliding_window_ == -1) {
ret = ft_.decode_func_(input_data, pos_shape, kv_cache_, params_);
} else {
// Sliding window attention needs extra shape parameters
int64_t seq_len = static_cast<int64_t>(input_tokens.size());
// Number of elements in the cache
int64_t cache_len = std::min(this->sliding_window_, pos - seq_len);
ShapeTuple cache_len_shape = ShapeTuple({cache_len});
ShapeTuple kv_seq_len_shape = ShapeTuple({cache_len + seq_len});
ret = ft_.decode_func_(input_data, pos_shape, cache_len_shape, kv_seq_len_shape,
kv_cache_, params_);
}
}
}
if (ft_.use_disco) {
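
With sliding window attention enabled, the two hunks above pass two extra shape parameters to the prefill and decode functions: the number of entries already in the cache, capped at the window size, and that count plus the length of the incoming tokens. The block below is a minimal sketch of the arithmetic only. `SlidingWindowShapes` and `ComputeShapes` are hypothetical names, and it assumes `cur_pos` is the total sequence length after the step, as the call `ForwardTokens(chunk, new_seq_len)` elsewhere in this diff suggests.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for the two extra shape values.
struct SlidingWindowShapes {
  int64_t cache_len;   // entries already held in the KV cache, capped at the window
  int64_t kv_seq_len;  // cached entries plus the tokens processed in this step
};

// Mirrors the expressions in the diff:
//   cache_len  = min(sliding_window, cur_pos - seq_len)
//   kv_seq_len = cache_len + seq_len
SlidingWindowShapes ComputeShapes(int64_t sliding_window, int64_t cur_pos, int64_t seq_len) {
  const int64_t cache_len = std::min(sliding_window, cur_pos - seq_len);
  return {cache_len, cache_len + seq_len};
}
```

For a first prefill of 10 tokens (`cur_pos` of 10) the cache is empty, so the values are 0 and 10. Once 4900 tokens precede a 100-token step under a 4096-entry window, the cache length saturates at 4096 and the combined length is 4196.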
@@ -1262,9 +1349,10 @@ class LLMChat {
Conversation conversation_;
// total sequence len,
int64_t total_seq_len_{0};
// max window size, mean generation length
// max window size, mean and max generation length, sliding window
// If we use sliding window, max window size is its default max() value
int64_t max_window_size_{std::numeric_limits<int64_t>::max()}, mean_gen_len_{128},
max_gen_len_{512};
max_gen_len_{512}, sliding_window_{-1}, sliding_window_chunk_size_{-1};
// size of the vocab table
int64_t vocab_size_;
// number of shards in distributed inference
0 comments on commit f369d7f
