Upstream merge oct31 #41

masahi · 2023-10-30T23:58:44Z

Note that this merges mlc-ai/main, not mlc-ai/serving.

Workflow change:

Command line: `--batched` is replaced by `--enable-batching --use-vllm-attention`.
The artifact path no longer ends with the `-batched` suffix.

…1017) This commit adds an optional `--pdb` flag to the `build.py` script. If passed, any exception raised that would otherwise terminate the script will first enter a pdb post-mortem, allowing the error to be inspected.

This PR separates the tokenizer creation function and the random number generator out of `llm_chat.cc` as a preparation step for batching inference support, since these functions/modules are used in the same way in batching inference.

* [Transform] Apply split_rotary optimization on prefill

  Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.

  This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.

* Avoid multiple kernel launches for split_rotary

This PR introduces the initial batched input support for llama models. To keep the code manageable, we keep both the single-sequence handling flow and the batching handling flow in the Llama modeling.

Now, with `--enable-batching` as a build argument, we build Llama for the batched version.

NOTE: The paged attention kernel/TIR func are not included in this PR, so currently the built library with batching enabled is not runnable. We will follow up with the attention kernel in the future.

This PR guarantees that the existing single-sequence inference (Python API, CLI, etc.) is not broken.

P.S. The batching flow is subject to bug fixes as we integrate with the attention function and run the e2e flow in the future.

This commit replaces the single-parameter `relax_model.param_manager.create_quantize_func` function with a method on the `ParamManager`, `create_parameter_transformation`. This avoids potential typos between `param_manager` as the imported Python module `mlc_llm.relax_model.param_manager` and an instance of the `ParamManager` class named `param_manager`, and makes the functionality easier to find. This function also takes an optional `optimize_parameter_order` flag, defaulting to `True`, which applies the `ReorderTransformFunc` pass. Since the `ReorderTransformFunc` is intended to be used with several configuration objects owned by `ParamManager`, this simplifies the common path of producing an optimally-ordered parameter transformation module.

PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This left some existing demos unable to run, since we did not do a round of model library updates. This PR reverts the ChatModule change and adds back the softmax function in the non-batching case. With this PR, the regression should be fixed.

…ai#1074) This PR lifts the device string parsing (just a few lines) into a standalone function, so that the serving side can make use of it as well. Tested with the Python API; it does not seem to incur a regression.
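A standalone device-string parser of the kind described above might look roughly like the following. This is a minimal sketch, not the function from the PR: the name `parse_device_str`, the `name:index` format, and the default index of 0 are all assumptions made for illustration.

```python
from typing import Tuple


def parse_device_str(device: str) -> Tuple[str, int]:
    """Split a string such as "cuda:1" into ("cuda", 1).

    Hypothetical helper: the accepted format and the default index
    are assumptions, not taken from the MLC code base.
    """
    name, sep, index = device.partition(":")
    if not name:
        raise ValueError(f"Missing device name in {device!r}")
    if not sep:
        # No explicit index given, fall back to device 0.
        return name, 0
    if not index.isdecimal():
        raise ValueError(f"Invalid device index in {device!r}")
    return name, int(index)
```

Keeping the parsing in one free function means a chat module and a serving entry point can share the same validation instead of duplicating it.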

This PR establishes the compiler components in MLC-Chat Python API, which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definitions of an LLM, which, as the very first stab, contain only `LlamaForCasualLM`. It is decomposed into three files:

- `llama_config.py`: common configurations for Llama, where we define relevant configurations of its architecture, as well as include standard config files for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and PyTorch parameters.

The parameters component contains the basic functionality of parameter mapping, and the loaders that effectively convert parameters from PyTorch to MLC according to the mapping specified. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading-time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized.

Two unit tests exemplify how the infrastructure works:

- `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which now contains two utils:

- `config.py`, which supports reading configurations into dataclasses from a JSON file or Python dict. On top of the Python dataclass, it throws irrelevant fields into `cls.kwargs`, which is helpful when loading a HuggingFace configuration file;
- `tqdm.py`, which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm.
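The `config.py` behaviour described above (known fields populate the dataclass, everything else is collected in `kwargs`) can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `ExampleConfig`, its two fields, and `from_dict` are hypothetical names.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ExampleConfig:
    """Hypothetical config; the field names are chosen for illustration."""

    hidden_size: int
    num_hidden_layers: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ExampleConfig":
        # Names of the declared fields, excluding the catch-all itself.
        known_names = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        known = {k: v for k, v in source.items() if k in known_names}
        # Anything the dataclass does not declare is kept, not rejected.
        extra = {k: v for k, v in source.items() if k not in known_names}
        return cls(**known, kwargs=extra)
```

For example, `ExampleConfig.from_dict({"hidden_size": 4096, "num_hidden_layers": 32, "torch_dtype": "float16"})` keeps `torch_dtype` in `kwargs` rather than raising on an unexpected key, which is what makes this pattern convenient for configuration files that carry many fields the compiler does not need.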

mlc-ai#1120) Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)" This reverts commit e5927ce. This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime. An example: mlc-ai#1060 (comment)

* add presence and frequency penalty
* Added support for passing conversation history in /v1/chat/completions endpoint
* Added support for RestAPI parameters max_gen_len, n, and stop_str
* add presence and frequency penalty to generation config
* refactor generation config
* Added documentation for parameters
* replace lib_path with model_lib_path in rest.py
* fixed black isort issues
* fix lib_path
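For readers unfamiliar with presence and frequency penalties, the sketch below applies them to a logit vector using the formula given in the OpenAI API documentation: each already-generated token has its logit reduced by `frequency_penalty * count` plus a flat `presence_penalty`. Whether this PR applies the penalties in exactly this form is an assumption, and `apply_penalties` is a hypothetical name.

```python
from collections import Counter
from typing import List, Sequence


def apply_penalties(
    logits: Sequence[float],
    generated: Sequence[int],
    presence_penalty: float,
    frequency_penalty: float,
) -> List[float]:
    """Return a penalized copy of `logits` (illustrative sketch only)."""
    counts = Counter(generated)
    out = list(logits)
    for token_id, count in counts.items():
        # Frequency term scales with how often the token appeared;
        # presence term is applied once if it appeared at all.
        out[token_id] -= frequency_penalty * count + presence_penalty
    return out
```

Tokens that have not been generated yet are left untouched, so positive penalties only ever push the model away from repeating itself.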

…lc-ai#1127) Prior to this commit, the `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow.

This commit updates the `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657.

In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on an `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return an `ir.transform.Pass`.

This commit is a repeat of the reverted PR mlc-ai#1056. This PR resolves the bug in the earlier implementation by removing the call to `.without_attr("num_input")` in `ParamReplacer.rewrite_func`. This follows an analogous update in `LiftTransformParams`, preserving the `"num_input"` attribute for use in `BundleModelParams`.

* Update faq.rst
* Update guideline.rst
* Update compile_models.rst
* Update distribute_compiled_models.rst
* Update get-vicuna-weight.rst
* Update python.rst
* Update android.rst
* Update cli.rst
* Update ios.rst
* Update javascript.rst
* Update python.rst
* Update rest.rst

When a user runs `mlc_chat_cli --help`, the output looks something like `Usage: mlc_chat [--help] ...`. That's because the program name specified in `cli_main.cc` is "mlc_chat". It is less confusing if the help output shows `Usage: mlc_chat_cli [--help] ...`.

This PR adds support for compiling preset models without having to provide a `config.json` on disk, using the commands below:

```diff
 python -m mlc_chat.cli.compile \
     --quantization q4f16_1 -o /tmp/1.so \
-    --config /models/Llama-2-7b-chat-hf
+    --config llama2_7b
```

This allows easier testing and binary distribution without having to depend on an external model directory.

The existing dlight optimization only works for NT matmul, but not NN. As a result, the new `nn.Module`-based implementation, which uses NN matmul, fails compilation at HEAD for now. This PR fixes this issue by tweaking `k` to the preferred layout.

The following commands now work with the new compilation pipeline:

```bash
python -m mlc_chat.cli.compile --config llama2_7b --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_13b --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so
```

Note that the quantization algorithm per se, `q4f16_1`, has not been implemented yet, meaning this code path is not yet ready for use.

…1125) Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.

This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.

This commit reapplies the reverted commit mlc-ai#1033. The error in the previous implementation was in the definition of `rotary_embedding_offset`, which provided the `query_sequence_length` instead of `kv_sequence_length`. This was able to pass the validity tests described [here](mlc-ai#1058 (comment)), as these two sequence lengths are identical for the first call.
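Why the mix-up between the two sequence lengths went unnoticed on the first call can be shown with a small sketch. It assumes the common convention that the rotary offset is the number of tokens already in the KV cache, i.e. `kv_len - query_len`; the exact definition used in the pass is not shown in this thread, and both function names below are hypothetical.

```python
from typing import List


def rotary_positions(query_len: int, kv_len: int) -> List[int]:
    """Absolute positions of the query tokens, where kv_len counts all
    cached tokens including the current query (assumed convention)."""
    offset = kv_len - query_len
    return list(range(offset, offset + query_len))


def rotary_positions_buggy(query_len: int, kv_len: int) -> List[int]:
    """Same computation with query_len used where kv_len belongs."""
    offset = query_len - query_len  # always 0, kv_len is ignored
    return list(range(offset, offset + query_len))
```

On the first (prefill) call both lengths are equal, so the two functions agree; on a later decode step with `query_len=1` and `kv_len=5`, the correct version yields position 4 while the buggy one yields 0.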

…lc-ai#1095) This is a follow-up to mlc-ai#1017, which added a `--pdb` flag to enter a debugger on exit. This commit checks the type of the raised exception, and only enters the debugger if it is a subclass of `Exception`. This ensures that implementation details, such as a thrown `SystemExit` or `KeyboardInterrupt`, do not cause an erroneous entry to pdb.
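The behaviour described for the `--pdb` flag can be sketched in a few lines. This is a minimal illustration, not the code in `build.py`: `run_with_debug` and its `post_mortem` parameter are hypothetical, and re-raising after the debugger session is a design choice made here, not something the commit message states.

```python
import pdb
import sys
import traceback


def run_with_debug(main, enable_pdb, post_mortem=pdb.post_mortem):
    """Run main(); on an Exception subclass, optionally open a post-mortem."""
    try:
        return main()
    except Exception:
        # SystemExit and KeyboardInterrupt derive from BaseException,
        # not Exception, so they bypass this handler entirely.
        if enable_pdb:
            traceback.print_exc()
            post_mortem(sys.exc_info()[2])
        raise
```

Catching `Exception` rather than `BaseException` is what keeps a normal `sys.exit()` or a Ctrl-C from dropping the user into a debugger. The `post_mortem` parameter exists only so the behaviour can be exercised without an interactive session.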

davidpissarra and others added 30 commits October 7, 2023 22:36

[API] Add GenerationConfig (mlc-ai#1024)

898db76

Fix two bugs in kv-cache backtrack loop (mlc-ai#856)

ad3a6b9

Fix two bugs in kv-cache pop loop Bug 1: old code would stop early because output_ids was shortened in-place during the loop Bug 2: off-by-one in backoff size due to break
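The two failure modes named in this commit message are easy to reproduce in isolation. The sketch below is a Python illustration of the bug class only, not the code from the commit; `backoff_buggy`, `backoff_fixed`, and `contains_stop` are hypothetical names, and the stop condition is simplified to a predicate over the token list.

```python
from typing import Callable, List, Sequence, Tuple


def backoff_buggy(
    output_ids: Sequence[int], contains_stop: Callable[[List[int]], bool]
) -> Tuple[List[int], int]:
    ids = list(output_ids)
    backoff = 0
    i = 0
    while i < len(ids):  # Bug 1: the bound shrinks as ids is popped
        ids.pop()
        if not contains_stop(ids):
            break  # Bug 2: exits before the final pop is counted
        backoff += 1
        i += 1
    return ids, backoff


def backoff_fixed(
    output_ids: Sequence[int], contains_stop: Callable[[List[int]], bool]
) -> Tuple[List[int], int]:
    ids = list(output_ids)
    backoff = 0
    for _ in range(len(ids)):  # length is snapshotted before any pop
        if not contains_stop(ids):
            break
        ids.pop()
        backoff += 1  # every pop is counted
    return ids, backoff
```

With a predicate such as `lambda ids: 9 in ids`, the buggy version stops halfway through `[9] * 6`, and for `[1, 2, 3, 9, 9, 9, 9]` it pops four tokens but reports a backoff of three. Any cache that is later rolled back by the reported count would then be out of step with the token list.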

[Android] Use AlertDialog instead of Toast (mlc-ai#1039)

bae37b3

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (mlc-…

b44f679

…ai#1040) Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model

[Android] Add Llama2 q4f16_0 (mlc-ai#1041)

3a9849a

llama2 q4f16_0

[Docs] Model prebuilts tracking page revamp (mlc-ai#1000)

bed9e60

Update compile_models.rst (mlc-ai#1038)

c02fdaf

fix permission issue

Support for the Stable LM 3B model (mlc-ai#1008)

85001ed

Support for the stablelm-3b-4e1t model

[Docs] Iterate model prebuilts docs (mlc-ai#1043)

a032d40

* Iterate model prebuilts docs * small fix

Update README.md

a58605f

Update README.md (mlc-ai#1045)

20131fb

Update README.md

add verbose stats to mlc-chat REST API (mlc-ai#1049)

1e6fb11

* add verbose stats to mlc-chat REST API * update docs

[Docs] Add mlc.ai/package to DEPENDENCY INSTALLATION group (mlc-a…

98ebd28

…i#1055) Co-authored-by: Junru Shao <junrushao1994@gmail.com>

Revert "[Transform] Apply split_rotary optimization on prefill (mlc-a…

bfaa5b9

…i#1033)" (mlc-ai#1058) This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment)

[BugFix] Set the right max_sequence_length for both Llama-1 and Lla…

ca8c11b

…ma-2 families (mlc-ai#1032) * fix * reflect feedback --------- Co-authored-by: Sunghyun <sunggg@umich.com>

[Doc] Use -U instead of --force-reinstall (mlc-ai#1062)

edab9b5

`--force-reinstall` will reinstall all dependencies to a python package, which is unnecessary. `-U` is a better choice in this case.

Fix Stable LM 3B build (mlc-ai#1061)

c2b8cbc

* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size" * Add get_num_key_value_heads method to StableLM3bConfig

[Core] Remove duplication in MODEL.get_model calls (mlc-ai#1054)

481cd92

This commit removes the `if`/`elif` chain in `core.py`, where the body of each conditional assigns the same `mod, param_manager, params, model_config`, and is identical except for the choice of model being built.
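Replacing an `if`/`elif` chain whose branches differ only in the model being built is usually done with a lookup table. The following is a generic sketch of that refactoring, not the code in `core.py`: the builder functions, the `MODEL_BUILDERS` table, and the returned tuples are placeholders.

```python
from typing import Any, Callable, Dict, Tuple


def build_llama(args: Any) -> Tuple[str, Any]:
    # Placeholder for a real model-construction call.
    return ("llama", args)


def build_gpt_neox(args: Any) -> Tuple[str, Any]:
    # Placeholder for a real model-construction call.
    return ("gpt_neox", args)


# One table entry per model replaces one if/elif branch per model.
MODEL_BUILDERS: Dict[str, Callable[[Any], Tuple[str, Any]]] = {
    "llama": build_llama,
    "gpt_neox": build_gpt_neox,
}


def get_model(model_category: str, args: Any) -> Tuple[str, Any]:
    try:
        builder = MODEL_BUILDERS[model_category]
    except KeyError:
        raise ValueError(f"Model {model_category} not supported") from None
    return builder(args)
```

Adding a model then means adding one table entry, and the call site that unpacks the result exists only once.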

Minor typo fix (mlc-ai#1064)

9010d48

Add links to Python API Reference (mlc-ai#1068)

b0bfc88

[ParamManager] Added progress bar for get_item/set_item (mlc-ai#1063)

d202077

[Bugfix] Compilation Error in q4f32_1 (mlc-ai#1078)

3aefd9f

The pass `fuse-split-rotary` assumes the compute dtype is fp16, which it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the compute is based on fp32 instead. This PR strengthens the check guard.

Git bot and others added 29 commits October 24, 2023 07:30

Auto updated submodule references

8ce7793

fix mismatched argument name (mlc-ai#1117)

488017d

fix error introduced by recent code changes fixes mlc-ai#1116

[Docs] Add doc for max and mean gen len, shift factor; and buildArgs (m…

206103b

…lc-ai#1119) * Add doc for max and mean gen len, shift factor * Update python docs for BuildArgs

Remove inaccurate warning message (mlc-ai#1121)

9cb8e8e

This PR removes an inaccurate warning from mlc-ai#1086, which warns about `model_lib` overriding regardless of whether or not it's actually overridden. With this commit, we only warn if its value is not None.

Add --opt flag parsing to CLI (mlc-ai#1123)

a4279e3

added details to windows installation (mlc-ai#1133)

24f795e

The 32-bit version of the zstd.dll library was causing issues, so the doc was updated to be more specific and to download the 64-bit version.

Minor enhancements to ChatModule (mlc-ai#1132)

2ec0cc8

Some minor enhancements to `ChatModule`: mainly, device parsing is now handled solely in `_parse_device_str` instead of in both the member function and the `__init__` function, to avoid redundancy; plus some type annotation fixes.

Updating tvm install docs (mlc-ai#1143)

27ac5ac

Updating the tvm install docs to assist a user in finding and copying zstd.dll to the correct folder.

Support parameter packing (mlc-ai#1146)

878ae84

[Slim-LM] Enable Group Quant (mlc-ai#1129)

c0c3a8d

* Enable group quant via new interface.
* Minor fix.
* Linting.
* Fix isort.
* Fix mypy.
* TE compute working.
* Skip embed.
* Support cpu+gpu quantization.
* Add target option to tests.
* Linting.

Enable Mypy and Pylint in mlc_chat Python Package (mlc-ai#1149)

2193767

Migrate Compiler Passes (mlc-ai#1150)

0a25374

Add batched Llama model definition using vLLM paged attention (mlc-ai…

fee2cb5

…#1134)
* Add batched Llama model with vllm paged attention
* update core.py
* doc
* minor
* add e2e test
* mv file
* clean
* Check if TVM has been built with USE_VLLM
* update BuildArgs docstring

Apply rewrite for normal attention and MQA (mlc-ai#1138)

b190578

Fixes a bug introduced in mlc-ai#1052, where use of the `--use-flash-attn-mqa` flag on a model that doesn't use MQA would prevent the use of CUTLASS attention at all.

[Rest] Fix emoji handling in Rest API. (mlc-ai#1142)

8ca0176

[Utils] Remove conversion to numpy array in utils.save_params (mlc-ai…

0a9d6c7

…#1083) Prior to this commit, each parameter was converted to a numpy-owned array as part of a total size computation. This commit computes the size directly, removing the conversion.
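Computing a total parameter size directly from shapes and dtypes, without materializing each array, comes down to a product and a multiplication. The sketch below represents each parameter as a `(shape, dtype)` pair; that representation, the `_DTYPE_BYTES` table, and the function name are assumptions for illustration and do not reflect the TVM array API used in `utils.save_params`.

```python
import math
from typing import Iterable, Sequence, Tuple

# Bytes per element for a few dtypes (illustrative subset).
_DTYPE_BYTES = {"float16": 2, "float32": 4, "int8": 1, "uint32": 4}


def total_param_bytes(params: Iterable[Tuple[Sequence[int], str]]) -> int:
    """Sum element_count * bytes_per_element over all parameters."""
    return sum(
        math.prod(shape) * _DTYPE_BYTES[dtype] for shape, dtype in params
    )
```

Because only metadata is read, no copy of the tensor data is made, which is the saving the commit message describes.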

[Fix][REST] Use lowered-cased "app" (mlc-ai#1159)

425a2cb

sync llama def

ed10571

Merge remote-tracking branch 'mlc-ai/main' into upstream-merge-oct31

bab2e50

fi

2252c66

masahi merged commit ba6d4ea into octoml:batch-serving Oct 30, 2023
5 checks passed
