Upstream merge Dec 01 #94

masahi · 2023-12-01T08:16:53Z

No description provided.

…1017) This commit adds an optional `--pdb` flag to the `build.py` script. If passed, any exception raised that would otherwise terminate the script will first enter a pdb post-mortem, allowing the error to be inspected.
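
The behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `build.py`: the helper names `run_with_optional_pdb` and `parse_args` are hypothetical, and the `post_mortem` parameter exists only so the debugger hook can be swapped out in tests.

```python
import argparse
import pdb
import sys


def run_with_optional_pdb(main, use_pdb, post_mortem=pdb.post_mortem):
    """Call main(); on an exception, optionally open a post-mortem first."""
    try:
        return main()
    except Exception:
        if use_pdb:
            # Inspect the frame where the exception was raised.
            post_mortem(sys.exc_info()[2])
        # Re-raise so the script still terminates with the original error.
        raise


def parse_args(argv):
    """Parse a hypothetical command line that only knows the --pdb flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pdb",
        action="store_true",
        help="enter a pdb post-mortem on an unhandled exception",
    )
    return parser.parse_args(argv)
```

A script would typically call `run_with_optional_pdb(main, parse_args(sys.argv[1:]).pdb)` from its entry point, so the default behaviour is unchanged when the flag is absent.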

This PR separates out the tokenizer creation function, the random number generator out from `llm_chat.cc` as a preparation step for batching inference support, since these functions/modules are also used in the same way in batching inference.

* [Transform] Apply split_rotary optimization on prefill

  Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.

  This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.

* Avoid multiple kernel launches for split_rotary

This PR introduces the initial batched input support for llama models. To keep the code manageable, we keep both the single-sequence handling flow and the batching handling flow in the Llama modeling. Now, with `--enable-batching` as a build argument, we build Llama for the batched version. NOTE: The paged attention kernel/TIR func are not included in this PR, so currently the built library with batching enabled is not runnable. We will follow up with the attention kernel in the future. This PR guarantees that the existing single-sequence inference (Python API, CLI, etc.) is not broken. P.S. The batching flow is subject to bug fixes as we integrate with the attention function and run the e2e flow in the future.

This commit replaces the single-parameter `relax_model.param_manager.create_quantize_func` function with a method on the `ParamManager`, `create_parameter_transformation`. This avoids potential typos between `param_manager` as the imported Python module `mlc_llm.relax_model.param_manager` and an instance of the `ParamManager` class named `param_manager`, and makes the functionality easier to find. This function also takes an optional `optimize_parameter_order` flag, defaulting to `True`, which applies the `ReorderTransformFunc` pass. Since the `ReorderTransformFunc` is intended to be used with several configuration objects owned by `ParamManager`, this simplifies the common path of producing an optimally-ordered parameter transformation module.

PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This left some existing demos unable to run, since we did not do a round of model library updates. This PR reverts the ChatModule change and adds back the softmax function in the non-batching case. With this PR, the regression should be fixed.

…ai#1074) This PR lifts the device string parsing (just a few lines) into a standalone function, so that the serving side can make use of it as well. Tested with the Python API; it does not seem to introduce a regression.

This PR establishes the compiler components in the MLC-Chat Python API, which currently include two primary components: models and parameters.

The models are `nn.Module`-based definitions of an LLM, which, as a first step, contain only `LlamaForCasualLM`. It is decomposed into three files:
- `llama_config.py`: common configurations for Llama, where we define the relevant configurations of its architecture and include standard config files for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and PyTorch parameters.

The parameters component contains the basic functionality of parameter mapping, and the loaders that convert parameters from PyTorch to MLC according to the specified mapping. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward to add under the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading-time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized.

Two unit tests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra, and then convert it to a TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from the HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which now contains two utils:
- `config.py`, which supports reading configurations into dataclasses from a JSON file or a Python dict. On top of Python dataclasses, it puts irrelevant fields into `cls.kwargs`, which is helpful when loading a HuggingFace configuration file;
- `tqdm.py`, which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm.
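
The idea of reading a dict into a dataclass while collecting unknown fields in `kwargs` can be sketched roughly as below. This is an illustration only, not the contents of `config.py`: `ExampleConfig` and its fields are hypothetical, and the real helper may be structured differently.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ExampleConfig:  # hypothetical config, not the real Llama config class
    hidden_size: int
    num_hidden_layers: int
    # Fields the dataclass does not declare end up here instead of raising.
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ExampleConfig":
        """Split `source` into declared fields and leftover entries."""
        known = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        fields = {k: v for k, v in source.items() if k in known}
        extra = {k: v for k, v in source.items() if k not in known}
        return cls(**fields, kwargs=extra)
```

With this shape, a HuggingFace-style dict that carries extra keys such as `torch_dtype` still loads, and the extras remain available under `kwargs`.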

* generalize `prefill-chunk-size`
* renaming `cache_len` to `rolling_cache_len`
* [nn.Module] generalize `prefill_chunk_size`
* quick fix
* lint fix
* check sw with chunking
* fix `_attach_variable_bounds`
* update config from lib metadata
* cleanup cleanup
* metadata fix

This PR moves device detection into separate subprocesses. Device detection sets up the driver, which consumes some GPU VRAM (for example, `tvm.device("opencl", 0).exist` consumes 390MB of VRAM on an RTX 4090). Consider the case where we detect whether CUDA, Vulkan and OpenCL are available. When they are all available, each detection holds some VRAM, more than 430MB altogether. If device detection runs in the main process, the VRAM it consumes is never released. In the example above, we detect CUDA, Vulkan and OpenCL but in the end prioritize the CUDA device, so the memory held by Vulkan and OpenCL detection is never released. Motivated by this issue, we run each detection in a subprocess, so that the held VRAM is released after detection.
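
A rough sketch of the subprocess approach is shown below. It is not the code from this PR: `probe_in_subprocess` and `TVM_PROBE` are hypothetical names, the probe assumes TVM is importable in the child interpreter, and reporting the result as a single `1` or `0` on stdout is an assumption made for the example.

```python
import subprocess
import sys


def probe_in_subprocess(probe_code: str, timeout: float = 60.0) -> bool:
    """Run probe_code in a fresh interpreter; True if it prints "1"."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe_code],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # a failing probe just means "not available"
        )
    except subprocess.TimeoutExpired:
        return False
    # Any driver state created by the probe dies with the child process.
    return result.stdout.strip() == "1"


# Hypothetical probe template; fill in the device name with str.format.
TVM_PROBE = "import tvm; print(int(tvm.device({name!r}, 0).exist))"
```

For example, `probe_in_subprocess(TVM_PROBE.format(name="opencl"))` would check OpenCL without the parent process ever initializing the OpenCL driver.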

This PR updates the device auto detection to use in-process early exit. This will hopefully address some of the Windows issues with the multi-process approach, while still saving memory by not allocating on every device in auto mode.

TVM's Vulkan runtime emits a non-zero exit code on certain Windows drivers on DLL offloading. While there is definitely a way to fix this, for now we work around it by not checking the exit code in device detection. This PR also improves clarity when multiple GPUs are present by emitting logging messages for all available GPUs, rather than only GPU 0.

* fix gen config for mistral
* fix pylint
* Allow override of sw and chunk size in config gen
* lint fix

---------

Co-authored-by: Charlie Ruan <charlieruan2001130@gmail.com>
Co-authored-by: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>

davidpissarra and others added 30 commits October 7, 2023 22:36

[API] Add GenerationConfig (mlc-ai#1024)

898db76

Fix two bugs in kv-cache backtrack loop (mlc-ai#856)

ad3a6b9

Fix two bugs in kv-cache pop loop
Bug 1: old code would stop early because output_ids was shortened in-place during the loop
Bug 2: off-by-one in backoff size due to break
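
The two pitfalls named in this commit message can be illustrated with a small sketch. It does not reproduce the actual fix: the function names and the `is_stale` predicate are hypothetical, and the example only shows one way to avoid a loop bound that shrinks during iteration and a `break` that skews the count.

```python
from typing import Callable, List


def count_backoff(output_ids: List[int], is_stale: Callable[[int], bool]) -> int:
    """Count trailing tokens to drop, without modifying output_ids."""
    backoff = 0
    for token in reversed(output_ids):
        if not is_stale(token):
            break  # break before counting, so this token is kept
        backoff += 1
    return backoff


def pop_trailing(output_ids: List[int], is_stale: Callable[[int], bool]) -> int:
    """Drop stale trailing tokens in place and return how many were dropped."""
    backoff = count_backoff(output_ids, is_stale)
    # The bound is fixed before any mutation. A loop of the form
    # `while i < len(output_ids): output_ids.pop(); i += 1` would stop
    # about halfway, because the list shrinks as i grows.
    for _ in range(backoff):
        output_ids.pop()
    return backoff
```

Separating the counting pass from the popping pass keeps the backoff size and the number of removed tokens identical by construction.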

[Android] Use AlertDialog instead of Toast (mlc-ai#1039)

bae37b3

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (mlc-…

b44f679

…ai#1040) Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model

[Android] Add Llama2 q4f16_0 (mlc-ai#1041)

3a9849a

llama2 q4f16_0

[Docs] Model prebuilts tracking page revamp (mlc-ai#1000)

bed9e60

Update compile_models.rst (mlc-ai#1038)

c02fdaf

fix permission issue

Support for the Stable LM 3B model (mlc-ai#1008)

85001ed

Support for the stablelm-3b-4e1t model

[Docs] Iterate model prebuilts docs (mlc-ai#1043)

a032d40

* Iterate model prebuilts docs * small fix

Update README.md

a58605f

Update README.md (mlc-ai#1045)

20131fb

Update README.md

add verbose stats to mlc-chat REST API (mlc-ai#1049)

1e6fb11

* add verbose stats to mlc-chat REST API * update docs

[Docs] Add mlc.ai/package to DEPENDENCY INSTALLATION group (mlc-a…

98ebd28

…i#1055) Co-authored-by: Junru Shao <junrushao1994@gmail.com>

Revert "[Transform] Apply split_rotary optimization on prefill (mlc-a…

bfaa5b9

…i#1033)" (mlc-ai#1058) This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment)

[BugFix] Set the right max_sequence_length for both Llama-1 and Lla…

ca8c11b

…ma-2 families (mlc-ai#1032) * fix * reflect feedback --------- Co-authored-by: Sunghyun <sunggg@umich.com>

[Doc] Use -U instead of --force-reinstall (mlc-ai#1062)

edab9b5

`--force-reinstall` will reinstall all dependencies to a python package, which is unnecessary. `-U` is a better choice in this case.

Fix Stable LM 3B build (mlc-ai#1061)

c2b8cbc

* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size" * Add get_num_key_value_heads method to StableLM3bConfig

[Core] Remove duplication in MODEL.get_model calls (mlc-ai#1054)

481cd92

This commit removes the `if`/`elif` chain in `core.py`, where the body of each conditional assigns the same `mod, param_manager, params, model_config`, and is identical except for the choice of model being built.
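
One common way to collapse such a chain is a lookup table with a single shared call site, sketched below. This is not the code in `core.py`: `MODEL_REGISTRY`, `_make_stub` and `build` are hypothetical, and the stub modules stand in for real model modules that expose `get_model`.

```python
from types import SimpleNamespace
from typing import Any, Dict, Tuple


def _make_stub(name: str) -> SimpleNamespace:
    """Build a stand-in for a model module exposing get_model(args, config)."""

    def get_model(args: Any, config: Dict[str, Any]) -> Tuple[str, str, Dict, Dict]:
        return (f"{name}-mod", f"{name}-param-manager", {}, config)

    return SimpleNamespace(get_model=get_model)


# Hypothetical registry: model family prefix -> module-like object.
MODEL_REGISTRY = {"llama": _make_stub("llama"), "gpt_neox": _make_stub("gpt_neox")}


def build(model_name: str, args: Any, config: Dict[str, Any]):
    """Pick the model module by prefix, then run the shared build step once."""
    for prefix, module in MODEL_REGISTRY.items():
        if model_name.startswith(prefix):
            # One shared call site instead of one copy per elif branch.
            mod, param_manager, params, model_config = module.get_model(args, config)
            return mod, param_manager, params, model_config
    raise ValueError(f"Model {model_name} not supported")
```

Adding a model then means adding one registry entry rather than another near-identical branch.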

Minor typo fix (mlc-ai#1064)

9010d48

Add links to Python API Reference (mlc-ai#1068)

b0bfc88

[ParamManager] Added progress bar for get_item/set_item (mlc-ai#1063)

d202077

[Bugfix] Compilation Error in q4f32_1 (mlc-ai#1078)

3aefd9f

The pass `fuse-split-rotary` assumes the compute dtype is fp16, which it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the compute is based on fp32 instead. This PR strengthens the check guard.

MasterJH5574 and others added 29 commits November 20, 2023 11:12

[CI] Fix lint concurrent clone issue (mlc-ai#1299)

01daa64

This PR fixes the broken CI due to different tasks sharing the same workspace.

Auto updated submodule references

418b9a9

Compatible with chatglm (mlc-ai#979)

488f65d

compatible for chatglm

Add q4/q8_ft_group quantization mode (mlc-ai#1284)

2fd1bf5

* Add q4/q8_ft_group quantization mode * Update submodule

[CI] Clean workspace before build (mlc-ai#1304)

5d96740

Cleaning the workspace before building, so that the previous corrupted directory will not affect the current CI build. Used the `cleanWS` from here https://www.jenkins.io/doc/pipeline/steps/ws-cleanup/

add chatglm3 support (mlc-ai#1313)

9641676

* feat: add chatglm3 support * fix: remove duplicate code --------- Co-authored-by: Max Lin <jason.max.lin@outlook.com>

[SLIM] Skip None param when loading rather than failing (mlc-ai#1308)

95f9abe

Auto updated submodule references

9e28540

[nn.Module] Implement GPT-2 Model Support (mlc-ai#1314)

53f2747

Co-authored-by: Charlie Ruan <charlieruan2001130@gmail.com>

remove ndk reference from mali build target (mlc-ai#1312)

b561810

* remove ndk reference from mali target

  This removes the ndk reference for mali targets when building

* adding check for NDK environment variable

---------

Co-authored-by: x <x@xs-MacBook-Pro.local>

[Rust] A few enhancements (mlc-ai#1310)

13759fd

a few enhancements.

[iOS] Mistral support (mlc-ai#1320)

48df439

Add terminator for streaming REST API (mlc-ai#1325)

da07940

* Add terminator for streaming REST API Add compatibility with OpenAI's streaming API. fixes mlc-ai#1300 * make black happy
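
OpenAI's streaming responses are server-sent events that end with a `data: [DONE]` line, which is the terminator this commit refers to. A minimal sketch of a generator that emits such a stream is shown below; `sse_stream` is a hypothetical name and this is not the REST server code from the PR.

```python
import json
from typing import Iterable, Iterator


def sse_stream(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield server-sent-event lines, ending with the [DONE] sentinel."""
    for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    # OpenAI-style clients stop reading when they see this terminator.
    yield "data: [DONE]\n\n"
```

Without the final event, a client written against the OpenAI streaming format may keep waiting for more data after the last chunk.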

read CUDA_ARCH_LIST to set CUDA capability versions for nvcc (mlc-ai#…

992ed42

…1326) read {TORCH_,}CUDA_ARCH_LIST from environment and use it to set compute arch versions
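
Reading the architecture list from the environment might look roughly like the sketch below. Several details are assumptions rather than facts from this commit: the helper name `read_cuda_arch_list`, the precedence of `TORCH_CUDA_ARCH_LIST` over `CUDA_ARCH_LIST`, the accepted separators, and the conversion of entries such as `8.6+PTX` into `86`.

```python
import os
import re
from typing import List, Mapping, Optional


def read_cuda_arch_list(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return compute capabilities such as ["75", "86"] from the environment."""
    env = os.environ if env is None else env
    # Assumed precedence: the TORCH_-prefixed variable wins if both are set.
    raw = env.get("TORCH_CUDA_ARCH_LIST") or env.get("CUDA_ARCH_LIST") or ""
    archs = []
    for item in re.split(r"[;,\s]+", raw.strip()):
        # Keep the leading "major.minor" and ignore suffixes like "+PTX".
        match = re.match(r"^(\d+)\.(\d+)", item)
        if match:
            archs.append(match.group(1) + match.group(2))
    return archs
```

Passing a plain dict as `env` makes the parsing easy to exercise without touching the real environment.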

Update emcc.rst

3358029

[RestAPI] Update parameters for /v1/completions and add tests (mlc-ai…

e7d2ce6

…#1335) Add n and stop for /v1/completions endpoint Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

fix broken REST examples due to recent compatibility change (mlc-ai#1345

5dc809e

) fix broken restapi examples due to recent OpenAI API compatibility changes

[OpenHermes] Add conversation template for OpenHermes Mistral (mlc-ai…

02a41e1

…#1354) Add conversation template for open hermes mistral Co-authored-by: David Pissarra <david.pissarra@tecnico.ulisboa.pt>

[Tokenizer] Prioritize huggingface tokenizer.json, generate one if no…

5315d18

…t included (mlc-ai#1352) * Prioritize tokenizer json, generate one if not included

[Rust] Prepare for publishing (mlc-ai#1342)

76c2807

Prepare for publishing.

Fix ft quantization scale computation (mlc-ai#1321)

a4a06d5

Merge remote-tracking branch 'mlc-ai/main' into merge-dec01

4cefcc9

fix

69eaa38

fix

17c3678

masahi merged commit b9ca4a1 into octoml:batch-serving Dec 1, 2023
