
[Grammar] Integrate with XGrammar #635

Merged
10 commits merged into mlc-ai:main on Nov 22, 2024

Conversation

@CharlieFRuan (Contributor) commented Nov 22, 2024

This PR integrates with XGrammar: https://github.com/mlc-ai/xgrammar.

Prior to this PR, grammar was supported by the grammar portion of MLC-LLM compiled into the model WASM. That portion is now a standalone project, XGrammar. This PR therefore adds mlc-ai/web-xgrammar as a dependency and removes src/grammar.ts, updating llm_chat.ts accordingly for XGrammar's APIs.

In addition to json_schema, we now also support requests with EBNF-formatted grammar strings by specifying the following in the chat completion request. See ebnfGrammarExample() in examples/json-schema for a full example.

    response_format: {
      type: "grammar",
      grammar: jsonGrammarStr,
    } as webllm.ResponseFormat,
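For context, a complete request might look like the sketch below. The toy grammar, the message, and the `engine.chat.completions.create` call are illustrative assumptions, not taken from this PR; only the `response_format` shape is.

```typescript
// Toy EBNF grammar constraining the model's reply to "yes" or "no"
// (the grammar string and message content are illustrative examples).
const yesNoGrammar: string = 'root ::= "yes" | "no"';

const request = {
  messages: [{ role: "user" as const, content: "Is the sky blue?" }],
  response_format: {
    type: "grammar" as const,
    grammar: yesNoGrammar,
  },
};

// With a live engine this would be passed to the chat completions API,
// e.g. `await engine.chat.completions.create(request)`.
console.log(request.response_format.type);
```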

We also add the following performance info:

  • Add grammar_init_ms and grammar_per_token_ms to CompletionUsage.extra when using grammar
  • Add time_to_first_token_s (TTFT), time_per_output_token_s (TPOT), and e2e_latency_s to CompletionUsage.extra

We also add ignore_eos to Completion and ChatCompletion requests, which can be useful for benchmarking purposes.
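These usage fields compose naturally for benchmarking; for instance, decode throughput in tokens/s is the reciprocal of TPOT. A minimal sketch with made-up numbers (the field names come from this PR; the values and the flat object shape are assumptions):

```typescript
// Illustrative values for the fields this PR adds to CompletionUsage.extra;
// real values come back on a completion response's `usage.extra`.
const extra = {
  grammar_init_ms: 12.5,          // one-time grammar compilation cost
  grammar_per_token_ms: 0.4,      // per-token grammar masking overhead
  time_to_first_token_s: 0.25,    // TTFT
  time_per_output_token_s: 0.05,  // TPOT
  e2e_latency_s: 1.25,
};

// Decode throughput (tokens/s) is the reciprocal of TPOT.
const decodeTokensPerSecond = 1 / extra.time_per_output_token_s;
console.log(decodeTokensPerSecond); // ≈ 20
```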

@CharlieFRuan CharlieFRuan marked this pull request as ready for review November 22, 2024 12:08
@tqchen tqchen merged commit c6b1b4e into mlc-ai:main Nov 22, 2024
1 check passed
CharlieFRuan added a commit that referenced this pull request Nov 22, 2024
### Change

- #635
  - Integrate with `web-xgrammar`
  - Support `ResponseFormat.type == "grammar"`, where you specify an EBNF grammar string
  - Add `grammar_init_ms` and `grammar_per_token_ms` to `CompletionUsage.extra` when using grammar
  - Add `time_to_first_token_s` (TTFT), `time_per_output_token_s` (TPOT), and `e2e_latency_s` to `CompletionUsage.extra`
  - Add `ignore_eos` to `Completion` and `ChatCompletion` requests
- #632
  - Fix the VRAM requirement for the Qwen2.5-Coder-1.5B-Instruct model

### TVMjs
- No change; stays on version `0.18.0-dev2`, same as 0.2.71
jzhao62 pushed a commit to jzhao62/web-llm that referenced this pull request Dec 8, 2024
jzhao62 pushed a commit to jzhao62/web-llm that referenced this pull request Dec 8, 2024