
[Grammar] Integrate with XGrammar #635

Merged
10 commits merged into mlc-ai:main on Nov 22, 2024

Conversation

@CharlieFRuan (Contributor) commented Nov 22, 2024

This PR integrates with XGrammar: https://github.com/mlc-ai/xgrammar.

Prior to this PR, grammar was supported by the grammar portion of MLC-LLM compiled into the model WASM. That portion is now a standalone project, XGrammar. This PR therefore adds mlc-ai/web-xgrammar as a dependency and removes src/grammar.ts, updating llm_chat.ts accordingly for XGrammar's APIs.

In addition to json_schema, we now also support requests with EBNF-formatted grammar strings by specifying the following in the chat completion request. See ebnfGrammarExample() in examples/json-schema for a full example.

    response_format: {
      type: "grammar",
      grammar: jsonGrammarStr,
    } as webllm.ResponseFormat,
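For context, a complete request might look like the sketch below. The toy grammar, the message, and the `engine.chat.completions.create` call are illustrative assumptions, not taken from this PR; only the `response_format` shape is.

```typescript
// Toy EBNF grammar constraining the model's reply to "yes" or "no"
// (the grammar string and message content are illustrative examples).
const yesNoGrammar: string = 'root ::= "yes" | "no"';

const request = {
  messages: [{ role: "user" as const, content: "Is the sky blue?" }],
  response_format: {
    type: "grammar" as const,
    grammar: yesNoGrammar,
  },
};

// With a live engine this would be passed to the chat completions API,
// e.g. `await engine.chat.completions.create(request)`.
console.log(request.response_format.type);
```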

We also add the following performance info:

  • Add grammar_init_ms and grammar_per_token_ms to CompletionUsage.extra when using grammar
  • Add time_to_first_token_s (TTFT), time_per_output_token_s (TPOT), and e2e_latency_s to CompletionUsage.extra

We also add ignore_eos to Completion and ChatCompletion requests, which can be useful for benchmarking purposes.
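These usage fields compose naturally for benchmarking; for instance, decode throughput in tokens/s is the reciprocal of TPOT. A minimal sketch with made-up numbers (the field names come from this PR; the values and the flat object shape are assumptions):

```typescript
// Illustrative values for the fields this PR adds to CompletionUsage.extra;
// real values come back on a completion response's `usage.extra`.
const extra = {
  grammar_init_ms: 12.5,          // one-time grammar compilation cost
  grammar_per_token_ms: 0.4,      // per-token grammar masking overhead
  time_to_first_token_s: 0.25,    // TTFT
  time_per_output_token_s: 0.05,  // TPOT
  e2e_latency_s: 1.25,
};

// Decode throughput (tokens/s) is the reciprocal of TPOT.
const decodeTokensPerSecond = 1 / extra.time_per_output_token_s;
console.log(decodeTokensPerSecond); // ≈ 20
```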

@CharlieFRuan CharlieFRuan marked this pull request as ready for review November 22, 2024 12:08
@tqchen tqchen merged commit c6b1b4e into mlc-ai:main Nov 22, 2024
1 check passed
CharlieFRuan added a commit that referenced this pull request Nov 22, 2024
### Change

- #635
  - Integrate with `web-xgrammar`
  - Support `ResponseFormat.type == "grammar"`, where you specify an EBNF grammar string
  - Add `grammar_init_ms` and `grammar_per_token_ms` to `CompletionUsage.extra` when using grammar
  - Add `time_to_first_token_s` (TTFT), `time_per_output_token_s` (TPOT), and `e2e_latency_s` to `CompletionUsage.extra`
  - Add `ignore_eos` to `Completion` and `ChatCompletion` requests
- #632
  - Fix the VRAM requirement for the Qwen2.5-Coder-1.5B-Instruct model

### TVMjs
- No change; stays on version `0.18.0-dev2`, same as 0.2.71
jzhao62 pushed a commit to jzhao62/web-llm that referenced this pull request Dec 8, 2024
jzhao62 pushed a commit to jzhao62/web-llm that referenced this pull request Dec 8, 2024