[Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)
* [Transform] Apply split_rotary optimization on prefill

Prior to this commit, the `transform.fuse_split_rotary_embedding`
function was only applicable to the `decode` function of a Llama-type
model.  This was due to the sequence length being restricted to one,
both in the pattern-match rule and in the `split_rotary` function, and
the function being restricted to operate only on the `decode`
function.

This commit updates the `transform.fuse_split_rotary_embedding` pass
to be a `tvm.ir.transform.Pass`, operating on all applicable functions
matched in the `IRModule`.  The `split_rotary` function is now produced as a
fully-generic function, with static parameters substituted in
afterwards.  At this stage, the sequence length is retained as a
dynamic parameter, such that it can be used by the `prefill` function.

* Avoid multiple kernel launches for split_rotary
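The resulting change in calling convention can be sketched in plain Python. This is an illustrative stand-in only: the real pass is a `tvm.ir.transform.Pass` that rewrites matched functions inside a TVM `IRModule`, whereas the dictionary-based "module" and the rewrite below are hypothetical placeholders.

```python
# Hypothetical sketch of the pass-factory pattern: static model parameters
# are bound when the pass is constructed, and the returned pass is then
# applied to the module. The real TVM pass rewrites matched functions in
# an IRModule; here a plain dict stands in for the module.

def fuse_split_rotary_embedding(num_q_heads, num_kv_heads, head_dim, rope_base):
    """Return a module-to-module pass with the static parameters bound."""
    def pass_fn(mod):
        # A real pass would pattern-match and rewrite every applicable
        # function (e.g. both `prefill` and `decode`); this stand-in just
        # records the bound parameters on a copy of the module.
        new_mod = dict(mod)
        new_mod["split_rotary_params"] = (num_q_heads, num_kv_heads,
                                          head_dim, rope_base)
        return new_mod
    return pass_fn

# Old style: mod = fuse_split_rotary_embedding(mod, ...)
# New style: construct the pass first, then apply it to the module.
mod = {"prefill": "...", "decode": "..."}
mod = fuse_split_rotary_embedding(32, 32, 4096, 10000.0)(mod)
```

Because the static parameters are substituted into a fully generic `split_rotary` only after construction, the sequence length can stay symbolic, which is what lets the same rewrite apply to `prefill` as well as `decode`.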
Lunderberg authored Oct 12, 2023
1 parent 1e6fb11 commit b9179cf
Showing 2 changed files with 260 additions and 203 deletions.
3 changes: 1 addition & 2 deletions mlc_llm/core.py
@@ -402,12 +402,11 @@ def mod_transform_before_build(
     if max_seq_len:
         num_key_value_heads = config.get_num_key_value_heads()
         mod = fuse_split_rotary_embedding(
-            mod,
             config.num_attention_heads // args.num_shards,
             num_key_value_heads // args.num_shards,
             config.hidden_size // args.num_shards,
             config.position_embedding_base,
-        )
+        )(mod)
 
     if args.target_kind == "cuda":
         patterns = []
