model: Add support for PhiMoE arch #11003
I am not particularly good at coding, but I can try running your GGUF and see if I notice anything. No time today, but I can do so tomorrow.
Thanks, no hurry, as the model is quite old and phi4 has already been released. I will see if it gains enthusiasm; I am having a look at the Vision model in parallel.
The Q4_0 with 4096 context does not fit into 32GB of RAM on Windows 10. Output is reasonable. Sometimes I have seen typos (e.g. instead of ...). Successful run with 32768 tokens allocated for context (the prompt was 16883 tokens).
Looks good to me.
Ran llama-bench on Phi3.5 MoE Q4.
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L40S, compute capability 8.9, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| phimoe 16x3.8B Q4_0 | 21.98 GiB | 41.87 B | CUDA | 99 | pp512 | 4647.09 ± 52.40 |
| phimoe 16x3.8B Q4_0 | 21.98 GiB | 41.87 B | CUDA | 99 | tg128 | 98.37 ± 0.03 |
build: 0dae7685 (4398)
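As a rough sanity check on the table above, the reported size and parameter count can be turned into an average bits-per-weight figure. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the inputs are simply the `size` and `params` columns from the benchmark rows.

```python
# Back-of-the-envelope check: average bits stored per parameter,
# using the "size" and "params" columns of the llama-bench table.
GIB = 2**30  # bytes per GiB


def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average number of bits per parameter (illustrative helper)."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billion * 1e9)


bpw = bits_per_weight(21.98, 41.87)
print(f"{bpw:.2f} bits per weight")
```

Q4_0 nominally stores 32 weights in 18 bytes (4.5 bits per weight), so a result slightly above 4.5 is in line with a Q4_0 file, presumably with a few tensors kept at higher precision.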
I rebased this on the latest master
* model: support phimoe
* python linter
* doc: minor
  Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
* doc: minor
  Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
* doc: add phimoe as supported model
  ggml-ci
---------
Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
Nice one, dude.
PhiMoE
Overview
Phi-3.5-MoE is a lightweight, open model built upon the datasets used for Phi-3 (synthetic data and filtered publicly available documents), with a focus on very high-quality, reasoning-dense data.
The model is multilingual and comes with a 128K context length (in tokens).
The PhiMoE model was proposed in "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" by Microsoft.
The model is very similar to Mixtral, with the main difference being Phi3LongRoPEScaledRotaryEmbedding, which is used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP's up and gate projection layers are also fused.
The tokenizer is the LlamaTokenizer, with additional tokens.
License
MIT
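To make the "fused" wording in the overview concrete, here is a minimal sketch of splitting a fused projection output back into its parts. The `[q | k | v]` and `[gate | up]` layouts, the function names, and the toy sizes are all assumptions chosen for illustration; they are not read from the model's code or from this PR.

```python
# Illustrative only: layouts and names are assumptions, not the model's code.

def split_fused_qkv(fused, n_head, n_head_kv, head_dim):
    """Split one fused [q | k | v] vector into query, key and value parts."""
    q_dim = n_head * head_dim
    kv_dim = n_head_kv * head_dim
    if len(fused) != q_dim + 2 * kv_dim:
        raise ValueError("fused vector has an unexpected size")
    q = fused[:q_dim]
    k = fused[q_dim:q_dim + kv_dim]
    v = fused[q_dim + kv_dim:]
    return q, k, v


def split_fused_gate_up(fused):
    """Split one fused [gate | up] vector into two equal halves."""
    half, rem = divmod(len(fused), 2)
    if rem:
        raise ValueError("fused vector must have an even length")
    return fused[:half], fused[half:]


# Toy sizes: 2 query heads, 1 key/value head, head dimension 2.
q, k, v = split_fused_qkv(list(range(8)), n_head=2, n_head_kv=1, head_dim=2)
gate, up = split_fused_gate_up([10, 11, 12, 13])
```

A single fused matrix multiply followed by a split like this produces the same values as separate projections; fusing only changes how the weights are stored and applied.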
Implementation details
The convert script reuses the Phi3MiniModel class, as the parameter names and the long rope scaling logic are the same.
The MoE branch is included in the phi3 model graph implementation with missing bias tensors.
It would be possible to merge phi3 and phimoe into a single arch, but I followed the approach recently taken for granite and kept a separate MoE arch. Also, since Microsoft introduced a dedicated architecture, it can evolve independently in the future.
Testing
Check that phi3 is still working
Links