The Llama 3.2 3B model is a lightweight, text-only version of the Llama 3.2 model, designed to be highly capable with multilingual text generation and tool-calling abilities.
- Highly capable with multilingual text generation\
- Tool-calling abilities for direct interaction with external tools and services\
- Optimized for edge and mobile devices\
- Supports context length of 128K tokens\
- Available for fine-tuning and deployment on a variety of platforms\
- Part of the Llama 3.2 ecosystem, providing seamless integration with other Llama models
- Model size: 3B parameters\
- Context length: 128K tokens\
- Input type: Text\
- Output type: Text\
- Pre-trained on: Large-scale noisy (text) pair data\
- Fine-tuned on: Medium-scale high-quality in-domain and knowledge-enhanced (text) pair data\
- Weights: Based on BFloat16 numerics
- Outperforms Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, summarization, prompt rewriting, and tool-use\
- Competitive with Gemma 2 2.6B model on tasks such as summarization and prompt rewriting
- Personalized on-device agentic applications with strong privacy\
- Text summarization and generation\
- Instruction following and prompt rewriting\
- Tool-calling for direct interaction with external tools and services\
- Multilingual text generation and translation
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-lora support
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (
BAAI/AquilaChat2-7B
,BAAI/AquilaChat2-34B
,BAAI/Aquila-7B
,BAAI/AquilaChat-7B
, etc.) - Baichuan & Baichuan2 (
baichuan-inc/Baichuan2-13B-Chat
,baichuan-inc/Baichuan-7B
, etc.) - BLOOM (
bigscience/bloom
,bigscience/bloomz
, etc.) - ChatGLM (
THUDM/chatglm2-6b
,THUDM/chatglm3-6b
, etc.) - Command-R (
CohereForAI/c4ai-command-r-v01
, etc.) - DBRX (
databricks/dbrx-base
,databricks/dbrx-instruct
etc.) - DeciLM (
Deci/DeciLM-7B
,Deci/DeciLM-7B-instruct
, etc.) - Falcon (
tiiuae/falcon-7b
,tiiuae/falcon-40b
,tiiuae/falcon-rw-7b
, etc.) - Gemma (
google/gemma-2b
,google/gemma-7b
, etc.) - GPT-2 (
gpt2
,gpt2-xl
, etc.) - GPT BigCode (
bigcode/starcoder
,bigcode/gpt_bigcode-santacoder
, etc.) - GPT-J (
EleutherAI/gpt-j-6b
,nomic-ai/gpt4all-j
, etc.) - GPT-NeoX (
EleutherAI/gpt-neox-20b
,databricks/dolly-v2-12b
,stabilityai/stablelm-tuned-alpha-7b
, etc.) - InternLM (
internlm/internlm-7b
,internlm/internlm-chat-7b
, etc.) - InternLM2 (
internlm/internlm2-7b
,internlm/internlm2-chat-7b
, etc.) - Jais (
core42/jais-13b
,core42/jais-13b-chat
,core42/jais-30b-v3
,core42/jais-30b-chat-v3
, etc.) - LLaMA, Llama 2, and Meta Llama 3 (
meta-llama/Meta-Llama-3-8B-Instruct
,meta-llama/Meta-Llama-3-70B-Instruct
,meta-llama/Llama-2-70b-hf
,lmsys/vicuna-13b-v1.3
,young-geng/koala
,openlm-research/open_llama_13b
, etc.) - MiniCPM (
openbmb/MiniCPM-2B-sft-bf16
,openbmb/MiniCPM-2B-dpo-bf16
, etc.) - Mistral (
mistralai/Mistral-7B-v0.1
,mistralai/Mistral-7B-Instruct-v0.1
, etc.) - Mixtral (
mistralai/Mixtral-8x7B-v0.1
,mistralai/Mixtral-8x7B-Instruct-v0.1
,mistral-community/Mixtral-8x22B-v0.1
, etc.) - MPT (
mosaicml/mpt-7b
,mosaicml/mpt-30b
, etc.) - OLMo (
allenai/OLMo-1B-hf
,allenai/OLMo-7B-hf
, etc.) - OPT (
facebook/opt-66b
,facebook/opt-iml-max-30b
, etc.) - Orion (
OrionStarAI/Orion-14B-Base
,OrionStarAI/Orion-14B-Chat
, etc.) - Phi (
microsoft/phi-1_5
,microsoft/phi-2
, etc.) - Phi-3 (
microsoft/Phi-3-mini-4k-instruct
,microsoft/Phi-3-mini-128k-instruct
, etc.) - Qwen (
Qwen/Qwen-7B
,Qwen/Qwen-7B-Chat
, etc.) - Qwen2 (
Qwen/Qwen1.5-7B
,Qwen/Qwen1.5-7B-Chat
, etc.) - Qwen2MoE (
Qwen/Qwen1.5-MoE-A2.7B
,Qwen/Qwen1.5-MoE-A2.7B-Chat
, etc.) - StableLM(
stabilityai/stablelm-3b-4e1t
,stabilityai/stablelm-base-alpha-7b-v2
, etc.) - Starcoder2(
bigcode/starcoder2-3b
,bigcode/starcoder2-7b
,bigcode/starcoder2-15b
, etc.) - Xverse (
xverse/XVERSE-7B-Chat
,xverse/XVERSE-13B-Chat
,xverse/XVERSE-65B-Chat
, etc.) - Yi (
01-ai/Yi-6B
,01-ai/Yi-34B
, etc.)
Visit our documentation to get started.