The Llama 3.2 11B-Vision-Instruct model is a vision-based version of the Llama 3.2 model, designed to be highly capable with visual reasoning and instruction following abilities. This model is ideal for building personalized, on-device agentic applications with strong privacy, where data never leaves the device.
- Highly capable with visual reasoning and instruction following abilities
- Supports image understanding and visual grounding tasks
- Optimized for edge and mobile devices
- Supports context length of 128K tokens
- Available for fine-tuning and deployment on a variety of platforms
- Part of the Llama 3.2 ecosystem, providing seamless integration with other Llama models
- Model size: 11B parameters
- Context length: 128K tokens
- Input type: Text and image
- Output type: Text and image
- Pre-trained on: Large-scale noisy (text, image) pair data
- Fine-tuned on: Medium-scale high-quality in-domain and knowledge-enhanced (text, image) pair data
- Weights: Based on BFloat16 numerics
- Quantized variants: Currently in development
- Competitive with leading foundation models on image recognition and visual understanding tasks
- Outperforms Gemma 2 2.6B and Phi 3.5-mini models on tasks such as following instructions, visual grounding, and image captioning
- Competitive with Gemma 2 2.6B model on tasks such as visual reasoning and image captioning
- Personalized on-device agentic applications with strong privacy
- Visual reasoning and instruction following
- Image understanding and visual grounding
- Image captioning and generation
- Multimodal text and image generation
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-lora support
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (
BAAI/AquilaChat2-7B
,BAAI/AquilaChat2-34B
,BAAI/Aquila-7B
,BAAI/AquilaChat-7B
, etc.) - Baichuan & Baichuan2 (
baichuan-inc/Baichuan2-13B-Chat
,baichuan-inc/Baichuan-7B
, etc.) - BLOOM (
bigscience/bloom
,bigscience/bloomz
, etc.) - ChatGLM (
THUDM/chatglm2-6b
,THUDM/chatglm3-6b
, etc.) - Command-R (
CohereForAI/c4ai-command-r-v01
, etc.) - DBRX (
databricks/dbrx-base
,databricks/dbrx-instruct
etc.) - DeciLM (
Deci/DeciLM-7B
,Deci/DeciLM-7B-instruct
, etc.) - Falcon (
tiiuae/falcon-7b
,tiiuae/falcon-40b
,tiiuae/falcon-rw-7b
, etc.) - Gemma (
google/gemma-2b
,google/gemma-7b
, etc.) - GPT-2 (
gpt2
,gpt2-xl
, etc.) - GPT BigCode (
bigcode/starcoder
,bigcode/gpt_bigcode-santacoder
, etc.) - GPT-J (
EleutherAI/gpt-j-6b
,nomic-ai/gpt4all-j
, etc.) - GPT-NeoX (
EleutherAI/gpt-neox-20b
,databricks/dolly-v2-12b
,stabilityai/stablelm-tuned-alpha-7b
, etc.) - InternLM (
internlm/internlm-7b
,internlm/internlm-chat-7b
, etc.) - InternLM2 (
internlm/internlm2-7b
,internlm/internlm2-chat-7b
, etc.) - Jais (
core42/jais-13b
,core42/jais-13b-chat
,core42/jais-30b-v3
,core42/jais-30b-chat-v3
, etc.) - LLaMA, Llama 2, and Meta Llama 3 (
meta-llama/Meta-Llama-3-8B-Instruct
,meta-llama/Meta-Llama-3-70B-Instruct
,meta-llama/Llama-2-70b-hf
,lmsys/vicuna-13b-v1.3
,young-geng/koala
,openlm-research/open_llama_13b
, etc.) - MiniCPM (
openbmb/MiniCPM-2B-sft-bf16
,openbmb/MiniCPM-2B-dpo-bf16
, etc.) - Mistral (
mistralai/Mistral-7B-v0.1
,mistralai/Mistral-7B-Instruct-v0.1
, etc.) - Mixtral (
mistralai/Mixtral-8x7B-v0.1
,mistralai/Mixtral-8x7B-Instruct-v0.1
,mistral-community/Mixtral-8x22B-v0.1
, etc.) - MPT (
mosaicml/mpt-7b
,mosaicml/mpt-30b
, etc.) - OLMo (
allenai/OLMo-1B-hf
,allenai/OLMo-7B-hf
, etc.) - OPT (
facebook/opt-66b
,facebook/opt-iml-max-30b
, etc.) - Orion (
OrionStarAI/Orion-14B-Base
,OrionStarAI/Orion-14B-Chat
, etc.) - Phi (
microsoft/phi-1_5
,microsoft/phi-2
, etc.) - Phi-3 (
microsoft/Phi-3-mini-4k-instruct
,microsoft/Phi-3-mini-128k-instruct
, etc.) - Qwen (
Qwen/Qwen-7B
,Qwen/Qwen-7B-Chat
, etc.) - Qwen2 (
Qwen/Qwen1.5-7B
,Qwen/Qwen1.5-7B-Chat
, etc.) - Qwen2MoE (
Qwen/Qwen1.5-MoE-A2.7B
,Qwen/Qwen1.5-MoE-A2.7B-Chat
, etc.) - StableLM(
stabilityai/stablelm-3b-4e1t
,stabilityai/stablelm-base-alpha-7b-v2
, etc.) - Starcoder2(
bigcode/starcoder2-3b
,bigcode/starcoder2-7b
,bigcode/starcoder2-15b
, etc.) - Xverse (
xverse/XVERSE-7B-Chat
,xverse/XVERSE-13B-Chat
,xverse/XVERSE-65B-Chat
, etc.) - Yi (
01-ai/Yi-6B
,01-ai/Yi-34B
, etc.)
Visit our documentation to get started.