Stars
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Official repository for paper "Can LVLMs Obtain a Driver’s License? A Benchmark Towards Reliable AGI for Autonomous Driving"
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM
This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
[NeurIPS 2024 D&B Track] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
An open source implementation of CLIP.
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
[MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501
Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
HPHS: Hierarchical Planning based on Hybrid Frontier Sampling for Unknown Environments Exploration
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
A collection of papers on Diffusion for Image-to-Image Translation and Style Transfer
A collection of awesome resources on image-to-image translation.
A Framework of Small-scale Large Multimodal Models
A collection of visual instruction tuning datasets.