- Large Language Model
- Large Vision Model
- Large MMM for Perception
- Large MMM for Generation
- Large MMM for Unification
- Large Model Distillation
- Related Survey
- Related Benchmark
(arXiv2018_GPT) Improving Language Understanding by Generative Pre-Training.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever.
[paper]
[code]
(NAACL2019_BERT) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova.
[paper]
[code]
(arXiv2019_GPT-2) Language Models are Unsupervised Multitask Learners.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever.
[paper]
[code]
(NeurIPS2019_UniLM) Unified Language Model Pre-training for Natural Language Understanding and Generation.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon.
[paper]
[code]
(NeurIPS2019_XLNet) XLNet: Generalized Autoregressive Pretraining for Language Understanding.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
[paper]
[code]
(ICML2020_UniLMv2) UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training.
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon.
[paper]
[code]
(arXiv2020_GPT-3) Language Models are Few-Shot Learners.
OpenAI Team.
[paper]
[code]
(arXiv2021_RoPE) RoFormer: Enhanced Transformer with Rotary Position Embedding.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu.
[paper]
[code]
(arXiv2022_PaLM) PaLM: Scaling Language Modeling with Pathways.
Google Research.
[paper]
[code]
(arXiv2023_LLaMA) LLaMA: Open and Efficient Foundation Language Models.
LLaMA Team.
[paper]
[code]
(arXiv2023_RWKV) RWKV: Reinventing RNNs for the Transformer Era.
RWKV Team.
[paper]
[code]
(arXiv2023_LLM-Judge) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica.
[paper]
[code]
(arXiv2023_RETNET) Retentive Network: A Successor to Transformer for Large Language Models.
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei.
[paper]
[code]
(arXiv2023_Llama2) Llama 2: Open Foundation and Fine-Tuned Chat Models.
LLaMA Team.
[paper]
[code]
(arXiv2023_InternLM) InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities.
InternLM Team.
[paper]
[code]
(arXiv2023_Qwen) Qwen Technical Report.
Qwen Team.
[paper]
[code]
(arXiv2023_LightSeq) LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers.
Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, Hao Zhang.
[paper]
[code]
(arXiv2023_Mamba) Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Albert Gu, Tri Dao.
[paper]
[code]
(arXiv2024_Mixtral) Mixtral of Experts.
Mistral AI.
[paper]
[code]
(arXiv2024_OLMo) OLMo: Accelerating the Science of Language Models.
Allen Institute for AI.
[paper]
[code]
(arXiv2024_Scaling) Unraveling the Mystery of Scaling Laws: Part I.
Hui Su, Zhi Tian, Xiaoyu Shen, Xunliang Cai.
[paper]
(arXiv2024_Phi-3) Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
Microsoft Team.
[paper]
(arXiv2024_Mambav2) Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.
Tri Dao, Albert Gu.
[paper]
[code]
(arXiv2024_Qwen2) Qwen2 Technical Report.
Qwen Team.
[paper]
[code]
(arXiv2024_Llama3) The Llama 3 Herd of Models.
Llama Team.
[paper]
[model]
(arXiv2024_Gemma2) Gemma 2: Improving Open Language Models at a Practical Size.
DeepMind Team.
[paper]
[code]
(arXiv2024_OLMoE) OLMoE: Open Mixture-of-Experts Language Models.
OLMoE Team.
[paper]
[code]
(arXiv2024_DIFF-Transformer) Differential Transformer.
Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei.
[paper]
(arXiv2024_Coconut) Training Large Language Models to Reason in a Continuous Latent Space.
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian.
[paper]
(arXiv2024_Phi-4) Phi-4 Technical Report.
Microsoft Research.
[paper]
(ICLR2021_ViT) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
[paper]
[code]
(ICCV2021_ViViT) ViViT: A Video Vision Transformer.
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
[paper]
[code]
(arXiv2021_MLP-Mixer) MLP-Mixer: An all-MLP Architecture for Vision.
Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy.
[paper]
[code]
(ICLR2022_BEiT) BEiT: BERT Pre-Training of Image Transformers.
Hangbo Bao, Li Dong, Songhao Piao, Furu Wei.
[paper]
[code]
(CVPR2022_MAE) Masked Autoencoders Are Scalable Vision Learners.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
[paper]
[code]
(CVPR2022_RegionCLIP) RegionCLIP: Region-based Language-Image Pretraining.
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao.
[paper]
[code]
(CVPR2022_Uni-Perceiver) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks.
Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai.
[paper]
[code]
(ICLR2022_UniFormer) UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning.
Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao.
[paper]
[code]
(ECCV2022_MVP) MVP: Multimodality-guided Visual Pre-training.
Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian.
[paper]
(arXiv2022_Pix2Seq) A Unified Sequence Interface for Vision Tasks.
Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton.
[paper]
[code]
(arXiv2022_Unified-IO) Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks.
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi.
[paper]
[code]
(arXiv2022_BEiTv2) BEiTv2: Masked Image Modeling with Vector-Quantized Visual Tokenizers.
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei.
[paper]
[code]
(arXiv2022_Visual-Prompting) Visual Prompting via Image Inpainting.
Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei A. Efros.
[paper]
[code]
(ICLR2023_CLIP-ViP) CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment.
Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo.
[paper]
[code]
(ICLR2023_PaLI) PaLI: A Jointly-Scaled Multilingual Language-Image Model.
Google Research.
[paper]
[code]
(CVPR2023_OVSeg) Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP.
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu.
[paper]
[code]
(ICLR2023_ToME) Token Merging: Your ViT But Faster.
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman.
[paper]
[code]
(CVPR2023_InternImage) InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.
Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao.
[paper]
[code]
(CVPR2023_EVA) EVA: Exploring the Limits of Masked Visual Representation Learning at Scale.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao.
[paper]
[code]
(CVPR2023_MAGE) MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis.
Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan.
[paper]
[code]
(arXiv2023_UniFormerV2) UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao.
[paper]
[code]
(CVPR2023_M3I) Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information.
Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai.
[paper]
[code]
(arXiv2022_Uni-Perceiverv2) Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks.
Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai.
[paper]
[code]
(CVPR2023_FLIP) Scaling Language-Image Pre-training via Masking.
Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He.
[paper]
[code]
(CVPR2023_Painter) Images Speak in Images: A Generalist Painter for In-Context Visual Learning.
Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang.
[paper]
[code]
(CVPR2023_MAGVIT) MAGVIT: Masked Generative Video Transformer.
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang.
[paper]
[code]
(CVPR2023_FlexiViT) FlexiViT: One Model for All Patch Sizes.
Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic.
[paper]
[code]
(CVPR2023_X-Decoder) Generalized Decoding for Pixel, Image, and Language.
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao.
[paper]
[code]
(ECCV2024_GroundingDINO) Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang.
[paper]
[code]
(ICML2023_ViT-22B) Scaling Vision Transformers to 22 Billion Parameters.
Google Research.
[paper]
(arXiv2023_EVA-02) EVA-02: A Visual Representation for Neon Genesis.
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao.
[paper]
[code]
(ICCV2023_SigLIP) Sigmoid Loss for Language Image Pre-Training.
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
[paper]
[code]
(arXiv2023_EVA-CLIP) EVA-CLIP: Improved Training Techniques for CLIP at Scale.
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao.
[paper]
[code]
(ICCV2023_UMT) Unmasked Teacher: Towards Training-Efficient Video Foundation Models.
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao.
[paper]
[code]
(CVPR2023_VideoMAEv2) VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking.
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao.
[paper]
[code]
(arXiv2023_SAM) Segment Anything.
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick.
[paper]
[code]
(ICCV2023_SegGPT) SegGPT: Segmenting Everything In Context.
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang.
[paper]
[code]
(arXiv2023_CLIP-Surgery) A Closer Look at the Explainability of Contrastive Language-Image Pre-training.
Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li.
[paper]
[code]
(CVPRW2023_SAM-not-perfect) Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications.
Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, Wenbo Li, Li Cheng.
[paper]
[code]
(NeurIPS2023_SEEM) Segment Everything Everywhere All at Once.
Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee.
[paper]
[code]
(arXiv2023_FastSAM) Fast Segment Anything.
Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, Jinqiao Wang.
[paper]
[code]
(ICLR2024_MAGVITv2) Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation.
Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang.
[paper]
(CVPR2024_LVM) Sequential Modeling Enables Scalable Learning for Large Vision Models.
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros.
[paper]
[code]
(NeurIPS2024_FIND) Interfacing Foundation Models' Embeddings.
Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang.
[paper]
[code]
(arXiv2024_AIM) Scalable Pre-training of Large Autoregressive Image Models.
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin.
[paper]
[code]
(arXiv2024_VIM) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model.
Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang.
[paper]
[code]
(arXiv2024_EVA-CLIP-18B) EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters.
Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang.
[paper]
[code]
(arXiv2024_VisionLLaMA) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks.
Xiangxiang Chu, Jianlin Su, Bo Zhang, Chunhua Shen.
[paper]
[code]
(arXiv2024_Vision-RWKV) Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures.
Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang.
[paper]
[code]
(arXiv2024_VideoMamba) VideoMamba: State Space Model for Efficient Video Understanding.
Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2024_MM-GEM) Multi-Modal Generative Embedding Model.
Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun.
[paper]
(CVPR2024w_EgoVideo) EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation.
Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2024_MambaVision) MambaVision: A Hybrid Mamba-Transformer Vision Backbone.
Ali Hatamizadeh, Jan Kautz.
[paper]
[code]
(arXiv2024_SAM2) SAM 2: Segment Anything in Images and Videos.
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
[paper]
[code]
(NeurIPS2023_VisionLLM) VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks.
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai.
[paper]
[code]
(NeurIPS2023_RECODE) Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models.
Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen.
[paper]
[code]
(arXiv2023_DetGPT) DetGPT: Detect What You Need via Reasoning.
Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang.
[paper]
[code]
(arXiv2023_GRILL) GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions.
Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren.
[paper]
(arXiv2023_DAC) Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models.
Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky.
[paper]
(arXiv2023_Kosmos-2) Kosmos-2: Grounding Multimodal Large Language Models to the World.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.
[paper]
[code]
(arXiv2023_Shikra) Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao.
[paper]
[code]
(arXiv2023_BuboGPT) BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs.
Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang.
[paper]
[code]
(arXiv2023_ChatSpot) ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning.
Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang.
[paper]
(CVPR2024_LISA) LISA: Reasoning Segmentation via Large Language Model.
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia.
[paper]
[code]
(ICLR2024_All-Seeing) The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World.
Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao.
[paper]
[code]
(ICLR2024_Ferret) Ferret: Refer and Ground Anything Anywhere at Any Granularity.
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang.
[paper]
[code]
(arXiv2023_MiniGPT-v2) MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning.
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny.
[paper]
[code]
(CVPR2024_LLM4SGG) LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation.
Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park.
[paper]
[code]
(arXiv2023_SoM) Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao.
[paper]
[code]
(CVPR2024_GLaMM) GLaMM: Pixel Grounding Large Multimodal Model.
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan.
[paper]
[code]
(CVPR2024_LION) LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge.
Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie.
[paper]
[code]
(arXiv2023_DINOv) Visual In-Context Prompting.
Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao.
[paper]
[code]
(arXiv2023_TAP) Tokenize Anything via Prompting.
Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan.
[paper]
[code]
(arXiv2023_Osprey) Osprey: Pixel Understanding with Visual Instruction Tuning.
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu.
[paper]
[code]
(ECCV2024_All-Seeingv2) The All-Seeing Project V2: Towards General Relation Comprehension of the Open World.
Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai.
[paper]
[code]
(arXiv2023_AnyRef) Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception.
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Xuansong Xie.
[paper]
[code]
(arXiv2023_GiT) GiT: Towards Generalist Vision Transformer through Universal Language Interface.
Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang.
[paper]
[code]
(COLM2024_Ferretv2) Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models.
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang.
[paper]
[code]
(arXiv2024_VisionLLMv2) VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks.
Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai.
[paper]
[code]
(ICML2022_BLIP) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.
Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
[paper]
[code]
(ICML2022_OFA) OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang.
[paper]
[code]
(NeurIPS2022_Flamingo) Flamingo: a Visual Language Model for Few-Shot Learning.
DeepMind Team.
[paper]
[code]
(arXiv2022_CoCa) CoCa: Contrastive Captioners are Image-Text Foundation Models.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu.
[paper]
[code]
(CVPR2023_BEiTv3) Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei.
[paper]
[code]
(arXiv2023_BLIP-2) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.
[paper]
[code]
(ICML2023_mPLUG-2) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video.
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou.
[paper]
[code]
(arXiv2023_Kosmos-1) Language Is Not All You Need: Aligning Perception with Language Models.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei.
[paper]
[code]
(arXiv2023_PaLM-E) PaLM-E: An Embodied Multimodal Language Model.
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence.
[paper]
[code]
(arXiv2023_Visual-ChatGPT) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan.
[paper]
[code]
(ICCV2023_ViperGPT) ViperGPT: Visual Inference via Python Execution for Reasoning.
Dídac Surís, Sachit Menon, Carl Vondrick.
[paper]
[code]
(arXiv2023_MM-REACT) MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action.
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang.
[paper]
[code]
(arXiv2023_LLaMA-Adapter) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention.
Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao.
[paper]
[code]
(NeurIPS2023_HuggingGPT) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang.
[paper]
[code]
(NeurIPS2023_LLaVA) Visual Instruction Tuning.
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee.
[paper]
[code]
(arXiv2023_MiniGPT-4) MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny.
[paper]
[code]
(arXiv2023_mPLUG-Owl) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou.
[paper]
[code]
(arXiv2023_LLaMA-AdapterV2) LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao.
[paper]
[code]
(arXiv2023_Otter) Otter: A Multi-Modal Model with In-Context Instruction Tuning.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu.
[paper]
[code]
(arXiv2023_MultiModal-GPT) MultiModal-GPT: A Vision and Language Model for Dialogue with Humans.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen.
[paper]
[code]
(arXiv2023_InternGPT) InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language.
Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao.
[paper]
[code]
(CVPR2023_ImageBind) ImageBind: One Embedding Space To Bind Them All.
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra.
[paper]
[code]
(NeurIPS2023_InstructBLIP) InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.
[paper]
[code]
(arXiv2023_ONE-PEACE) ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities.
Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou.
[paper]
[code]
(EMNLP2023_IdealGPT) IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models.
Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang.
[paper]
[code]
(NeurIPS2023_LaVIN) Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models.
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji.
[paper]
[code]
(arXiv2023_PandaGPT) PandaGPT: One Model To Instruction-Follow Them All.
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai.
[paper]
[code]
(NeurIPS2023_GPT4Tools) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction.
Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan.
[paper]
[code]
(arXiv2023_MIMIC-IT) MIMIC-IT: Multi-Modal In-Context Instruction Tuning.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu.
[paper]
[code]
(AAAI2024_MotionGPT) MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators.
Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang.
[paper]
[code]
(arXiv2023_Meta-Transformer) Meta-Transformer: A Unified Framework for Multimodal Learning.
Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue.
[paper]
[code]
(Blog2023_IDEFICS) Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model.
Hugo Laurençon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, Victor Sanh.
[blog]
(AAAI2024_BLIVA) BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions.
Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu.
[paper]
[code]
(arXiv2023_Qwen-VL) Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
Qwen Team.
[paper]
[code]
(ACL2024_TextBind) TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild.
Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi.
[paper]
[code]
(arXiv2023_Kosmos-2.5) Kosmos-2.5: A Multimodal Literate Model.
Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei.
[paper]
[code]
(arXiv2023_X-Training) Small-scale proxies for large-scale Transformer training instabilities.
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith.
[paper]
(arXiv2023_LLaVA-RLHF) Aligning Large Multimodal Models with Factually Augmented RLHF.
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell.
[paper]
[code]
(arXiv2023_InternLM-XComposer) InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition.
Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper]
[code]
(CVPR2024_LLaVA1.5) Improved Baselines with Visual Instruction Tuning.
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee.
[paper]
[code]
(arXiv2023_OpenLEAF) OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation.
Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo.
[paper]
(arXiv2023_COMM) From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models.
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin'e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong.
[paper]
[code]
(arXiv2023_Open X-Embodiment) Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
Open X-Embodiment Collaboration.
[paper]
[code]
(arXiv2023_Woodpecker) Woodpecker: Hallucination Correction for Multimodal Large Language Models.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen.
[paper]
[code]
(CVPR2024_CapsFusion) CapsFusion: Rethinking Image-Text Data at Scale.
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu.
[paper]
[code]
(Blog2023_Fuyu-8B) Fuyu-8B: A Multimodal Architecture for AI Agents.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
[blog]
(Blog2024_Fuyu-Heavy) Fuyu-Heavy: A New Multimodal Model.
Adept Team.
[blog]
(arXiv2023_CogVLM) CogVLM: Visual Expert for Pretrained Language Models.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang.
[paper]
[code]
(arXiv2023_OtterHD) OtterHD: A High-Resolution Multi-modality Model.
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu.
[paper]
[code]
(arXiv2023_mPLUG-Owl2) mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration.
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou.
[paper]
[code]
(CVPR2024_Monkey) Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models.
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai.
[paper]
[code]
(arXiv2023_LVIS-Instruct4V) To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning.
Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang.
[paper]
[code]
(arXiv2023_ShareGPT4V) ShareGPT4V: Improving Large Multi-Modal Models with Better Captions.
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin.
[paper]
[code]
(CVPR2024_Honeybee) Honeybee: Locality-enhanced Projector for Multimodal LLM.
Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh.
[paper]
[code]
(CVPR2024_VILA) VILA: On Pre-training for Visual Language Models.
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han.
[paper]
[code]
(arXiv2023_CogAgent) CogAgent: A Visual Language Model for GUI Agents.
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang.
[paper]
[code]
(CVPR2024_InternVL) InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks.
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai.
[paper]
[code]
(arXiv2023_MobileVLM) MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices.
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen.
[paper]
[code]
(CVPR2024_MMVP) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs.
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie.
[paper]
[code]
(arXiv2024_InternLM-XComposer2) InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model.
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper]
[code]
(arXiv2024_MobileVLM-V2) MobileVLM V2: Faster and Stronger Baseline for Vision Language Model.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen.
[paper]
[code]
(ICML2024_Prismatic-VLMs) Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models.
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh.
[paper]
[code]
(arXiv2024_Bunny) Efficient Multimodal Learning from Data-centric Perspective.
Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao.
[paper]
[code]
(arXiv2024_DeepSeek-VL) DeepSeek-VL: Towards Real-World Vision-Language Understanding.
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan.
[paper]
[code]
(ECCV2024_FastV) An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models.
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang.
[paper]
[code]
(arXiv2024_LLaVA-UHD) LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images.
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang.
[paper]
[code]
(arXiv2024_CoS) Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models.
Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, Jiwen Lu.
[paper]
[code]
(arXiv2024_S2) When Do We Not Need Larger Vision Models?
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell.
[paper]
[code]
(arXiv2024_Cobra) Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference.
Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang.
[paper]
[code]
(arXiv2024_LLaVA-PruMerge) LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models.
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan.
[paper]
[code]
(Blog2024_Idefics2) Introducing Idefics2: A Powerful 8B Vision-Language Model for the community.
Leo Tronchon, Hugo Laurençon, Victor Sanh.
[blog]
(arXiv2024_InternLM-XComposer2-4KHD) InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD.
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper]
[code]
(ECCV2024_BRAVE) BRAVE: Broadening the visual encoding of vision-language models.
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari.
[paper]
[code]
(arXiv2024_InternVL1.5) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites.
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang.
[paper]
[code]
(arXiv2024_MANTIS) MANTIS: Interleaved Multi-Image Instruction Tuning.
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen.
[paper]
[code]
(NeurIPS2024_DenseConnector) Dense Connector for MLLMs.
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang.
[paper]
[code]
(arXiv2024_Ovis) Ovis: Structural Embedding Alignment for Multimodal Large Language Model.
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye.
[paper]
[code]
(arXiv2024_Cambrian-1) Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs.
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie.
[paper]
[code]
(Blog2024_LLaVA-NeXT) LLaVA-NeXT-series.
[blog]
(Blog2024_InternVL) InternVL-series.
[blog]
(NeurIPS2024_EVE) Unveiling Encoder-Free Vision-Language Models.
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang.
[paper]
[code]
(arXiv2024_InternLM-XComposer-2.5) InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output.
Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang.
[paper]
[code]
(arXiv2024_SOLO) A Single Transformer for Scalable Vision-Language Modeling.
Yangyi Chen, Xingyao Wang, Hao Peng, Heng Ji.
[paper]
[code]
(arXiv2024_PaliGemma) PaliGemma: A versatile 3B VLM for transfer.
DeepMind Team.
[paper]
[code]
(arXiv2024_LMMs-Eval) LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models.
[blog]
[paper]
[code]
(arXiv2024_EVLM) EVLM: An Efficient Vision-Language Model for Visual Understanding.
Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang.
[paper]
(arXiv2024_VILA2) VILA2: VILA Augmented VILA.
Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin.
[paper]
[code]
(arXiv2024_MoMa) MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts.
Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan.
[paper]
(arXiv2024_MiniCPM-V) MiniCPM-V: A GPT-4V Level MLLM on Your Phone.
MiniCPM-V Team.
[paper]
[code]
(arXiv2024_LLaVA-OneVision) LLaVA-OneVision: Easy Visual Task Transfer.
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li.
[paper]
[code]
(arXiv2024_mPLUG-Owl3) mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models.
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou.
[paper]
[code]
(arXiv2024_VITA) VITA: Towards Open-Source Interactive Omni Multimodal LLM.
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun.
[paper]
[code]
(arXiv2024_CROME) CROME: Cross-Modal Adapters for Efficient Multimodal LLM.
Sayna Ebrahimi, Sercan O. Arik, Tejas Nama, Tomas Pfister.
[paper]
(arXiv2024_BLIP-3) xGen-MM (BLIP-3): A Family of Open Large Multimodal Models.
Salesforce AI Research.
[paper]
[code]
(arXiv2024_MaVEn) MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model.
Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang.
[paper]
(Blog2024_QwenVL) QwenVL-series.
[blog]
(arXiv2024_Eagle) Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders.
Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu.
[paper]
[code]
(arXiv2024_AC-score) Law of Vision Representation in MLLMs.
Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu.
[paper]
[code]
(arXiv2024_NVLM) NVLM: Open Frontier-Class Multimodal LLMs.
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping.
[paper]
(arXiv2024_Qwen2-VL) Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution.
Qwen Team.
[paper]
[code]
(arXiv2024_Oryx) Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution.
Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao.
[paper]
[code]
(arXiv2024_SSC-DSC) Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models.
Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang.
[paper]
(arXiv2024_SparseVLM) SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference.
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang.
[paper]
[code]
(arXiv2024_Aria) Aria: An Open Multimodal Native Mixture-of-Experts Model.
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li.
[paper]
[code]
(arXiv2024_Pixtral-12B) Pixtral 12B.
Mistral AI.
[paper]
[code]
(arXiv2024_Mono-InternVL) Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training.
Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, Xizhou Zhu.
[paper]
[code]
(arXiv2024_ROSS) Reconstructive Visual Instruction Tuning.
Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang.
[paper]
[code]
(arXiv2024_Infinity-MM) Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data.
BAAI Group.
[paper]
[code]
(arXiv2024_GPT-4o) GPT-4o System Card.
OpenAI Group.
[paper]
(arXiv2024_TaskVector) Task Vectors are Cross-Modal.
Grace Luo, Trevor Darrell, Amir Bar.
[paper]
[code]
(arXiv2024_MoT) Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models.
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin.
[paper]
(arXiv2024_SAE) Large Multi-modal Models Can Interpret Features in Large Multi-modal Models.
Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu.
[paper]
(arXiv2024_InformationFlow) Cross-modal Information Flow in Multimodal Large Language Models.
Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova.
[paper]
(arXiv2024_PaliGemma2) PaliGemma 2: A Family of Versatile VLMs for Transfer.
DeepMind Team.
[paper]
(arXiv2024_Florence-VL) Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion.
Microsoft Research.
[paper]
[code]
(arXiv2024_VisionZip) VisionZip: Longer is Better but Not Necessary in Vision Language Models.
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia.
[paper]
[code]
(arXiv2024_NVILA) NVILA: Efficient Frontier Visual Language Models.
NVIDIA Team.
[paper]
[code]
(arXiv2024_MAmmoTH-VL) MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale.
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue.
[paper]
[code]
(arXiv2024_CompCap) CompCap: Improving Multimodal Large Language Models with Composite Captions.
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He.
[paper]
(arXiv2024_InternVL2.5) Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.
Shanghai AI Laboratory.
[paper]
[code]
(arXiv2024_MMGiC) Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models.
Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan.
[paper]
[code]
(arXiv2024_Euclid) Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions.
Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger.
[paper]
(arXiv2022_InternVideo) InternVideo: General Video Foundation Models via Generative and Discriminative Learning.
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2022_VideoCoCa) VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners.
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu.
[paper]
(arXiv2023_VideoChat) VideoChat: Chat-Centric Video Understanding.
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2023_VideoLLM) VideoLLM: Modeling Video Sequence with Large Language Models.
Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang.
[paper]
[code]
(arXiv2023_VSTAR) VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions.
Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao.
[paper]
[code]
(EMNLP2023_Video-LLaMA) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding.
Hang Zhang, Xin Li, Lidong Bing.
[paper]
[code]
(arXiv2023_Video-ChatGPT) Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models.
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan.
[paper]
[code]
(arXiv2023_Valley) Valley: Video Assistant with Large Language model Enhanced abilitY.
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei.
[paper]
[code]
(CVPR2024_MovieChat) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding.
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang.
[paper]
[code]
(EMNLP2023_TESTA) TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding.
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou.
[paper]
[code]
(CVPR2024_Chat-UniVi) Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding.
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan.
[paper]
[code]
(arXiv2023_PG-Video-LLaVA) PG-Video-LLaVA: Pixel Grounding Large Video-Language Models.
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan.
[paper]
[code]
(arXiv2023_VideoChat2) MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2023_LLaMA-VID) LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models.
Yanwei Li, Chengyao Wang, Jiaya Jia.
[paper]
[code]
(arXiv2024_LSTP) LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding.
Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng.
[paper]
[code]
(arXiv2024_LLaVA-Hound-DPO) Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward.
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang.
[paper]
[code]
(arXiv2024_ShareGPT4Video) ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang.
[paper]
[code]
(arXiv2024_LongVA) Long Context Transfer from Language to Vision.
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu.
[paper]
[code]
(arXiv2024_SF-LLaVA) SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models.
Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan.
[paper]
[code]
(arXiv2024_LongVILA) LongVILA: Scaling Long-Context Visual Language Models for Long Videos.
Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han.
[paper]
[code]
(arXiv2024_Video-CCAM) Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos.
Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, Hui Wang.
[paper]
[code]
(arXiv2024_LongLLaVA) LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture.
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang.
[paper]
[code]
(arXiv2024_Video-XL) Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding.
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao.
[paper]
(arXiv2024_LLaVA-Video) Video Instruction Tuning With Synthetic Data.
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li.
[paper]
[code]
(arXiv2024_AuroraCap) AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark.
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning.
[paper]
[code]
(arXiv2024_BLIP-3-Video) xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs.
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles.
[paper]
[code]
(arXiv2024_LongVU) LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding.
Meta AI.
[paper]
[code]
(arXiv2024_VISTA) VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation.
Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen.
[paper]
[code]
(arXiv2024_InternLM-XComposer2.5-OmniLive) InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions.
Shanghai AI Laboratory.
[paper]
[code]
(ICML2020_iGPT) Generative Pretraining From Pixels.
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
[paper]
(NeurIPS2020_DDPM) Denoising Diffusion Probabilistic Models.
Jonathan Ho, Ajay Jain, Pieter Abbeel.
[paper]
[code]
(ICLR2021_DDIM) Denoising Diffusion Implicit Models.
Jiaming Song, Chenlin Meng, Stefano Ermon.
[paper]
[code]
(CVPR2022_MaskGIT) MaskGIT: Masked Generative Image Transformer.
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman.
[paper]
[code]
(ICCV2023_DiT) Scalable Diffusion Models with Transformers.
William Peebles, Saining Xie.
[paper]
[code]
(arXiv2023_GIVT) GIVT: Generative Infinite-Vocabulary Transformers.
Michael Tschannen, Cian Eastwood, Fabian Mentzer.
[paper]
[code]
(arXiv2024_FiT) FiT: Flexible Vision Transformer for Diffusion Model.
Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai.
[paper]
[code]
(CVPR2024_V2T-Tokenizer) Beyond Text: Frozen Large Language Models in Visual Signal Comprehension.
Lei Zhu, Fangyun Wei, Yanye Lu.
[paper]
[code]
(arXiv2024_VAR) Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction.
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang.
[paper]
[code]
(arXiv2024_LlamaGen) Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan.
[paper]
[code]
(NeurIPS2024_LI-DiT) Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models.
Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu.
[paper]
(NeurIPS2024_MAR) Autoregressive Image Generation without Vector Quantization.
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He.
[paper]
[code]
(arXiv2024_RAR) Randomized Autoregressive Visual Generation.
Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen.
[paper]
[code]
(arXiv2024_RandAR) RandAR: Decoder-only Autoregressive Visual Generation in Random Orders.
Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang.
[paper]
[code]
(arXiv2024_ACDiT) ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer.
Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun.
[paper]
[code]
(arXiv2024_LatentLM) Multimodal Latent Language Modeling with Next-Token Diffusion.
Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei.
[paper]
(Blog2024_DMFM) Diffusion Meets Flow Matching: Two Sides of the Same Coin.
DeepMind Team.
[blog]
[wechat]
(CVPR2022_LDM) High-Resolution Image Synthesis with Latent Diffusion Models.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
[paper]
[code]
(ICCV2023_ControlNet) Adding Conditional Control to Text-to-Image Diffusion Models.
Lvmin Zhang, Anyi Rao, Maneesh Agrawala.
[paper]
[code]
(CVPR2023_GigaGAN) Scaling up GANs for Text-to-Image Synthesis.
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, Taesung Park.
[paper]
[code]
(NeurIPS2023_GILL) Generating Images with Multimodal Language Models.
Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov.
[paper]
[code]
(arXiv2023_Emu) Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack.
GenAI Team.
[paper]
(CVPR2024_Powers-of-Ten) Generative Powers of Ten.
Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski.
[paper]
[code]
(arXiv2024_CoBSAT) Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee.
[paper]
[code]
(arXiv2024_SD3) Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
Stability AI.
[paper]
[code]
(ECCV2024_ZigMa) ZigMa: A DiT-style Zigzag Mamba Diffusion Model.
Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, Björn Ommer.
[paper]
[code]
(arXiv2024_Ctrl-Adapter) Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model.
Han Lin, Jaemin Cho, Abhay Zala, Mohit Bansal.
[paper]
[code]
(arXiv2024_Lumina-T2X) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.
Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li.
[paper]
[code]
(arXiv2024_CDF) Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion.
Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann.
[paper]
[code]
(arXiv2024_MARS) MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis.
Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, LeiLei Gan, Hao Jiang.
[paper]
[code]
(arXiv2024_MELLE) Autoregressive Speech Synthesis without Vector Quantization.
Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei.
[paper]
[code]
(arXiv2024_PGv3) Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models.
Playground Research.
[paper]
(arXiv2024_OmniGen) OmniGen: Unified Image Generation.
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu.
[paper]
[code]
(arXiv2024_ControlAR) ControlAR: Controllable Image Generation with Autoregressive Models.
Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang.
[paper]
[code]
(arXiv2024_ZipAR) ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality.
Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang.
[paper]
[code]
(arXiv2024_Infinity) Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis.
Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu.
[paper]
[code]
(arXiv2024_Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov.
[paper]
[demo]
(arXiv2024_Pyramid-Flow) Pyramidal Flow Matching for Efficient Video Generative Modeling.
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin.
[paper]
[demo]
(arXiv2024_Koala-36M) Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content.
Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, Di Zhang.
[paper]
[demo]
(arXiv2024_Movie-Gen) Movie Gen: A Cast of Media Foundation Models.
Movie Gen Team.
[paper]
[demo]
(arXiv2024_HunyuanVideo) HunyuanVideo: A Systematic Framework For Large Video Generative Models.
Hunyuan Foundation Model Team.
[paper]
[demo]
(arXiv2024_DiCoDe) DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models.
Yizhuo Li, Yuying Ge, Yixiao Ge, Ping Luo, Ying Shan.
[paper]
[demo]
(NeurIPS2023_CoDi) Any-to-Any Generation via Composable Diffusion.
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal.
[paper]
[code]
(ICLR2024_Emu) Generative Pretraining in Multimodality.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang.
[paper]
[code]
(ICLR2024_LaVIT) Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization.
Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, Di Zhang, Wenwu Ou, Kun Gai, Yadong Mu.
[paper]
[code]
(ICML2024_NExT-GPT) NExT-GPT: Any-to-Any Multimodal Large Language Model.
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua.
[paper]
[code]
(ICLR2024_DreamLLM) DreamLLM: Synergistic Multimodal Comprehension and Creation.
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, Li Yi.
[paper]
[code]
(ICLR2024_SEED) Making LLaMA SEE and Draw with SEED Tokenizer.
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan.
[paper]
[code]
(CVPR2024_OneLLM) OneLLM: One Framework to Align All Modalities with Language.
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue.
[paper]
[code]
(arXiv2023_VL-GPT) VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation.
Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan.
[paper]
[code]
(arXiv2023_Gemini) Gemini: A Family of Highly Capable Multimodal Models.
Gemini Team, Google.
[paper]
(CVPR2024_Emu2) Generative Multimodal Models are In-Context Learners.
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang.
[paper]
[code]
(arXiv2023_Unified-IO-2) Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action.
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi.
[paper]
[code]
(arXiv2024_MM-Interleaved) MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer.
Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai.
[paper]
[code]
(arXiv2024_Video-LaVIT) Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization.
Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu.
[paper]
[code]
(arXiv2024_LWM) World Model on Million-Length Video And Language With Blockwise RingAttention.
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel.
[paper]
[code]
(ACL2024_AnyGPT) AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling.
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu.
[paper]
[code]
(arXiv2024_Mini-Gemini) Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models.
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia.
[paper]
[code]
(arXiv2024_SEED-X) SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation.
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan.
[paper]
[code]
(arXiv2024_Chameleon) Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Chameleon Team.
[paper]
(arXiv2024_X-VILA) X-VILA: Cross-Modality Alignment for Large Language Model.
Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin.
[paper]
(NeurIPS2024_Vitron) Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing.
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan.
[paper]
[code]
(arXiv2024_ANOLE) ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation.
Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu.
[paper]
[code]
(arXiv2024_SEED-Story) SEED-Story: Multimodal Long Story Generation with Large Language Model.
Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen.
[paper]
[code]
(arXiv2024_Transfusion) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy.
[paper]
(arXiv2024_Show-o) Show-o: One Single Transformer to Unify Multimodal Understanding and Generation.
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou.
[paper]
[code]
(arXiv2024_VILA-U) VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation.
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu.
[paper]
(arXiv2024_MonoFormer) MonoFormer: One Transformer for Both Diffusion and Autoregression.
Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang.
[paper]
[code]
(arXiv2024_MIO) MIO: A Foundation Model on Multimodal Tokens.
Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang.
[paper]
(arXiv2024_Emu3) Emu3: Next-Token Prediction is All You Need.
Emu3 Team.
[paper]
[code]
(arXiv2024_MMAR) MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling.
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei Zhai, Yang Cao, Zheng-Jun Zha.
[paper]
[code]
(arXiv2024_Janus) Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation.
DeepSeek AI.
[paper]
[code]
(arXiv2024_MotionGPT-2) MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding.
Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, Dan Xu.
[paper]
(arXiv2024_JanusFlow) JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation.
DeepSeek AI.
[paper]
[code]
(arXiv2024_Spider) Spider: Any-to-Many Multimodal LLM.
Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo.
[paper]
(arXiv2024_MUSE-VL) MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding.
Rongchang Xie, Chen Du, Ping Song, Chang Liu.
[paper]
(arXiv2024_JetFormer) JetFormer: An Autoregressive Generative Model of Raw Images and Text.
Michael Tschannen, André Susano Pinto, Alexander Kolesnikov.
[paper]
(arXiv2024_Orthus) Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads.
Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng.
[paper]
(arXiv2024_OmniFlow) OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows.
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover.
[paper]
[code]
(arXiv2024_TokenFlow) TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation.
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu.
[paper]
[code]
(EMNLP2016_Seq-KD) Sequence-Level Knowledge Distillation.
Yoon Kim, Alexander M. Rush.
[paper]
[code]
(EMNLP2020_ImitKD) Autoregressive Knowledge Distillation through Imitation Learning.
Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei.
[paper]
[code]
(ICLR2023_Ensemble) Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning.
Zeyuan Allen-Zhu, Yuanzhi Li.
[paper]
(arXiv2022_Unified-KD) Knowledge Distillation of Transformer-based Language Models Revisited.
Chengqiang Lu, Jianwei Zhang, Yunfei Chu, Zhengyu Chen, Jingren Zhou, Fei Wu, Haiqing Chen, Hongxia Yang.
[paper]
(arXiv2022_MT-CoT) Explanations from Large Language Models Make Small Reasoners Better.
Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, Wenhu Chen, Xifeng Yan.
[paper]
(ACL2023_SOCRATIC-CoT) Distilling Reasoning Capabilities into Smaller Language Models.
Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan.
[paper]
(ACL2023_FT-CoT) Large Language Models Are Reasoning Teachers.
Namgyu Ho, Laura Schmid, Se-Young Yun.
[paper]
(ACL2023_DISCO) DISCO: Distilling Counterfactuals with Large Language Models.
Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson.
[paper]
(arXiv2022_ICT-D) In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models.
Yukun Huang, Yanda Chen, Zhou Yu, Kathleen McKeown.
[paper]
(ICML2023_ModelSpecializing) Specializing Smaller Language Models towards Multi-Step Reasoning.
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot.
[paper]
(ACL2023_SCOTT) SCOTT: Self-Consistent Chain-of-Thought Distillation.
Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren.
[paper]
[code]
(ACL2023_Distilling-step-by-step) Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes.
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister.
[paper]
(EMNLP2023_Lion) Lion: Adversarial Distillation of Proprietary Large Language Models.
Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang.
[paper]
[code]
(ICLR2024_MINILLM) MINILLM: Knowledge Distillation of Large Language Models.
Yuxian Gu, Li Dong, Furu Wei, Minlie Huang.
[paper]
[code]
(NeurIPS2023_KD) Propagating Knowledge Updates to LMs Through Distillation.
Shankar Padmanabhan, Yasumasa Onoe, Michael J.Q. Zhang, Greg Durrett, Eunsol Choi.
[paper]
[code]
(ICLR2024_GKD) On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes.
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem.
[paper]
[code]
(ACL2023_f-DISTILL) f-Divergence Minimization for Sequence-Level Knowledge Distillation.
Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou.
[paper]
[code]
(arXiv2023_BabyLlama) Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty.
Inar Timiryasov, Jean-Loup Tastet.
[paper]
[code]
(ICLR2024_DistillSpec) DistillSpec: Improving Speculative Decoding via Knowledge Distillation.
Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal.
[paper]
(arXiv2023_MiniMA) Towards the Law of Capacity Gap in Distilling Language Models.
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao.
[paper]
[code]
(ICML2024_Self-Rewarding) Self-Rewarding Language Models.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston.
[paper]
(arXiv2020_Survey) Efficient Transformers: A Survey.
Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler.
[paper]
(arXiv2023_Survey) A Survey of Large Language Models.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen.
[paper]
(arXiv2023_Survey) Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models.
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, Jianlong Chang, Qi Tian.
[paper]
(arXiv2023_Survey) A Survey on Multimodal Large Language Models.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen.
[paper]
[code]
(arXiv2023_Survey) A Survey on Model Compression for Large Language Models.
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang.
[paper]
(arXiv2023_Survey) Multimodal Foundation Models: From Specialists to General-Purpose Assistants.
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao.
[paper]
[code]
(arXiv2023_Survey) A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future.
Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, Ting Liu.
[paper]
[code]
(CVPR2023w_Survey) Recent Advances in Vision Foundation Models.
Linjie Li, Zhe Gan, Chunyuan Li, Jianwei Yang, Zhengyuan Yang, Jianfeng Gao, Lijuan Wang.
[paper]
(arXiv2023_Survey) Efficient Large Language Models: A Survey.
Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang.
[paper]
[code]
(arXiv2023_Survey) A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise.
Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun.
[paper]
[code]
(NeurIPS2023_LAMM) LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark.
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang.
[paper]
[code]
(arXiv2023_MME) MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji.
[paper]
[code]
(ECCV2024_MMBench) MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin.
[paper]
[code]
(arXiv2023_SEED-Bench) SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension.
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan.
[paper]
[code]
(arXiv2023_MagnifierBench) OtterHD: A High-Resolution Multi-modality Model.
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu.
[paper]
[code]
(arXiv2023_Video-Bench) Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models.
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan.
[paper]
[code]
(arXiv2023_MVBench) MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao.
[paper]
[code]
(arXiv2023_SEED-Bench-2) SEED-Bench-2: Benchmarking Multimodal Large Language Models.
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan.
[paper]
[code]
(arXiv2023_VBench) VBench: Comprehensive Benchmark Suite for Video Generative Models.
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu.
[paper]
[code]
(arXiv2024_VL-ICL) VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning.
Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales.
[paper]
[code]
(arXiv2024_ConvBench) ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models.
Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang.
[paper]
(ECCV2024_BLINK) BLINK: Multimodal Large Language Models Can See but Not Perceive.
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna.
[paper]
[code]
(arXiv2024_MMT-Bench) MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI.
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao.
[paper]
(arXiv2024_SEED-Bench-2-Plus) SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension.
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan.
[paper]
[code]
(arXiv2024_Video-MME) Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun.
[paper]
[code]