You are welcome to add new multimodal works, fix errors, or make any other modifications that help make this awesome list more useful or interesting. Click here to find the contribution tutorial. We promise that your pull requests will be processed within 24 hours. Thank you for your contributions.
Paper | Venue | Time | Link | Notes |
---|---|---|---|---|
Knowledge Mechanisms in Large Language Models: A Survey and Perspective | Arxiv | 2024.07 | Arxiv | Categorises the knowledge mechanisms of LLMs into knowledge utilization and knowledge evolution. Knowledge utilization delves into the mechanisms of memorization, comprehension and application, and creation; knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. |
When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models | Arxiv | 2024.05 | Arxiv | Introduces background knowledge of the 3D field and the revolutionary changes that LLMs have brought to it. |
Large Multimodal Agents: A Survey | Arxiv | 2024.02 | Arxiv | 🕐 Coming soon... |
The Revolution of Multimodal Large Language Models: A Survey | ACL 2024 | V1:2024.02 - V2:2024.06 | Arxiv | 🕐 Coming soon... |
MM-LLMs: Recent Advances in MultiModal Large Language Models | ACL 2024 | V1:2024.01 - V5:2024.05 | Arxiv | 🕐 Coming soon... |
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | Arxiv | 2024.01 | Arxiv | 🕐 Coming soon... |
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey | Arxiv | 2023.12 | Arxiv | 🕐 Coming soon... |
Multimodal Large Language Models: A Survey | BigData 2023 | 2023.11 | Arxiv | 🕐 Coming soon... |
A Survey on Multimodal Large Language Models for Autonomous Driving | WACV 2024 | 2023.11 | Arxiv | 🕐 Coming soon... |
Multimodal Foundation Models: From Specialists to General-Purpose Assistants | Arxiv | 2023.09 | Arxiv | 🕐 Coming soon... |
Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models | Arxiv | 2023.08 | Arxiv | 🕐 Coming soon... |
A Survey on Multimodal Large Language Models | Arxiv | 2023.06 | Arxiv | Discusses MLLMs from four perspectives: Multimodal Instruction Tuning, Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-Aided Visual Reasoning. |
Title | Time | Link | Notes |
---|---|---|---|
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | 2024.01 | Paper | The authors defined a "CLIP-blind pair" as two images that appear visually dissimilar but have very similar features according to CLIP's output. They also utilized GPT to summarize the characteristics of images that the model finds challenging to recognize. |
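
For the row above, the core signal is the similarity between CLIP image embeddings: a "CLIP-blind pair" is visually dissimilar yet nearly identical in CLIP feature space. Below is a minimal sketch, assuming the Hugging Face `transformers` CLIP implementation and placeholder image paths; the 0.95 cutoff is an illustrative assumption, not the paper's exact protocol.

```python
# Minimal sketch: flagging a candidate "CLIP-blind pair" by comparing CLIP
# image embeddings. Image paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    images = [Image.open(path_a), Image.open(path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

# Visually different images whose similarity exceeds a high threshold
# (0.95 here is an assumption) are candidate CLIP-blind pairs worth
# turning into VQA probes.
if clip_similarity("img_a.jpg", "img_b.jpg") > 0.95:
    print("candidate CLIP-blind pair")
```
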
Name | Time | Modal | Params | Link | Notes |
---|---|---|---|---|---|
Qwen2-VL | 2024.08 | Image, Video, Language | 2B, 7B, 72B | Github | |
PaliGemma | 2024.07 | Image, Language | 3B | Paper | VL large model focused on transfer learning |
MM1 | 2024.03 | Image, Language | 3B, 7B, 30B | Paper, Github | Ablation experiments are performed on model architecture decisions and pre-training data choices to determine the optimal configuration |
MiniCPM-V | 2024.02 | Image, Language | 2B, 8B | Paper, Github | Lightweight VL models focusing on end-side deployment |
InternVL | 2023.12 | Image, Language | 14B, 40B | Paper | |
LLaVA | 2023.04 | Image, Language | V1: 7B, 13B; V1.5: 7B, 13B; V1.6: 7B, 13B, 34B | Page, Paper1, Paper2, Github | The first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data; trains an end-to-end large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. |
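
The LLaVA row above describes connecting a vision encoder to an LLM. Below is a minimal sketch of that connection scheme under assumed dimensions (CLIP ViT-L/14 patch features projected into a 4096-dim LLM embedding space); the module names are illustrative, not the released code. LLaVA-1 uses a single linear projection and LLaVA-1.5 a two-layer MLP, which is what the sketch shows.

```python
# Minimal sketch of a LLaVA-style vision-language connection: frozen
# vision-encoder patch features are projected into the LLM's embedding
# space and prepended to the text token embeddings. Dimensions and names
# are illustrative assumptions.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1 uses a single linear layer; LLaVA-1.5 uses a 2-layer MLP.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds: (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        image_tokens = self.proj(patch_feats)
        # The concatenated sequence is fed to the LLM as ordinary input embeddings.
        return torch.cat([image_tokens, text_embeds], dim=1)

connector = VisionLanguageConnector()
fused = connector(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```
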
Datasets | Time | Modal | Scale | Annotation | Data sources | Link | Notes |
---|---|---|---|---|---|---|---|
FILIP300M | 2021.11 | Vision, Language | 300M image-text pairs | image-text pairs | Internet | Paper | Removes images whose shorter dimension is smaller than 200 pixels and whose aspect ratio is larger than 3; keeps only English texts, excluding meaningless ones; discards image-text pairs whose texts are repeated over 10 times. |
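
The FILIP300M notes above amount to three filtering rules. The sketch below restates them as code; the helper names and the rough English-text/deduplication heuristics are assumptions for illustration, not the authors' released pipeline.

```python
# Sketch of the FILIP300M filtering rules listed above. Helper names and the
# rough English-text/deduplication heuristics are illustrative assumptions.
from collections import Counter
from PIL import Image

def keep_image(path: str) -> bool:
    """Drop images whose shorter side is < 200 px and whose aspect ratio is > 3."""
    w, h = Image.open(path).size
    short, long_ = min(w, h), max(w, h)
    return not (short < 200 and long_ / short > 3)

def keep_text(text: str) -> bool:
    """Keep only non-empty, roughly English (ASCII) texts."""
    return len(text.strip()) > 0 and text.isascii()

def filter_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """pairs: (image_path, caption). Also drops captions repeated over 10 times."""
    counts = Counter(caption for _, caption in pairs)
    return [
        (img, cap) for img, cap in pairs
        if counts[cap] <= 10 and keep_text(cap) and keep_image(img)
    ]
```
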
Name | Time | Task | Link |
---|---|---|---|
MM-Vet v2 | 2024.08 | Recognition, Knowledge, OCR, Spatial awareness, Language generation, Math, image-text sequence understanding | Paper, Github |
MMVP | 2024.01 | VQA for "CLIP-blind Pairs" | Page, Paper, Github |
MM-Vet | 2023.08 | Recognition, Knowledge, OCR, Spatial awareness, Language generation, Math | Paper, Github |
MME | 2023.06 | 14 subtasks: Existence, Count, OCR, Poster, Celebrity, Commonsense Reasoning, Text Translation... | Paper, Github |
Perception Test | 2023.05 | object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering | Paper, Github |
Name | Link | Notes |
---|---|---|
LoRA | Paper | Reduces memory usage during training by freezing the full-rank weights and optimizing only two low-rank matrices obtained from a low-rank decomposition of the weight update (see the sketch after this table). |
Data Filtering Networks(DFN) | Paper | A CLIP model with high accuracy on downstream tasks is not necessarily a good data filtering model; a small amount of high-quality pre-training data is more important. |
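
For the LoRA row above, the saving comes from freezing the full-rank weight and training only the two low-rank factors, so the optimizer tracks r·(d_in + d_out) parameters instead of d_in·d_out. A minimal sketch as a plain PyTorch `nn.Linear` wrapper, with illustrative rank and scaling values:

```python
# Minimal LoRA wrapper: the base weight is frozen and only the low-rank
# factors A and B are trained. Rank and scaling values are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the full-rank layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init => LoRA branch starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + (x A^T B^T) * scaling; gradients flow only through A and B
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 low-rank params vs ~16.8M in the frozen layer
```
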
Name | Link | Scope |
---|---|---|
Awesome-LLM-Tabular | Github | Tabular, LLM |
Awesome-LLM-3D | Github | 3D, LLM |
Awesome-Multimodal-Large-Language-Models | Github | Multimodal, Dataset, LLM |
Awesome-Multimodal-Papers | Github | LargeModel, Benchmark, Task, Dataset |