# Awesome-Multimodal-Spatio-Temporal-LLMs


## 🌱 How to participate in this awesome

You are welcome to add new multimodal works, fix errors, or make any other modifications that help make this awesome list more useful or interesting. Click here to find the contribution tutorial. We promise that your pull requests will be processed within 24 hours. Thank you for your contributions.

## ⭐ Table of Contents

1. [Surveys](#1-surveys)
2. [Analysis](#2-analysis)
3. [Models](#3-models)
4. [Datasets](#4-datasets)
5. [Benchmarks](#5-benchmarks)
6. [Technologies](#6-technologies)
7. [Other awesome](#7-other-awesome)

## 1. Surveys

| Paper | Venue | Time | Link | Notes |
| --- | --- | --- | --- | --- |
| Knowledge Mechanisms in Large Language Models: A Survey and Perspective | Arxiv | 2024.07 | Arxiv | Categorises the knowledge mechanisms of LLMs into knowledge utilization and knowledge evolution. Knowledge utilization delves into the mechanisms of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. |
| When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models | Arxiv | 2024.05 | Arxiv | Introduces background knowledge of the 3D field and the revolutionary changes that LLMs have brought to it. |
| Large Multimodal Agents: A Survey | Arxiv | 2024.02 | Arxiv | 🕐 Coming soon... |
| The Revolution of Multimodal Large Language Models: A Survey | ACL 2024 | V1: 2024.02 - V2: 2024.06 | Arxiv | 🕐 Coming soon... |
| MM-LLMs: Recent Advances in MultiModal Large Language Models | ACL 2024 | V1: 2024.01 - V5: 2024.05 | Arxiv | 🕐 Coming soon... |
| Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning | Arxiv | 2024.01 | Arxiv | 🕐 Coming soon... |
| Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey | Arxiv | 2023.12 | Arxiv | 🕐 Coming soon... |
| Multimodal Large Language Models: A Survey | BigData 2023 | 2023.11 | Arxiv | 🕐 Coming soon... |
| A Survey on Multimodal Large Language Models for Autonomous Driving | WACV 2024 | 2023.11 | Arxiv | 🕐 Coming soon... |
| Multimodal Foundation Models: From Specialists to General-Purpose Assistants | Arxiv | 2023.09 | Arxiv | 🕐 Coming soon... |
| Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models | Arxiv | 2023.08 | Arxiv | 🕐 Coming soon... |
| A Survey on Multimodal Large Language Models | Arxiv | 2023.06 | Arxiv | Discusses MLLMs from four perspectives: Multimodal Instruction Tuning, Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-Aided Visual Reasoning. |

## 2. Analysis

| Title | Time | Link | Notes |
| --- | --- | --- | --- |
| Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | 2024.01 | Paper | The authors define a "CLIP-blind pair" as two images that appear visually dissimilar but have very similar features according to CLIP's output (see the sketch below). They also use GPT to summarize the characteristics of images that the model finds challenging to recognize. |
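
To make "very similar features according to CLIP's output" concrete, here is a minimal sketch that scores a pair of images by the cosine similarity of their CLIP image embeddings, using the Hugging Face `transformers` CLIP API. The model name and the 0.95 threshold are illustrative assumptions, not values from the paper; the paper additionally cross-checks that a vision-only model (DINOv2) scores the pair as dissimilar, which this sketch omits.

```python
# Sketch: scoring a candidate "CLIP-blind pair" via cosine similarity of
# CLIP image embeddings. Model name and threshold are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    return float(feats[0] @ feats[1])

# A pair is a CLIP-blind candidate when CLIP scores it as near-identical
# even though the two images are visually different.
if clip_similarity("a.jpg", "b.jpg") > 0.95:  # threshold is an assumption
    print("candidate CLIP-blind pair")
```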

## 3. Models

| Name | Time | Modal | Params | Link | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL | 2024.08 | Image<br>Video<br>Language | 2B, 7B, 72B | Github | |
| PaliGemma | 2024.07 | Image<br>Language | 3B | Paper | VL large model focused on transfer learning. |
| MM1 | 2024.03 | Image<br>Language | 3B, 7B, 30B | Paper, Github | Ablation experiments on model architecture decisions and pre-training data choices to determine the optimal configuration. |
| MiniCPM-V | 2024.02 | Image<br>Language | 2B, 8B | Paper, Github | Lightweight VL models focusing on end-side deployment. |
| InternVL | 2023.12 | Image<br>Language | 14B, 40B | Paper | |
| LLaVA | 2023.04 | Image<br>Language | V1: 7B, 13B<br>V1.5: 7B, 13B<br>V1.6: 7B, 13B, 34B | Page, Paper1, Paper2, Github | The first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data, and to train an end-to-end large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. |

## 4. Datasets

| Datasets | Time | Modal | Scale | Annotation | Data sources | Link | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FILIP300M | 2021.11 | Vision<br>Language | 300M image-text pairs | image-text pairs | Internet | Paper | Removes images whose shorter dimension is smaller than 200 pixels or whose aspect ratio is larger than 3; keeps only English texts, excluding meaningless ones; discards image-text pairs whose texts are repeated over 10 times (see the sketch below). |
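
The filtering rules in the FILIP300M row translate almost directly into code. Below is a minimal sketch, assuming records are `(image_path, caption)` pairs; the `keep_text` heuristic is a crude stand-in for whatever language and quality filters the authors actually used, and all names here are illustrative.

```python
# Sketch of FILIP300M-style filtering rules, as summarized in the table above.
# Record layout and the text heuristic are assumptions, not from the paper.
from collections import Counter
from PIL import Image

def keep_image(path: str) -> bool:
    """Drop images with shorter side < 200 px or aspect ratio > 3."""
    with Image.open(path) as im:
        w, h = im.size
    return min(w, h) >= 200 and max(w, h) / min(w, h) <= 3

def keep_text(text: str) -> bool:
    """Keep only non-empty English-looking texts (placeholder heuristic)."""
    return bool(text.strip()) and text.isascii()

def filter_pairs(records):
    """records: iterable of (image_path, caption) pairs."""
    kept = [r for r in records if keep_image(r[0]) and keep_text(r[1])]
    counts = Counter(text for _, text in kept)
    # Discard pairs whose caption repeats more than 10 times in the corpus.
    return [(img, text) for img, text in kept if counts[text] <= 10]
```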

## 5. Benchmarks

| Name | Time | Task | Link |
| --- | --- | --- | --- |
| MM-Vet v2 | 2024.08 | Recognition, Knowledge, OCR, Spatial awareness, Language generation, Math, Image-text sequence understanding | Paper, Github |
| MMVP | 2024.01 | VQA for "CLIP-blind Pairs" | Page, Paper, Github |
| MM-Vet | 2023.08 | Recognition, Knowledge, OCR, Spatial awareness, Language generation, Math | Paper, Github |
| MME | 2023.06 | 14 subtasks: Existence, Count, OCR, Poster, Celebrity, Commonsense Reasoning, Text Translation... | Paper, Github |
| Perception Test | 2023.05 | Object tracking, point tracking, temporal action localisation, temporal sound localisation, multiple-choice video question-answering, grounded video question-answering | Paper, Github |

## 6. Technologies

| Name | Link | Notes |
| --- | --- | --- |
| LoRA | Paper | Replaces full-parameter fine-tuning with the optimization of two low-rank matrices obtained by low-rank decomposition, reducing memory usage during training (see the sketch below). |
| Data Filtering Networks (DFN) | Paper | A CLIP model with high accuracy on downstream tasks is not necessarily a good data filtering model; a small amount of high-quality pre-training data matters more. |
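
As a concrete illustration of the LoRA entry above, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. The rank `r`, scaling `alpha`, and initialization follow the paper's description (A random, B zero so the adapter starts as a no-op), but this is a sketch, not the reference `loralib` implementation.

```python
# Minimal LoRA sketch: y = W x + (alpha / r) * B A x, with the pretrained
# weight W frozen and only the low-rank factors A (r x in), B (out x r) trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA = 0 at init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only 2 * r * 4096 adapter parameters are trained instead of 4096^2.
layer = LoRALinear(nn.Linear(4096, 4096))
```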

## 7. Other awesome

| Name | Link | Scope |
| --- | --- | --- |
| Awesome-LLM-Tabular | Github | Tabular, LLM |
| Awesome-LLM-3D | Github | 3D, LLM |
| Awesome-Multimodal-Large-Language-Models | Github | Multimodal, Dataset, LLM |
| Awesome-Multimodal-Papers | Github | LargeModel, Benchmark, Task, Dataset |
