This repository collects resources on applications of multi-modal learning in medical imaging, including papers related to large language models (LLMs). Papers involving LLMs are in bold.
Pull requests and emails are welcome, whether to add links or to discuss this area. Markdown format:
- [**Name of Conference or Journal + Year**] Paper Name. [[pdf]](link) [[code]](link)
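For contributors who prefer to generate entries rather than type them by hand, the template above can be sketched as a small helper. This is only an illustration: the function name `format_entry`, its parameters, and the example URLs are hypothetical and not part of this repository.

```python
from typing import Optional


def format_entry(venue: str, year: int, title: str,
                 pdf_url: str, code_url: Optional[str] = None) -> str:
    """Render one list entry in the Markdown format described above.

    Hypothetical helper for illustration; the repository ships no such tool.
    """
    # Drop surrounding whitespace and a trailing period so the template's
    # own period is not doubled.
    clean_title = title.strip().rstrip(".")
    entry = f"- [**{venue} {year}**] {clean_title}. [[pdf]]({pdf_url})"
    # The code link is optional: many listed papers only have a pdf link.
    if code_url:
        entry += f" [[code]]({code_url})"
    return entry


if __name__ == "__main__":
    # Placeholder values, not a real paper.
    print(format_entry("MICCAI", 2023, "An Example Paper Title",
                       "https://example.org/paper.pdf",
                       "https://example.org/code"))
```

The output line can be pasted directly into the matching section of the list.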
- [2025-01] 🔥We release a new paper on clinical-aware preference learning for Med-VLMs: "MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization" and 🎉 MMed-RAG was accepted at ICLR'25!
- [2024-10] 🔥🔥We release a new paper on a versatile multimodal RAG system for Med-VLMs: "MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models".
- [2024-09] 🎉🎉 CARES was accepted at NeurIPS'24, RULE was accepted at EMNLP'24 main conference!
- [2024-07] 🔥🔥We release a new paper on enhancing the factuality of Med-VLMs with RAG: "RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models".
- [2024-06] 🔥🔥We release a new paper on evaluating Med-VLMs: "CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models".
- [2022-07] We created this repository to maintain a paper list on multimodal applications in medical imaging.
@article{xia2024cares,
title={CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models},
author={Xia, Peng and Chen, Ze and Tian, Juanxi and Gong, Yangrui and Hou, Ruibo and Xu, Yue and Wu, Zhenbang and Fan, Zhiyuan and Zhou, Yiyang and Zhu, Kangyu and others},
journal={arXiv preprint arXiv:2406.06007},
year={2024}
}
@inproceedings{xia2024rule,
title={RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models},
author={Xia, Peng and Zhu, Kangyu and Li, Haoran and Zhu, Hongtu and Li, Yun and Li, Gang and Zhang, Linjun and Yao, Huaxiu},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={1081--1093},
year={2024}
}
@article{xia2024mmed,
title={MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models},
author={Xia, Peng and Zhu, Kangyu and Li, Haoran and Wang, Tianze and Shi, Weijia and Wang, Sheng and Zhang, Linjun and Zou, James and Yao, Huaxiu},
journal={arXiv preprint arXiv:2410.13085},
year={2024}
}
@article{zhu2024mmedpo,
title={MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization},
author={Zhu, Kangyu and Xia, Peng and Li, Yun and Zhu, Hongtu and Wang, Sheng and Yao, Huaxiu},
journal={arXiv preprint arXiv:2412.06141},
year={2024}
}
- Data Source
- Survey
- Medical Report Generation
- Medical Visual Question Answering
- Medical Vision-Language Model
dataset | domain | image | text | source | language |
---|---|---|---|---|---|
ROCO | multiple | 87K | 87K | research papers | En |
MedICaT | multiple | 217K | 217K | research papers | En |
PMC-OA | multiple | 1.6M | 1.6M | research papers | En |
ChiMed-VL | multiple | 580K | 580K | research papers | En/zh |
FFA-IR | fundus | 1M | 10K | medical reports | En/zh |
PadChest | cxr | 160K | 109K | medical reports | Sp |
MIMIC-CXR | cxr | 377K | 227K | medical reports | En |
OpenPath | histology | 208K | 208K | social media | En |
Quilt-1M | histology | 1M | 1M | research papers, social media | En |
Harvard-FairVLMed | fundus | 10K | 10K | medical reports | En |
MedTrinity-25M | multiple | 25M | 25M | research papers, social media | En |
dataset | domain | image | QA Items | language |
---|---|---|---|---|
VQA-RAD | radiology | 315 | 3k | En |
SLAKE | radiology | 642 | 14k | En/zh |
Path-VQA | histology | 5k | 32k | En |
VQA-Med | radiology | 4.5k | 5.5k | En |
PMC-VQA | multiple | 149k | 227k | En |
OmniMedVQA | multiple | 118k | 128k | En |
ProbMed | radiology | 6k | 57k | En |
PubMedVision | multiple | 914k | 1.3M | En |
- [arXiv 2022] Visual Attention Methods in Deep Learning: An In-Depth Survey [pdf]
- [arXiv 2022] Vision+X: A Survey on Multimodal Learning in the Light of Data [pdf]
- [arXiv 2023] Vision Language Models for Vision Tasks: A Survey [pdf] [code]
- [arXiv 2023] A Systematic Review of Deep Learning-based Research on Radiology Report Generation [pdf] [code]
- [Artif Intell Med 2023] Medical Visual Question Answering: A Survey [pdf]
- [arXiv 2023] Medical Vision Language Pretraining: A survey [pdf]
- [arXiv 2023] CLIP in Medical Imaging: A Comprehensive Survey [pdf] [code]
- [arXiv 2024] Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review [pdf] [code]
- [arXiv 2024] A Survey of Medical Vision-and-Language Applications and Their Techniques [pdf] [code]
- [EMNLP 2018] Automated Generation of Accurate & Fluent Medical X-ray Reports [pdf] [code]
- [ACL 2018] On the Automatic Generation of Medical Imaging Reports [pdf] [code]
- [NeurIPS 2018] Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation [pdf]
- [AAAI 2019] Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation [pdf]
- [ICDM 2019] Automatic Generation of Medical Imaging Diagnostic Report with Hierarchical Recurrent Neural Network [pdf]
- [MICCAI 2019] Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment [pdf]
- [AAAI 2020] When Radiology Report Generation Meets Knowledge Graph [pdf]
- [EMNLP 2020] Generating Radiology Reports via Memory-driven Transformer [pdf] [code]
- [ACCV 2020] Hierarchical X-Ray Report Generation via Pathology tags and Multi Head Attention [pdf] [code]
- [NeurIPS 2021] FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark [pdf] [code]
- [ACL 2021] Competence-based Multimodal Curriculum Learning for Medical Report Generation [pdf]
- [CVPR 2021] Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation [pdf]
- [MICCAI 2021] AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [pdf]
- [NAACL 2021] Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation [pdf] [code]
- [MICCAI 2021] RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting [pdf] [code]
- [MICCAI 2021] Trust It or Not: Confidence-Guided Automatic Radiology Report Generation [pdf]
- [MICCAI 2021] Surgical Instruction Generation with Transformers [pdf]
- [MICCAI 2021] Class-Incremental Domain Adaptation with Smoothing and Calibration for Surgical Report Generation [pdf] [code]
- [ACL 2021] Cross-modal Memory Networks for Radiology Report Generation [pdf] [code]
- [CVPR 2022] Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [pdf]
- [Nature Machine Intelligence 2022] Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports [pdf] [code]
- [MICCAI 2022] A Self-Guided Framework for Radiology Report Generation [pdf]
- [MICCAI 2022] A Medical Semantic-Assisted Transformer for Radiographic Report Generation [pdf]
- [MIDL 2022] Representative Image Feature Extraction via Contrastive Learning Pretraining for Chest X-ray Report Generation [pdf]
- [MICCAI 2022] RepsNet: Combining Vision with Language for Automated Medical Reports [pdf] [code]
- [ICML 2022] Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors [pdf]
- [TNNLS 2022] Hybrid Reinforced Medical Report Generation with M-Linear Attention and Repetition Penalty [pdf]
- [MedIA 2022] CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation [pdf]
- [MedIA 2022] Knowledge matters: Chest radiology report generation with general and specific knowledge [pdf] [code]
- [MICCAI 2022] Lesion Guided Explainable Few Weak-shot Medical Report Generation [pdf] [code]
- [BMVC 2022] On the Importance of Image Encoding in Automated Chest X-Ray Report Generation [pdf] [code]
- [arXiv 2022] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [pdf]
- [COLING 2022] DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis [pdf] [code]
- [ECCV 2022] Cross-modal Prototype Driven Network for Radiology Report Generation [pdf] [code]
- [ICIP 2023] Self adaptive global-local feature enhancement for radiology report generation [pdf]
- [TMI 2023] Attributed Abnormality Graph Embedding for Clinically Accurate X-Ray Report Generation [pdf]
- [arXiv 2023] Unified Chest X-ray and Radiology Report Generation Model with Multi-view Chest X-rays [pdf] [code]
- [WWW 2023] Auxiliary signal-guided knowledge encoder-decoder for medical report generation [pdf]
- [CVPR 2023] Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation [pdf] [code]
- [CVPR 2023] KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation [pdf]
- [CVPR 2023] Interactive and Explainable Region-guided Radiology Report Generation [pdf] [code]
- [MIDL 2023] Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation [pdf] [code]
- [arXiv 2023] Visual-Linguistic Causal Intervention for Radiology Report Generation [pdf] [code]
- [MIDL 2023] Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [pdf]
- [arXiv 2023] Cross-Modal Causal Intervention for Medical Report Generation [pdf] [code]
- [ICASSP 2023] MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation [pdf]
- [CHIL 2023] Token Imbalance Adaptation for Radiology Report Generation [pdf] [code]
- [AAAI 2023] "Nothing Abnormal": Disambiguating Medical Reports via Contrastive Knowledge Infusion [pdf] [code]
- [arXiv 2023] S4M: Generating Radiology Reports by A Single Model for Multiple Body Parts [pdf] [code]
- [ACL 2023] Replace and Report: NLP Assisted Radiology Report Generation [pdf]
- [ICCV 2023] PRIOR: Prototype Representation Joint Learning from Medical Images and Reports [pdf] [code]
- [ICMLW 2023] Rethinking Medical Report Generation: Disease Revealing Enhancement with Knowledge Graph [pdf] [code]
- [MICCAI 2023] Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting [pdf] [code]
- [MLMIW 2023] Finding-Aware Anatomical Tokens for Chest X-Ray Automated Reporting [pdf]
- [MedIA 2023] C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [pdf]
- [EMNLP 2023 Findings] Controllable Chest X-Ray Report Generation from Longitudinal Representations [pdf]
- [BIBM 2023] Enhanced Knowledge Injection for Radiology Report Generation [pdf]
- [EMNLP 2023 Findings] Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting [pdf]
- [ACL 2023] ORGAN: Observation-Guided Radiology Report Generation via Tree-Reasoning [pdf] [code]
- [EMNLP 2023 Findings] RECAP: Towards Precise Radiology Report Generation via Dynamic Disease Progression Reasoning [pdf] [code]
- [NeurIPSW 2023] Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation [pdf]
- [arXiv 2023] Radiology-Aware Model-Based Evaluation Metric for Report Generation [pdf]
- [EMNLP 2023] PhenotypeCLIP: Phenotype-based Contrastive Learning for Medical Imaging Report Generation [pdf]
- [arXiv 2023] Fine-Grained Image-Text Alignment in Medical Imaging Enables Cyclic Image-Report Generation [pdf]
- [arXiv 2023] Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models [pdf]
- [NLPCC 2023] Medical Report Generation based on Segment-Enhanced Contrastive Representation Learning [pdf]
- [MICCAI 2023] SGT: Scene Graph-Guided Transformer for Surgical Report Generation [pdf] [code]
- [ICASSP 2024] Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning [pdf] [code]
- [AAAI 2024] PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation [pdf] [code]
- [WACV 2024] Complex Organ Mask Guided Radiology Report Generation [pdf] [code]
- [TMM 2024] From Observation to Concept: A Flexible Multi-view Paradigm for Medical Report Generation [pdf]
- [TMI 2024] SGT++: Improved Scene Graph-guided Transformer for Surgical Report Generation [pdf]
- [arXiv 2024] Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation [pdf]
- [arXiv 2024] Dual-modal Dynamic Traceback Learning for Medical Report Generation [pdf]
- [arXiv 2024] MedCycle: Unpaired Medical Report Generation via Cycle-Consistency [pdf]
- [arXiv 2024] Scene Graph Aided Radiology Report Generation [pdf]
- [ACL 2024 Findings] Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation [pdf] [code]
- [arXiv 2024] TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Models [pdf]
- [arXiv 2021] MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [pdf]
- [Scientific Reports 2021] MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain [pdf]
- [MICCAI 2022] Consistency-preserving Visual Question Answering in Medical Imaging [pdf] [code]
- [MICCAI 2022] Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer [pdf] [code]
- [ECCV 2022] Distilled Dual-Encoder Model for Vision-Language Understanding [pdf] [code]
- [arXiv 2022] UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering [pdf]
- [TMI 2023] A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering [pdf] [code]
- [ISBI 2023] MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering [pdf]
- [ISBI 2023] Self-supervised vision-language pretraining for Medical visual question answering [pdf] [code]
- [arXiv 2023] Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning [pdf]
- [MM 2023] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [pdf] [code]
- [IPMI 2023] Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder [pdf]
- [MICCAI 2023] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models [pdf] [code]
- [arXiv 2023] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [pdf] [code]
- [MICCAI 2023] Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [pdf] [code]
- [MICCAI 2023] Localized Questions in Medical Visual Question Answering [pdf] [code]
- [arXiv 2023] Multimodal Prompt Retrieval for Generative Visual Question Answering [pdf] [code]
- [KDD 2023] Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering [pdf] [code]
- [NeurIPS 2023 D&B] EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images [pdf] [code]
- [MICCAI 2023] Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting [pdf] [code]
- [arXiv 2023] BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering [pdf] [demo]
- [NeurIPS 2023] Quilt-1M: One Million Image-Text Pairs for Histopathology [pdf] [code-demo]
- [arXiv 2024] MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis [pdf] [code]
- [arXiv 2024] PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering [pdf] [code]
- [ICASSP 2024] Prompt-based Personalized Federated Learning for Medical Visual Question Answering [pdf]
- [arXiv 2024] RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning [pdf]
- [arXiv 2024] Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training [pdf]
- [arXiv 2024] Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA [pdf] [code]
- [IF 2024] Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery [pdf] [code]
- [EMNLP 2022] MedCLIP: Contrastive Learning from Unpaired Medical Images and Text [pdf] [code]
- [NeurIPSW 2022] Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains [pdf]
- [ACL 2022] ViLMedic: a framework for research at the intersection of vision and language in medical AI [pdf] [code]
- [MICCAI 2022] Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training [pdf] [code]
- [JBHI 2022] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training [pdf] [code]
- [AAAI 2022] Clinical-BERT: Vision-Language Pre-training for Radiograph Diagnosis and Reports Generation [pdf]
- [JBHI 2022] Vision-language transformer for interpretable pathology visual question answering [link]
- [arXiv 2022] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [pdf]
- [ECCV 2022] Making the most of text semantics to improve biomedical vision–language processing [pdf]
- [MICCAI 2022] RepsNet: Combining Vision with Language for Automated Medical Reports [pdf] [code]
- [NeurIPS 2022] Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [pdf] [code]
- [MICCAI 2022] BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis [pdf]
- [TMI 2023] LViT: Language meets Vision Transformer in Medical Image Segmentation [pdf] [code]
- [ICCV 2023] Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [pdf] [code]
- [ICCV 2023] CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection [pdf] [code]
- [arXiv 2023] Towards General Purpose Medical AI: Continual Learning Medical Foundation Model [pdf]
- [arXiv 2023] Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing [pdf] [code]
- [ICLR 2023] Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study [pdf] [code]
- [ICLR 2023] Advancing Radiograph Representation Learning with Masked Record Modeling [pdf] [code]
- [MICCAI 2023] PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents [pdf]
- [arXiv 2023] ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models [pdf] [code]
- [ICCV 2023] MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training [pdf] [project]
- [CVPR 2023] Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [pdf]
- [CVPRW 2023] One-shot and Partially-Supervised Cell Image Segmentation Using Small Visual Prompt [pdf]
- [MICCAI 2023] CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction [pdf]
- [MICCAI 2023] UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong Representation Learner [pdf] [code]
- [ICCV 2023] UniverSeg: Universal Medical Image Segmentation [pdf] [project website]
- [ICCV 2023] LIMITR: Leveraging Local Information for Medical Image-Text Representation [pdf] [code]
- [arXiv 2023] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [pdf] [code]
- [arXiv 2023] BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks [pdf] [code]
- [CHIL 2023] Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark [pdf] [code]
- [NeurIPS 2023] Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias [pdf]
- [arXiv 2023] OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue [pdf] [code]
- [ICMLW 2023] A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis [pdf]
- [MICCAI 2023] M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization [pdf] [code]
- [arXiv 2023] Towards Generalist Biomedical AI [pdf] [Med-PaLM]
- [MICCAI 2023] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training [pdf] [code]
- [MICCAI 2023] Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [pdf]
- [arXiv 2023] Few-shot medical image classification with simple shape and texture text descriptors using vision-language models [pdf] [code]
- [ICMLW 2023] Med-Flamingo: a Multimodal Medical Few-shot Learner [pdf] [code]
- [MICCAI 2023] Ariadne's Thread: Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray images [pdf] [code]
- [arXiv 2023] A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision [pdf] [code]
- [ICCV 2023] ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data [pdf] [code]
- [arXiv 2023] IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training [pdf]
- [arXiv 2023] Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images [pdf]
- [arXiv 2023] RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance [pdf] [code]
- [MICCAI 2023] CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training [pdf] [code]
- [MICCAI 2023] Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment [pdf] [code]
- [arXiv 2023] BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys [pdf] [project]
- [arXiv 2023] Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare [pdf] [code]
- [NeurIPS 2023] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [pdf] [code]
- [arXiv 2023] Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data [pdf] [code]
- [arXiv 2023] RO-LLaMA: Generalist LLM for Radiation Oncology via Noise Augmentation and Consistency Regularization [pdf]
- [arXiv 2023] MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model [pdf]
- [arXiv 2023] G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training [pdf]
- [npj Digital Medicine 2023] A medical multimodal large language model for future pandemics [pdf]
- [arXiv 2023] A Foundational Multimodal Vision Language AI Assistant for Human Pathology [pdf]
- [arXiv 2023] ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training [pdf] [code]
- [Nature Medicine 2023] A visual–language foundation model for pathology image analysis using medical Twitter [pdf] [code]
- [PAKDD 2023] Cascaded Latent Diffusion Models for High-Resolution Chest X-ray Synthesis [pdf] [code]
- [CVPR 2024] Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos [pdf] [code]
- [ICASSP 2024] Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [pdf]
- [arXiv 2024] Vulnerabilities Unveiled: Adversarially Attacking a Multimodal Vision Language Model for Pathology Imaging [pdf]
- [arXiv 2024] Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval [pdf]
- [arXiv 2024] CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation [pdf] [code]
- [TMM 2024] UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts [pdf]
- [CVPR 2024] OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [pdf]
- [CVPR 2024] Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [pdf] [code]
- [ICLR 2024] LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [pdf] [code]
- [arXiv 2024] Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns [pdf]
- [arXiv 2024] DeViDe: Faceted medical knowledge for improved medical vision-language pre-training [pdf]
- [arXiv 2024] M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models [pdf] [code]
- [arXiv 2024] Dia-LLaMA: Towards Large Language Model-driven CT Report Generation [pdf]
- [arXiv 2024] WoLF: Wide-scope Large Language Model Framework for CXR Understanding [pdf]
- [CVPR 2024] Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework [pdf] [code]
- [arXiv 2024] Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning [pdf]
- [arXiv 2024] MedRG: Medical Report Grounding with Multi-modal Large Language Model [pdf]
- [CVPR 2024] Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models [pdf] [code]
- [CVPR 2024] Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning [pdf] [code]
- [CVPR 2024] PairAug: What Can Augmented Image-Text Pairs Do for Radiology? [pdf] [code]
- [CVPR 2024] MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [pdf]
- [Nature Medicine 2024] A visual-language foundation model for computational pathology [pdf] [code]
- [Nature Medicine 2024] Vision–language foundation model for echocardiogram interpretation [pdf] [code]
- [TMI 2024] ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs [pdf] [code]
- [arXiv 2024] MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning [pdf] [code]
- [NeurIPS 2024] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models [pdf] [code]
- [MIDL 2024] Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [pdf] [code]
- [arXiv 2024] Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [pdf] [code]
- [arXiv 2024] Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [pdf] [code]
- [arXiv 2024] Merlin: A Vision Language Foundation Model for 3D Computed Tomography [pdf]
- [arXiv 2024] Advancing High Resolution Vision-Language Models in Biomedicine [pdf] [code]
- [EMNLP 2024] HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale [pdf] [code]
- [EMNLP 2024] STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [pdf] [code]
- [EMNLP 2024] RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models [pdf] [code]
- [MICCAI 2024] CLIP-DR: Textual Knowledge-Guided Diabetic Retinopathy Grading with Ranking-aware Prompting [pdf] [code]
- [arXiv 2024] PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding [pdf] [code]
- [arXiv 2024] LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning [pdf]
- [NeurIPS 2024] GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [pdf] [code]
- [arXiv 2024] VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge [pdf] [code]
- [arXiv 2024] GP-VLS: A general-purpose vision language model for surgery [pdf] [code]
- [arXiv 2024] Specialist vision-language models for clinical ophthalmology [pdf]
- [arXiv 2024] MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis [pdf] [code]
- [arXiv 2024] MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [pdf] [code]
- [arXiv 2024] Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm [pdf]
- [arXiv 2024] LOGRA-MED: Long Context Multi-Graph Alignment For Medical Vision-Language Model [pdf]
- [arXiv 2024] WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation [pdf] [code]
- [arXiv 2024] VividMed: Vision Language Model with Versatile Visual Grounding for Medicine [pdf] [code]
- [arXiv 2024] Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback [pdf]
- [arXiv 2024] MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation [pdf]
- [arXiv 2024] Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks [pdf] [code]
- [arXiv 2024] E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model [pdf]
- [NeurIPS 2024] BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays [pdf] [code]
- [EMNLP 2024] Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? [pdf] [code]
- [arXiv 2024] SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation [pdf] [code]
- [arXiv 2024] Training Medical Large Vision-Language Models with Abnormal-Aware Feedback [pdf]
- [arXiv 2024] MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization [pdf] [code]
- [arXiv 2024] Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation [pdf]
- [arXiv 2024] VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge [pdf]
- [arXiv 2024] GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI [pdf] [code]
- [NeurIPS 2024] Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement [pdf] [code]
- [AAAI 2025] Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine [pdf] [code]
- [AAAI 2025] KPL: Training-Free Medical Knowledge Mining of Vision-Language Models [pdf] [code]
- [ICLR 2025] MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [pdf] [code]
- [ICLR 2025] MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [pdf] [code]
⭐ Join us in improving this repository! If you know of any important works we've missed, please contribute. Your efforts are highly valued!