# Multimodal Large Language Models

## Awesome Papers
### Multimodal Instruction Tuning

### Multimodal Hallucination

### Multimodal In-Context Learning

### Multimodal Chain-of-Thought

### LLM-Aided Visual Reasoning

### Foundation Models

### Evaluation

### Multimodal RLHF
| Title | Venue | Date | Code | Demo |
|---|---|---|---|---|
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
### Others
| Title | Venue | Date | Code | Demo |
|---|---|---|---|---|
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
| Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
| Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |
## Awesome Datasets

### Datasets of Pre-Training for Alignment

### Datasets of Multimodal Instruction Tuning
| Name | Paper | Link | Notes |
|---|---|---|---|
| UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions |
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V |
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don’t Know | Link | Dehallucinative visual instruction for “I Know” hallucination |
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
| M3DBench | M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo |
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset including multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset for addressing the hallucination issue |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
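Many of the instruction-tuning sets above (LLaVA-Instruct-150K and several of its derivatives, such as LVIS-Instruct4V and ViP-LLaVA-Instruct) share a simple JSON conversation schema. As a rough sketch only, assuming the field names used in the LLaVA release (`conversations`, `from`, `value`; other datasets may differ), such records can be flattened into prompt/response pairs for training:

```python
def conversation_to_pairs(record):
    """Flatten a LLaVA-style instruction record into (prompt, response) pairs.

    Assumes the LLaVA-Instruct-150K schema: a "conversations" list of
    alternating {"from": "human" / "gpt", "value": ...} turns, where the
    first human turn may embed an "<image>" placeholder token.
    """
    turns = record["conversations"]
    pairs = []
    for i in range(0, len(turns) - 1, 2):
        if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt":
            pairs.append((turns[i]["value"], turns[i + 1]["value"]))
    return pairs

# Hypothetical record in the assumed schema:
sample = {
    "id": "0",
    "image": "coco/train2017/000000000009.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the image?"},
        {"from": "gpt", "value": "A plate of food on a table."},
    ],
}
print(conversation_to_pairs(sample))
```

Multi-turn records flatten the same way: each human/gpt pair becomes one training example, with the image reference carried alongside.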
### Datasets of In-Context Learning
| Name | Paper | Link | Notes |
|---|---|---|---|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
### Datasets of Multimodal Chain-of-Thought
| Name | Paper | Link | Notes |
|---|---|---|---|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
| VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
### Datasets of Multimodal RLHF
| Name | Paper | Link | Notes |
|---|---|---|---|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
### Benchmarks for Evaluation

### Others
| Name | Paper | Link | Notes |
|---|---|---|---|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset focused on recognizing visual entities from Wikipedia in images taken in the wild |