# Multimodal Large Language Models

## Awesome Papers
### Multimodal Instruction Tuning

### Multimodal Hallucination

### Multimodal In-Context Learning

### Multimodal Chain-of-Thought

### LLM-Aided Visual Reasoning

### Foundation Models

### Evaluation

### Multimodal RLHF
| Title | Venue | Date | Code | Demo |
|---|---|---|---|---|
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
### Others
| Title | Venue | Date | Code | Demo |
|---|---|---|---|---|
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
| Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
| Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |
## Awesome Datasets

### Datasets of Pre-Training for Alignment

### Datasets of Multimodal Instruction Tuning
| Name | Paper | Link | Notes |
|---|---|---|---|
| UNK-VQA | UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models | Link | A dataset designed to teach models to refrain from answering unanswerable questions |
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V |
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don’t Know | Link | Dehallucinative visual instruction for “I Know” hallucination |
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
| M3DBench | M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo |
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset including multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset for addressing the hallucination issue |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
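Many of the instruction-tuning sets above (LLaVA-Instruct-150K and several of its derivatives, such as LVIS-Instruct4V and ViP-LLaVA-Instruct) share a simple JSON conversation schema. As a rough sketch only, assuming the field names used in the LLaVA release (`conversations`, `from`, `value`; other datasets may differ), such records can be flattened into prompt/response pairs for training:

```python
def conversation_to_pairs(record):
    """Flatten a LLaVA-style instruction record into (prompt, response) pairs.

    Assumes the LLaVA-Instruct-150K schema: a "conversations" list of
    alternating {"from": "human" / "gpt", "value": ...} turns, where the
    first human turn may embed an "<image>" placeholder token.
    """
    turns = record["conversations"]
    pairs = []
    for i in range(0, len(turns) - 1, 2):
        if turns[i]["from"] == "human" and turns[i + 1]["from"] == "gpt":
            pairs.append((turns[i]["value"], turns[i + 1]["value"]))
    return pairs

# Hypothetical record in the assumed schema:
sample = {
    "id": "0",
    "image": "coco/train2017/000000000009.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the image?"},
        {"from": "gpt", "value": "A plate of food on a table."},
    ],
}
print(conversation_to_pairs(sample))
```

Multi-turn records flatten the same way: each human/gpt pair becomes one training example, with the image reference carried alongside.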
### Datasets of In-Context Learning
| Name | Paper | Link | Notes |
|---|---|---|---|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
### Datasets of Multimodal Chain-of-Thought
| Name | Paper | Link | Notes |
|---|---|---|---|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
| VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
### Datasets of Multimodal RLHF
| Name | Paper | Link | Notes |
|---|---|---|---|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
### Benchmarks for Evaluation

### Others
| Name | Paper | Link | Notes |
|---|---|---|---|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset focused on recognizing visual entities from Wikipedia in images taken in the wild |