Multimodal Large Language Model Tutorial
Abstract
Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. As a multidisciplinary research field, multimodal large language models (MLLMs) have recently garnered growing interest in both academia and industry, and are increasingly pursued as a route toward human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, vision, audio, and other sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on four key areas: MLLM architecture design, instruction tuning and hallucination, multimodal reasoning, and efficient learning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.
Literature
Section I: LLMs and MLLMs
- OpenAI, 2022, Introducing ChatGPT
- OpenAI, 2023, GPT-4 Technical Report
- Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
- Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
- Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
- Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
- Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
- Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
- Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
- Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
- Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
- Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
- Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
- Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
- Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
- Wang, et al., 2024, ModaVerse: Efficiently Transforming Modalities with LLMs
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
- Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
- Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
- Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
- Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
- Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
- Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
- Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
- Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
- Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
- Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
- Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
- Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
- Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
- Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
- Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
- Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
- Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
- Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
- Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
- Frey, et al., 2023, Neural Scaling of Deep Chemical Models
- Zhang, et al., 2024, ChemLLM: A Chemical Large Language Model
- Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
- Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
- Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
- Koh, et al., 2023, Generating Images with Multimodal Language Models
- Sun, et al., 2023, Generative Pretraining in Multimodality
- Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
- Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
- Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
- Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
- Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
- Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
- Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
- Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
- Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
- Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
- Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
- Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
- Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
- Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
- Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
- Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
- Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
- Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
- Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
- Thagard, et al., 1997, Abductive Reasoning: Logic, Visual Thinking, and Coherence
- Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents
Section II: Instruction Tuning & Hallucination
- Liu, et al., 2023, Visual Instruction Tuning
- Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
- Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
- Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
- Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
- Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
- Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
- Yin, et al., 2023, A Survey on Multimodal Large Language Models
- Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models
Section III: Reasoning with LLM
- Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
- Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
- Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
- Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
- Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
- Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
- Prystawski, et al., 2023, Why Think Step by Step? Reasoning Emerges from the Locality of Experience
- Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
- Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
- Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Section IV: Efficient Learning
- Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
- Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
- Yao, et al., 2024, MiniCPM-V
- DeepSpeed Team, 2020, DeepSpeed Blog
- Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Chen, et al., 2023, MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
- Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
- Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
- Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
- Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
- Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation