Stop Thinking, Just Do!

Sungsoo Kim's Blog

Multimodal Large Language Model Tutorial

17 July 2024

Article Source

Multimodal Large Language Model Tutorial

Abstract

Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities. Multimodal large language models (MLLMs), a multidisciplinary research field, have recently attracted growing interest in both academia and industry, and are increasingly viewed as a path toward human-level AI. These large models offer an effective vehicle for understanding, reasoning, and planning by integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on four key areas: MLLM architecture design, instruction tuning and hallucination, multimodal reasoning, and efficient learning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research.

Literature

Section I: LLMs and MLLMs

OpenAI, 2022, Introducing ChatGPT
OpenAI, 2023, GPT-4 Technical Report
Alayrac, et al., 2022, Flamingo: a Visual Language Model for Few-Shot Learning
Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Wu, et al., 2023, Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Shen, et al., 2023, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Tang, et al., 2023, Any-to-Any Generation via Composable Diffusion
Girdhar, et al., 2023, ImageBind: One Embedding Space To Bind Them All
Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
Moon, et al., 2023, AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Hu, et al., 2023, Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Bai, et al., 2023, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Wang, et al., 2023, CogVLM: Visual Expert for Pretrained Language Models
Peng, et al., 2023, Kosmos-2: Grounding Multimodal Large Language Models to the World
Dong, et al., 2024, InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Zhu, et al., 2023, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Ge, et al., 2023, Planting a SEED of Vision in Large Language Model
Zhan, et al., 2024, AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Kondratyuk, et al., 2023, VideoPoet: A Large Language Model for Zero-Shot Video Generation
Zhang, et al., 2023, SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Zeghidour, et al., 2021, SoundStream: An End-to-End Neural Audio Codec
Liu, et al., 2023, Improved Baselines with Visual Instruction Tuning
Wang, et al., 2024, ModaVerse: Efficiently Transforming Modalities with LLMs
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Lu, et al., 2023, Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Bai, et al., 2023, LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models
Huang, et al., 2023, Language Is Not All You Need: Aligning Perception with Language Models
Li, et al., 2023, VideoChat: Chat-Centric Video Understanding
Maaz, et al., 2023, Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Zhang, et al., 2023, Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Lin, et al., 2023, Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Qian, et al., 2024, Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Hong, et al., 2023, 3D-LLM: Injecting the 3D World into Large Language Models
Sun, et al., 2023, 3D-GPT: Procedural 3D Modeling with Large Language Models
Chen, et al., 2023, LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Xu, et al., 2023, PointLLM: Empowering Large Language Models to Understand Point Clouds
Chen, et al., 2024, SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Huang, et al., 2023, AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Zhang, et al., 2023, SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
Wang, et al., 2023, VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Rubenstein, et al., 2023, AudioPaLM: A Large Language Model That Can Speak and Listen
Tang, et al., 2023, SALMONN: Towards Generic Hearing Abilities for Large Language Models
Latif, et al., 2023, Sparks of Large Audio Models: A Survey and Outlook
Luo, et al., 2022, BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
Li, et al., 2023, DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
Chen, et al., 2023, MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
Wang, et al., 2023, HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
Zhang, et al., 2023, AlpaCare: Instruction-tuned Large Language Models for Medical Application
Frey, et al., 2023, Neural Scaling of Deep Chemical Models
Zhang, et al., 2024, ChemLLM: A Chemical Large Language Model
Liu, et al., 2023, MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
Jiang, et al., 2023, StructGPT: A General Framework for Large Language Model to Reason on Structured Data
Chen, et al., 2024, LLaGA: Large Language and Graph Assistant
Koh, et al., 2023, Generating Images with Multimodal Language Models
Sun, et al., 2023, Generative Pretraining in Multimodality
Zheng, et al., 2023, MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Dong, et al., 2023, DreamLLM: Synergistic Multimodal Comprehension and Creation
Liu, et al., 2023, LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Wang, et al., 2023, GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Jin, et al., 2024, Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Jin, et al., 2023, Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Li, et al., 2023, LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Su, et al., 2023, PandaGPT: One Model to Instruction-Follow Them All
Lyu, et al., 2023, Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Tang, et al., 2023, CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Zhang, et al., 2023, GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Yuan, et al., 2023, Osprey: Pixel Understanding with Visual Instruction Tuning
Rasheed, et al., 2023, GLaMM: Pixel Grounding Large Multimodal Model
Pi, et al., 2023, DetGPT: Detect What You Need via Reasoning
Ren, et al., 2023, PixelLM: Pixel Reasoning with Large Multimodal Model
Lai, et al., 2023, LISA: Reasoning Segmentation via Large Language Model
Chen, et al., 2023, Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Munasinghe, et al., 2023, PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
Yu, et al., 2023, Merlin: Empowering Multimodal LLMs with Foresight Minds
Fu, et al., 2023, MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Xu, et al., 2023, LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Ying, et al., 2024, MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Pan, et al., 2024, Auto-Encoding Morph-Tokens for Multimodal LLM
Thagard, et al., 1997, Abductive reasoning: Logic, visual thinking, and coherence
Bavishi, et al., 2023, Fuyu-8B: A Multimodal Architecture for AI Agents

Section II: Instruction Tuning & Hallucination

Liu, et al., 2023, Visual Instruction Tuning
Liu, et al., 2023, Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Gao, et al., 2023, LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Zhao, et al., 2023, SVIT: Scaling up Visual Instruction Tuning
Ye, et al., 2023, mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Yu, et al., 2023, RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Liu, et al., 2023, MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Liu, et al., 2023, HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Li, et al., 2023, Evaluating Object Hallucination in Large Vision-Language Models
Huang, et al., 2023, Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Yin, et al., 2023, A Survey on Multimodal Large Language Models
Yin, et al., 2023, Woodpecker: Hallucination Correction for Multimodal Large Language Models

Section III: Reasoning with LLM

Zhang, et al., 2023, Multimodal Chain-of-Thought Reasoning in Language Models
Zhao, et al., 2023, MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Lu, et al., 2023, Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Zhang, et al., 2023, You Only Look at Screens: Multimodal Chain-of-Action Agents
Sun, et al., 2023, Generative Multimodal Models are In-Context Learners
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Wei, et al., 2023, Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Zhang, et al., 2023, Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Fei, et al., 2024, Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Prystawski, et al., 2023, Why think step by step? Reasoning emerges from the locality of experience
Gou, et al., 2023, CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Tang, et al., 2024, Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
Yuan, et al., 2024, R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

Section IV: Efficient Learning

Hu, et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models
Dettmers, et al., 2023, QLoRA: Efficient Finetuning of Quantized LLMs
Li, et al., 2023, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Luo, et al., 2023, Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Yao, et al., 2024, MiniCPM-V
DeepSpeed Team, 2020, DeepSpeed Blog
Zhao, et al., 2023, PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Zhu, et al., 2023, MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Chen, et al., 2023, MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Hong, et al., 2023, CogAgent: A Visual Language Model for GUI Agents
Chen, et al., 2024, How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Dehghani, et al., 2023, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Zhang, et al., 2023, VPGTrans: Transfer Visual Prompt Generator across LLMs
Wu, et al., 2023, NExT-GPT: Any-to-Any Multimodal LLM
Fei, et al., 2024, VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Zhang, et al., 2024, NExT-Chat: An LMM for Chat, Detection and Segmentation