Awesome Physical AI
A curated list of academic papers and resources on Physical AI — focusing on Vision-Language-Action (VLA) models, world models, embodied AI, and robotic foundation models.
Physical AI refers to AI systems that interact with and manipulate the physical world through robotic embodiments, combining perception, reasoning, and action in real-world environments.
Table of Contents
- Foundations
- VLA Architectures
- Action Representation
- World Models
- Reasoning & Planning
- Learning Paradigms
- Scaling & Generalization
- Deployment
- Safety & Alignment
- Lifelong Learning
- Applications
- Sim-to-Real Transfer
- Surveys
- Resources
- Companies & Projects
- Related Works
Foundations
Vision-Language Backbones
Core vision-language models that serve as pretrained backbones for Physical AI systems.
- CLIP: “Learning Transferable Visual Models From Natural Language Supervision”, ICML 2021. [Paper] [Code]
- Foundational model aligning vision and language that underlies most VLA perception systems (see the usage sketch after this list).
- SigLIP: “Sigmoid Loss for Language Image Pre-Training”, ICCV 2023. [Paper]
- PaLI-X: “PaLI-X: On Scaling up a Multilingual Vision and Language Model”, CVPR 2024. [Paper]
- LLaVA: “Visual Instruction Tuning”, NeurIPS 2023. [Paper] [Project]
- Prismatic VLMs: “Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models”, ICML 2024. [Paper] [Code]
- Systematic study of VLM design choices informing OpenVLA and other robotics VLMs.
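A minimal sketch of how a frozen CLIP backbone scores image-text similarity, the operation most VLA perception stacks build on. It uses the Hugging Face `transformers` CLIP API; the checkpoint name, image path, and text candidates are illustrative placeholders.

```python
# Minimal CLIP image-text similarity sketch (assumes transformers, torch, and a local image).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # placeholder robot camera frame
candidates = ["a red block on a table", "an empty table", "a robot gripper holding a mug"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity distribution over the candidates
print(dict(zip(candidates, probs[0].tolist())))
```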
Visual Representations
Self-supervised visual encoders and perception models used in robotics.
- DINOv2: “DINOv2: Learning Robust Visual Features without Supervision”, arXiv, Apr 2023. [Paper] [Code]
- R3M: “R3M: A Universal Visual Representation for Robot Manipulation”, CoRL 2022. [Paper] [Code]
- MVP: “Masked Visual Pre-training for Motor Control”, arXiv, Mar 2022. [Paper] [Project]
- Grounding DINO: “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection”, ECCV 2024. [Paper] [Code]
VLA Architectures
End-to-End VLAs
Monolithic models that treat vision, language, and actions as unified tokens in a single architecture (a minimal action-binning sketch follows this list).
- RT-1: “RT-1: Robotics Transformer for Real-World Control at Scale”, RSS 2023. [Paper] [Project] [Code]
- Pioneering work showing that large-scale, multi-task robot data can train a single transformer for diverse manipulation tasks.
- RT-2: “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, CoRL 2023. [Paper] [Project]
- Established the VLA paradigm by co-fine-tuning VLMs on robotic data.
- OpenVLA: “OpenVLA: An Open-Source Vision-Language-Action Model”, CoRL 2024. [Paper] [Project] [Code]
- Open-source 7B model that outperformed the 55B RT-2-X, democratizing VLA research.
- PaLM-E: “PaLM-E: An Embodied Multimodal Language Model”, ICML 2023. [Paper] [Project]
- 562B parameter model demonstrating emergent multi-modal chain-of-thought reasoning.
- VIMA: “VIMA: General Robot Manipulation with Multimodal Prompts”, ICML 2023. [Paper] [Project] [Code]
- Introduced multimodal prompting (text + images) for specifying manipulation tasks.
- LEO: “An Embodied Generalist Agent in 3D World”, ICML 2024. [Paper] [Project]
- 3D-VLA: “3D-VLA: A 3D Vision-Language-Action Generative World Model”, ICML 2024. [Paper] [Project]
- Gato: “A Generalist Agent”, TMLR 2022. [Paper] [Blog]
- Single transformer handling 604 distinct tasks across games, chat, and robotics.
- RoboFlamingo: “Vision-Language Foundation Models as Effective Robot Imitators”, ICLR 2024. [Paper] [Project]
- Magma: “Magma: A Foundation Model for Multimodal AI Agents”, arXiv, Feb 2025. [Paper] [Code]
- RoboVLMs: “Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models”, arXiv, Dec 2024. [Paper] [Project]
- π0.5: “π0.5: A Vision-Language-Action Model with Open-World Generalization”, Physical Intelligence, Apr 2025. [Paper] [Project] [Code]
- π0.6: “π0.6: A VLA that Learns from Experience”, Physical Intelligence, 2025. [Blog]
- GR-3: “GR-3 Technical Report”, ByteDance Seed, Jul 2025. [Paper] [Project]
- UniVLA: “UniVLA: Unified Vision-Language-Action Model”, arXiv, Jun 2025. [Paper] [Code]
- TL;DR: Models vision, language, and actions as a single interleaved stream of discrete tokens (VQ image tokens + FAST/DCT action tokens) in an 8.5B autoregressive VLA. Two training stages: (1) post-train the VLM on text/images to predict future frames, (2) finetune with vision and action token prediction. Emphasis on the post-training stage to align VLM representations with robot tasks. Strong results on CALVIN, LIBERO, and SimplerEnv-Bridge.
- SpatialVLA: “SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model”, arXiv, Jan 2025. [Paper]
- AgiBot World: “AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems”, AgiBot, 2025. [Paper] [Project]
- EnerVerse: “EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation”, AgiBot, Jan 2025. [Paper]
- Genie Envisioner: “Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation”, AgiBot, 2025. [Paper] [Project] [Code]
- GraspVLA: “GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data”, CoRL 2025. [Paper] [Project] [Code]
- VLA-0: “VLA-0: Building State-of-the-Art VLAs with Zero Modification”, NVIDIA, 2025. [Paper] [Project] [Code]
- ThinkAct: “ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning”, NVIDIA, NeurIPS 2025. [Paper] [Project]
- OmniVLA: “OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception”, Microsoft Research, 2025. [Paper]
- V-JEPA 2: “V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning”, Meta, Jun 2025. [Paper] [Project] [Code]
- RoboBrain: “RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete”, CVPR 2025. [Paper] [Project]
- DexGraspVLA: “DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping”, PsiBot, 2025. [Paper] [Project]
- Hi Robot: “Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models”, Physical Intelligence, Feb 2025. [Paper] [Project]
- Motus: “Motus: A Unified Latent Action World Model”, arXiv, Dec 2024. [Paper] [Project] [Code]
- GR-RL: “GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation”, ByteDance Seed, Dec 2024. [Paper] [Project]
- StarVLA: “StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing”, arXiv, 2025. [Report] [Code]
- InternVLA-A1: “InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation”, arXiv, Jan 2026. [Paper] [Code]
- InternVLA-M1: “InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy”, arXiv, Oct 2025. [Paper] [Project] [Code]
- Cosmos Policy: “Fine-Tuning Video Models for Visuomotor Control and Planning”, ICLR 2026 Submission.
- TL;DR: Fine-tunes the NVIDIA Cosmos video foundation model for action prediction. Core idea: inject additional modalities such as future action chunks or value-function estimates into the latent token sequence. Good results on LIBERO with real-world comparisons against π0.5.
- Disentangled Robot Learning: “Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining”, ICLR 2026 Submission.
- TL;DR: Novel approach that pretrains separate forward and inverse dynamics models, then combines them in a second stage for coupled policy finetuning. Good results on CALVIN, decent on SIMPLER.
- XR-1: “Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations”, ICLR 2026 Submission.
- TL;DR: Introduces Unified Vision-Motion Codes (UVMC), a discrete latent representation jointly encoding visual dynamics and robotic motion using a dual-branch VQ-VAE with a shared codebook. Enables better co-pretraining from human and robot demonstrations. Tested against GR00T N1.5 and π0 with good real-world results.
- VLM4VLA: “Revisiting Vision-Language-Models in Vision-Language-Action Models”, ICLR 2026 Submission.
- TL;DR: Comprehensive comparison of VLMs as backbone choice for VLAs. Finds downstream VLA performance has no correlation with VLM performance on standard benchmarks. Important finding for VLA architecture design, though limited to benchmark setups without real robot results.
- FLOWER: “FLOWER: Flow-based VLA for CALVIN Benchmark”, arXiv, 2025.
- State-of-the-art on CALVIN benchmarks using flow-based action generation.
- EO-1: “EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control”, arXiv, Aug 2025. [Paper] [Project] [Code]
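RT-2 and OpenVLA emit actions as ordinary language-model tokens by mapping each continuous action dimension onto a fixed number of discrete bins. The sketch below shows that binning and de-binning step under illustrative assumptions (256 bins, actions normalized to [-1, 1]); it is not the exact scheme of any specific model above.

```python
# Hedged sketch: uniform per-dimension action binning, RT-2 / OpenVLA style in spirit.
import numpy as np

N_BINS = 256                      # illustrative; real models pick their own vocabulary size
LOW, HIGH = -1.0, 1.0             # assume actions normalized to [-1, 1] per dimension

def actions_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action vector to integer token ids in [0, N_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)            # -> [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning by taking each bin's center."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([0.12, -0.53, 0.98, 0.0, -1.0, 1.0, 0.25])   # e.g. 6-DoF delta pose + gripper
toks = actions_to_tokens(a)
print(toks, tokens_to_actions(toks))                       # round-trip error <= half a bin width
```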
Modular VLAs
Models that decouple cognition (VLM-based planning) from action (specialized motor modules); a minimal planner-policy loop sketch follows this list.
- CogACT: “CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action”, arXiv, Nov 2024. [Paper] [Project]
- Decouples high-level cognition from low-level action via Diffusion Action Transformer.
- Gemini Robotics: “Gemini Robotics: Bringing AI into the Physical World”, arXiv, Mar 2025. [Paper] [Blog]
- Introduces “Thinking Before Acting” with internal natural language reasoning.
- Helix: “Helix: A Vision-Language-Action Model for Generalist Humanoid Control”, arXiv, Apr 2025. [Paper]
- SayCan: “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, CoRL 2022. [Paper] [Project]
- First to combine LLM semantic knowledge with learned affordance functions.
- Code as Policies: “Code as Policies: Language Model Programs for Embodied Control”, arXiv, Sep 2022. [Paper] [Project]
- Seminal work showing LLMs can generate executable robot control code.
- SayPlan: “SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning”, CoRL 2023. [Paper] [Project]
- Inner Monologue: “Inner Monologue: Embodied Reasoning through Planning with Language Models”, CoRL 2022. [Paper] [Project]
- Pioneered closed-loop language feedback where robots verbalize observations.
- Instruct2Act: “Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions”, arXiv, May 2023. [Paper] [Code]
- TidyBot: “TidyBot: Personalized Robot Assistance with Large Language Models”, IROS 2023. [Paper] [Project]
- HybridVLA: “HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model”, arXiv, Mar 2025. [Paper] [Project]
- CoT-VLA: “CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models”, CVPR 2025. [Paper] [Project]
- OpenHelix: “OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model”, arXiv, May 2025. [Paper] [Project]
- OneTwoVLA: “OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning”, arXiv, May 2025. [Paper] [Project]
- Hume: “Hume: Introducing System-2 Thinking in Visual-Language-Action Model”, arXiv, May 2025. [Paper] [Project]
- RationalVLA: “RationalVLA: A Rational Vision-Language-Action Model with Dual System”, arXiv, Jun 2025. [Paper] [Project]
- Fast-in-Slow: “Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning”, arXiv, Jun 2025. [Paper] [Project]
- TriVLA: “TriVLA: A Triple-System-Based Unified Vision-Language-Action Model”, arXiv, Jul 2025. [Paper] [Project]
- DualVLA: “DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action”, arXiv, Nov 2025. [Paper] [Project]
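A minimal sketch of the dual-system control loop these modular VLAs share: a slow VLM planner emits the next subtask in language, and a fast low-level policy executes it until it reports completion. `plan_subtask` and `LowLevelPolicy` are placeholder stubs invented for this example, not APIs from any of the systems above.

```python
# Hedged sketch of a planner/controller split (dual-system VLA), with stub components.
import numpy as np

def plan_subtask(instruction: str, image: np.ndarray) -> str:
    """Stub for the slow VLM planner: returns the next subtask in natural language."""
    return "pick up the red block"              # placeholder; a real system queries a VLM here

class LowLevelPolicy:
    """Stub for the fast motor module (e.g. a diffusion or flow policy)."""
    def act(self, subtask: str, image: np.ndarray) -> np.ndarray:
        return np.zeros(7)                      # placeholder 7-DoF action
    def done(self, subtask: str, image: np.ndarray) -> bool:
        return True                             # placeholder termination check

def control_loop(instruction: str, get_image, send_action, max_steps: int = 10):
    policy = LowLevelPolicy()
    subtask = plan_subtask(instruction, get_image())      # slow loop (e.g. ~1 Hz)
    for _ in range(max_steps):
        image = get_image()
        send_action(policy.act(subtask, image))           # fast loop (e.g. 10-50 Hz)
        if policy.done(subtask, image):
            subtask = plan_subtask(instruction, image)    # replan when the subtask finishes

control_loop("tidy the table", get_image=lambda: np.zeros((224, 224, 3)),
             send_action=lambda a: None)
```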
Compact & Efficient VLAs
Lightweight VLA models optimized for fast inference and edge deployment.
- TinyVLA: “TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models”, arXiv, Apr 2025. [Paper] [Project]
- SmolVLA: “SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning”, arXiv, Jun 2025. [Paper] [Code]
- A 450M-parameter model that achieves performance comparable to models 10x its size.
- OpenVLA-OFT: “OpenVLA-OFT: Efficient Fine-Tuning for Open Vision-Language-Action Models”, arXiv, Mar 2025. [Paper]
- RT-H: “RT-H: Action Hierarchies Using Language”, arXiv, Mar 2024. [Paper] [Project]
- LAPA: “Latent Action Pretraining from Videos”, arXiv, Oct 2024. [Paper] [Project]
- BitVLA: “BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation”, arXiv, Jun 2025. [Paper] [Code]
- MoLe-VLA: “MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers”, arXiv, Mar 2025. [Paper] [Project]
- VLA-Cache: “VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching”, arXiv, Feb 2025. [Paper]
- NORA: “NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks”, arXiv, Apr 2025. [Paper] [Project]
- NORA-1.5: “NORA-1.5: A Vision-Language-Action Model Trained using World Model and Action-based Preference Rewards”, arXiv, Nov 2025. [Paper] [Project] [Code]
- CEED-VLA: “CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding”, arXiv, Jun 2025. [Paper] [Project]
- Running VLAs at Real-time Speed: arXiv, Oct 2025. [Paper]
- Cross-Platform Scaling of VLAs: “Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs”, arXiv, Sep 2025. [Paper]
- VLA-Adapter: “VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model”, arXiv, Sep 2025. [Paper] [Project]
Action Representation
Discrete Tokenization
Models that convert continuous joint movements into discrete “action tokens” (a DCT-based compression sketch follows this list).
- FAST: “FAST: Efficient Action Tokenization for Vision-Language-Action Models”, arXiv, Jan 2025. [Paper] [Project]
- Uses frequency-space (DCT) tokenization to compress action sequences 7x.
- GR-1: “Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation”, ICLR 2024. [Paper] [Project]
- GR-2: “GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge”, arXiv, Oct 2024. [Paper] [Project]
- ACT: “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware”, RSS 2023. [Paper] [Project] [Code]
- Introduced Action Chunking with Transformers for smooth bimanual manipulation.
- Behavior Transformers: “Behavior Transformers: Cloning k Modes with One Stone”, NeurIPS 2022. [Paper] [Code]
- FASTer: “FASTer: Toward Powerful and Efficient Autoregressive VLAs with Learnable Action Tokenizer and Block-Wise Decoding”, ICLR 2026 Submission.
- TL;DR: Novel discrete action tokenizer combining Residual Vector Quantization (RVQ) with frequency L1 loss (DCT) and time domain L1 loss. Patchifies action tokens along temporal and grouped action dimension axes (e.g., base motion, arm joints). Higher compression ratio than FAST with strong results on SIMPLER and LIBERO.
- OmniSAT: “OmniSAT: Compact Action Token, Faster Autoregression for Vision-Language-Action Models”, ICLR 2026 Submission.
- TL;DR: VLA tokenizer using B-Splines for compact action representation. Two-stage encoding: (1) aligns different action chunk lengths into normalized fixed-length representation, (2) B-Spline encoder for compact representation, then VQ-VAE for discrete tokens. Improves upon both FAST and BEAST across LIBERO and SIMPLER.
- VQ-VLA: “VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers”, ICCV 2025. [Paper] [Project]
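The sketch below illustrates the frequency-domain idea behind FAST-style tokenizers: transform an action chunk with the DCT, keep only the low-frequency coefficients, and quantize them. The chunk size, number of kept coefficients, and quantization step are illustrative placeholders, not FAST's actual hyperparameters.

```python
# Hedged sketch: DCT-based compression of an action chunk (FAST-style idea, simplified).
import numpy as np
from scipy.fft import dct, idct

def compress_chunk(chunk: np.ndarray, keep: int = 8, step: float = 0.05) -> np.ndarray:
    """chunk: (T, D) action chunk -> quantized low-frequency DCT coefficients of shape (keep, D)."""
    coeffs = dct(chunk, axis=0, norm="ortho")        # per-dimension DCT over the time axis
    return np.round(coeffs[:keep] / step).astype(int)

def decompress_chunk(q: np.ndarray, horizon: int, step: float = 0.05) -> np.ndarray:
    """Invert quantization, zero-pad the high frequencies, and apply the inverse DCT."""
    coeffs = np.zeros((horizon, q.shape[1]))
    coeffs[: q.shape[0]] = q * step
    return idct(coeffs, axis=0, norm="ortho")

chunk = np.cumsum(0.02 * np.random.randn(32, 7), axis=0)   # a smooth 32-step, 7-DoF chunk
q = compress_chunk(chunk)
recon = decompress_chunk(q, horizon=32)
print(q.shape, float(np.abs(chunk - recon).mean()))        # 4x fewer values per dimension
```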
Discrete Diffusion VLAs
Models using discrete diffusion for parallel action token generation instead of autoregressive decoding (a parallel masked-decoding sketch follows this list).
- Discrete Diffusion VLA: “Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies”, arXiv, Aug 2025. [Paper]
- TL;DR: Takes OpenVLA and applies discrete diffusion for fast action chunk-based generation of discrete action tokens. Proposes adaptive decoding for inference with strong results on LIBERO + SIMPLER.
- dVLA: “Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought”, ICLR 2026 Submission.
- TL;DR: Discrete diffusion VLA using co-generation of future frames, text, and actions, leveraging fast parallel sampling relative to AR models. Essentially ECoT + discrete diffusion done well, with good results on LIBERO plus real-world experiments.
- DIVA: “Discrete Diffusion Vision-Language-Action Models for Parallelized Action Generation”, ICLR 2026 Submission.
- TL;DR: Discrete diffusion VLA focusing on how to substitute tokens during inference for better performance through optimized token replacement strategies.
- Unified Diffusion VLA: “Unified Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process”, ICLR 2026 Submission.
- TL;DR: Generates future frames and discrete actions together with block-wise causal masking. Good results on CALVIN, LIBERO, and SIMPLER.
- LLaDA-VLA: “Vision Language Diffusion Action Models”, arXiv, Sep 2025. [Paper] [Project]
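A minimal sketch of the confidence-based parallel decoding loop these discrete-diffusion decoders share: start from all-masked action tokens, predict every position at once, commit the most confident predictions, and re-mask the rest for the next round. The stub `predict_logits` stands in for the VLA's token head and is not any model's real API.

```python
# Hedged sketch: iterative parallel decoding of masked action tokens (discrete-diffusion style).
import torch

VOCAB, MASK_ID, SEQ_LEN = 256, 256, 16      # illustrative sizes; MASK_ID is one extra token

def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stub for the VLA action head: per-position logits over the action vocabulary."""
    return torch.randn(tokens.shape[0], VOCAB)          # placeholder

def parallel_decode(rounds: int = 4) -> torch.Tensor:
    tokens = torch.full((SEQ_LEN,), MASK_ID)
    for r in range(rounds):
        logits = predict_logits(tokens)
        conf, pred = logits.softmax(-1).max(-1)         # per-position confidence and argmax
        masked = torch.where(tokens == MASK_ID)[0]
        if len(masked) == 0:
            break
        # commit a growing fraction of the still-masked positions each round
        k = max(1, int(len(masked) * (r + 1) / rounds))
        keep = masked[conf[masked].topk(min(k, len(masked))).indices]
        tokens[keep] = pred[keep]
    return tokens

print(parallel_decode())
```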
Continuous & Diffusion Policies
Models that use diffusion or flow matching to generate continuous trajectories (a flow-matching training sketch follows this list).
- π₀ (pi-zero): “π₀: A Vision-Language-Action Flow Model for General Robot Control”, arXiv, Oct 2024. [Paper] [Project]
- Uses flow matching to generate high-frequency (50 Hz) continuous actions for dexterous tasks.
- π₀.5: “π₀.5: Scaling Robot Foundation Models”, arXiv, Apr 2025. [Paper]
- Octo: “Octo: An Open-Source Generalist Robot Policy”, RSS 2024. [Paper] [Project] [Code]
- Diffusion Policy: “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, RSS 2023. [Paper] [Project] [Code]
- Foundational work showing diffusion models excel at capturing multimodal action distributions.
- RDT-1B: “RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation”, arXiv, Oct 2024. [Paper] [Project]
- DexVLA: “DexVLA: Vision-Language Model with Plug-In Diffusion Expert”, arXiv, Feb 2025. [Paper] [Project]
- Diffusion-VLA: “Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression”, arXiv, Dec 2024. [Paper] [Project]
- 3D Diffusion Policy: “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via 3D Representations”, RSS 2024. [Paper] [Project]
- Moto: “Moto: Latent Motion Token as the Bridging Language for Robot Manipulation”, arXiv, Dec 2024. [Paper] [Project]
- Consistency Policy: “Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation”, RSS 2024. [Paper] [Project]
- Distills diffusion policies into single-step models for 10x faster inference.
- Dita: “Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy”, arXiv, Mar 2025. [Paper] [Project]
- Real-Time Chunking: “Real-Time Execution of Action Chunking Flow Policies”, Physical Intelligence, NeurIPS 2025. [Paper] [Project]
- ManiFlow: “ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training”, CoRL 2025. [Paper] [Project] [Code]
- Unified Video Action Model: “Unified Video Action Model”, RSS 2025. [Paper] [Project]
- Streaming Flow Policy: “Streaming Flow Policy: Simplifying diffusion/flow-matching policies”, CoRL 2025 Oral. [Paper] [Project]
- FlowPolicy: “FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy”, AAAI 2025. [Paper] [Code]
- MoDE: “Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers”, ICLR 2025. [Paper] [Project]
- Reactive Diffusion Policy: “Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning”, RSS 2025. [Paper] [Project]
- VITA: “VITA: Vision-To-Action Flow Matching Policy”, arXiv, Jul 2025. [Paper] [Project]
- Chain-of-Action: “Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation”, ByteDance Seed, Jun 2025. [Paper] [Project]
- Hierarchical Diffusion Policy: “Hierarchical Diffusion Policy: Manipulation Trajectory Generation Via Contact Guidance”, TRO 2025. [Paper] [Code]
- Adapt3R: “Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning”, arXiv, Mar 2025. [Paper] [Project]
- 3D FlowMatch Actor: “3D FlowMatch Actor: Unified 3D Policy for Single and Dual-Arm Manipulation”, arXiv, Aug 2025. [Paper] [Project] [Code]
- CLAM: “CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations”, arXiv, May 2025. [Paper] [Project]
- H3DP: “H3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning”, arXiv, May 2025. [Paper] [Project]
- UniSkill: “UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations”, arXiv, May 2025. [Paper] [Project]
- Latent Action Diffusion: “Latent Action Diffusion for Cross-Embodiment Manipulation”, arXiv, Jun 2025. [Paper] [Project]
- Dex1B: “Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation”, RSS 2025. [Paper] [Project]
- DemoDiffusion: “DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy”, arXiv, Jun 2025. [Paper] [Project]
- One-Step Diffusion Policy: “One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation”, arXiv, Oct 2024. [Paper]
- GauDP: “GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies”, NeurIPS 2025. [Paper] [Project]
- DiWA: “DiWA: Diffusion Policy Adaptation with World Models”, CoRL 2025. [Paper] [Project] [Code]
- GPC: “Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition”, arXiv, Oct 2025. [Paper] [Project] [Code]
- TL;DR: Composes flow/diffusion-based VLA policies at test time using convex optimization and test-time search to combine scores from multiple policies. Improves performance without additional training by leveraging energy-based formulation that allows summing model scores.
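A minimal sketch of the (rectified) flow-matching objective behind π₀-style action heads: sample a time t, linearly interpolate between Gaussian noise and the ground-truth action chunk, and regress the constant velocity pointing from noise to data. The tiny MLP and tensor shapes are illustrative only, not any published architecture.

```python
# Hedged sketch: one flow-matching training step for a continuous action head.
import torch
import torch.nn as nn

HORIZON, ACT_DIM, OBS_DIM = 16, 7, 64     # illustrative sizes

class VelocityNet(nn.Module):
    """Toy stand-in for an action expert: predicts velocity from obs, noisy actions, and t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + HORIZON * ACT_DIM + 1, 256), nn.ReLU(),
            nn.Linear(256, HORIZON * ACT_DIM),
        )

    def forward(self, obs, noisy_actions, t):
        x = torch.cat([obs, noisy_actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, HORIZON, ACT_DIM)

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

obs = torch.randn(32, OBS_DIM)                  # placeholder observation embeddings
actions = torch.randn(32, HORIZON, ACT_DIM)     # ground-truth action chunks
noise = torch.randn_like(actions)
t = torch.rand(32)                              # t ~ U(0, 1)
x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * actions
target_velocity = actions - noise               # velocity of the straight noise-to-data path

loss = ((model(obs, x_t, t) - target_velocity) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```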
World Models
JEPA & Latent Prediction
Joint-Embedding Predictive Architecture (JEPA) predicts future latent states rather than pixels (a minimal latent-prediction sketch follows this list).
- “A Path Towards Autonomous Machine Intelligence”, Meta AI, Jun 2022. [Paper]
- LeCun’s position paper outlining a cognitive architecture with a learned world model at its core.
- I-JEPA: “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture”, CVPR 2023. [Paper] [Code]
- V-JEPA: “Video Joint Embedding Predictive Architecture”, arXiv, Feb 2024. [Paper] [Code]
- MC-JEPA: “MC-JEPA: Self-Supervised Learning of Motion and Content Features”, CVPR 2023. [Paper]
- LeJEPA: “LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics”, arXiv, Nov 2025. [Paper]
- VL-JEPA: “VL-JEPA: Vision-Language Joint Embedding Predictive Architecture”, arXiv, Dec 2025. [Paper]
- “Value-guided Action Planning with JEPA World Models”, arXiv, Jan 2026. [Paper]
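A minimal sketch of the JEPA training signal: a context encoder plus predictor regresses the embedding produced by a target encoder (an EMA copy through which no gradient flows). The encoders are toy MLPs and the masking and architectural details of I-JEPA/V-JEPA are omitted; the momentum value is a common choice, not a JEPA constant.

```python
# Hedged sketch: JEPA-style latent prediction loss with an EMA target encoder.
import copy
import torch
import torch.nn as nn

DIM = 128
encoder = nn.Sequential(nn.Linear(DIM, 256), nn.ReLU(), nn.Linear(256, 64))
predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
target_encoder = copy.deepcopy(encoder)          # EMA copy, never updated by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

context = torch.randn(32, DIM)                   # e.g. embedding input for the visible frames
target = torch.randn(32, DIM)                    # e.g. the masked region or a future frame

pred = predictor(encoder(context))               # predict the target *in latent space*
with torch.no_grad():
    tgt = target_encoder(target)
loss = nn.functional.smooth_l1_loss(pred, tgt)
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                            # EMA update of the target encoder
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(0.996).add_(p, alpha=0.004)
print(float(loss))
```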
Generative World Models
World models that generate pixels, video, or interactive environments.
- World Models: “World Models”, NeurIPS 2018. [Paper] [Project]
- Seminal Ha & Schmidhuber work popularizing world models for RL.
- DreamerV3: “Mastering Diverse Domains through World Models”, arXiv, Jan 2023. [Paper] [Project]
- State-of-the-art world model RL agent mastering 150+ tasks.
- Genie: “Genie: Generative Interactive Environments”, ICML 2024. [Paper] [Project]
- Learns interactive world models from unlabeled videos.
- Genie 2: “Genie 2: A Large-Scale Foundation World Model”, DeepMind, Dec 2024. [Blog]
- Generates diverse, playable 3D worlds from single images.
- Sora: “Video Generation Models as World Simulators”, OpenAI, Feb 2024. [Blog]
- GAIA-1: “GAIA-1: A Generative World Model for Autonomous Driving”, arXiv, Sep 2023. [Paper]
- GameNGen: “Diffusion Models Are Real-Time Game Engines”, arXiv, Aug 2024. [Paper]
- DIAMOND: “Diffusion for World Modeling: Visual Details Matter in Atari”, NeurIPS 2024. [Paper] [Code]
- 3D Gaussian Splatting: “3D Gaussian Splatting for Real-Time Radiance Field Rendering”, SIGGRAPH 2023. [Paper] [Project]
- “From Words to Worlds: Spatial Intelligence is AI’s Next Frontier”, World Labs, 2025. [Blog]
- Fei-Fei Li’s manifesto on generative, multimodal, actionable world models.
- Marble: “Marble: A Multimodal World Model”, World Labs, Nov 2025. [Project]
- RTFM: “RTFM: A Real-Time Frame Model”, World Labs, Oct 2025. [Project]
Embodied World Models
World models designed for robotic manipulation, navigation, and physical reasoning.
- Structured World Models: “Structured World Models from Human Videos”, RSS 2023. [Paper] [Project]
- WHALE: “WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making”, arXiv, Nov 2024. [Paper]
- “A Controllable Generative World Model for Robot Manipulation”, arXiv, Oct 2025. [Paper]
- Code World Model: “Code World Model: Learning to Execute Code in World Simulation”, Meta AI, Oct 2025. [Paper]
- PhyGDPO: “PhyGDPO: Physics-Aware Text-to-Video Generation via Direct Preference Optimization”, Meta AI, Jan 2026. [Paper]
- “The Essential Role of Causality in Foundation World Models for Embodied AI”, arXiv, Feb 2024. [Paper]
- MineDreamer: “MineDreamer: Learning to Follow Instructions via Chain-of-Imagination”, arXiv, Mar 2024. [Paper] [Project]
- Video Language Planning: “Video Language Planning”, ICLR 2024. [Paper] [Project]
- “Learning Universal Policies via Text-Guided Video Generation”, NeurIPS 2023. [Paper] [Project]
- SIMA: “Scaling Instructable Agents Across Many Simulated Worlds”, arXiv, Mar 2024. [Paper] [Blog]
- UniSim: “UniSim: Learning Interactive Real-World Simulators”, ICLR 2024. [Paper] [Project]
Reasoning & Planning
Chain-of-Thought & Deliberation
Models implementing “thinking before acting” with explicit reasoning or value-guided search.
- Hume: “Hume: Introducing Deliberative Alignment in Embodied AI”, arXiv, May 2025. [Paper]
- Embodied-CoT: “Robotic Control via Embodied Chain-of-Thought Reasoning”, arXiv, Jul 2024. [Paper] [Project]
- ReAct: “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023. [Paper] [Code]
- ReKep: “ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints”, CoRL 2024. [Paper] [Project]
- TraceVLA: “TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness”, arXiv, Dec 2024. [Paper] [Project]
- LLM-State: “LLM-State: Open World State Representation for Long-horizon Task Planning”, arXiv, Nov 2023. [Paper]
- Statler: “Statler: State-Maintaining Language Models for Embodied Reasoning”, ICRA 2024. [Paper] [Project]
- RoboReflect: “RoboReflect: Reflective Reasoning for Robot Manipulation”, arXiv, 2025. [Paper]
- Cosmos-Reason1: “Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning”, NVIDIA, Mar 2025. [Paper] [Code]
- EmbodiedVSR: “EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks”, arXiv, Mar 2025. [Paper]
- Reflective Planning: “Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation”, arXiv, Feb 2025. [Paper]
- Embodied-Reasoner: “Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks”, arXiv, Mar 2025. [Paper] [Project]
- Embodied-R: “Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning via Reinforcement Learning”, arXiv, Apr 2025. [Paper] [Project]
- RoBridge: “RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation”, arXiv, May 2025. [Paper] [Project]
- Visual Embodied Brain: “Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces”, arXiv, Jun 2025. [Paper] [Code]
- From Seeing to Doing: “From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation”, arXiv, May 2025. [Paper] [Project]
- Actions as Language: “Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting”, ICLR 2026 Submission.
- TL;DR: Instead of directly fine-tuning VLMs with discrete action tokens (which causes catastrophic forgetting), relabels robot datasets with subtasks, actions as text, and intermediate motion-planning phrases like “move left”. Bridges the VLM domain gap without reducing VQA benchmark performance. Cheap LoRA finetuning achieves strong action prediction while maintaining VLM reasoning (see the relabeling sketch after this list).
- InstructVLA: “Vision-Language-Action Instruction Tuning: From Understanding to Manipulation”, arXiv, Jul 2025. [Paper] [Project]
- TL;DR: Two-stage Vision-Language-Action Instruction Tuning pipeline: (1) pretrain action expert and latent action interface, (2) instruction-tune MoE-adapted VLM to switch between textual reasoning and latent action generation. Decouples multimodal reasoning from action generation to avoid catastrophic forgetting. Introduces instruction-based SIMPLER benchmark.
- Hybrid ECoT Training: “Hybrid Training for Vision-Language-Action Models”, ICLR 2026 Submission.
- TL;DR: Decomposes ECoT pretraining into think/act/follow subtasks that maintain performance benefits while enabling fast inference. Shows co-training with ECoT objectives results in better representations for action prediction.
- HAMLET: “Switch Your Vision-Language-Action Model into a History-Aware Policy”, ICLR 2026 Submission.
- TL;DR: Plug-and-play memory module with moment tokens to capture temporal information from prior timesteps. Memory module aggregates tokens over time for history-conditioned prediction, addressing the limitation that most VLAs only encode current images.
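A toy illustration of the actions-as-language idea from the entry above: instead of emitting opaque action tokens, the model emits short motion phrases that a deterministic parser converts back into deltas. The phrase format and the 5 cm step interpretation are invented for this example, not taken from the paper.

```python
# Hedged sketch: round-tripping between a translation delta and a motion phrase.
import numpy as np

DIRECTIONS = {
    "left": np.array([0.0, -1.0, 0.0]), "right": np.array([0.0, 1.0, 0.0]),
    "forward": np.array([1.0, 0.0, 0.0]), "backward": np.array([-1.0, 0.0, 0.0]),
    "up": np.array([0.0, 0.0, 1.0]), "down": np.array([0.0, 0.0, -1.0]),
}

def action_to_text(delta_xyz: np.ndarray) -> str:
    """Describe the dominant axis of a translation delta (meters) as a motion phrase."""
    axis = int(np.argmax(np.abs(delta_xyz)))
    sign = np.sign(delta_xyz[axis])
    name = [k for k, v in DIRECTIONS.items() if v[axis] == sign][0]
    return f"move {name} {abs(delta_xyz[axis]) * 100:.0f} cm"

def text_to_action(phrase: str) -> np.ndarray:
    """Parse a phrase like 'move left 5 cm' back into a translation delta (meters)."""
    _, name, amount, _ = phrase.split()
    return DIRECTIONS[name] * float(amount) / 100

print(action_to_text(np.array([0.0, -0.05, 0.01])))   # -> "move left 5 cm"
print(text_to_action("move left 5 cm"))
```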
Error Detection & Recovery
Methods for detecting failures and correcting robot actions in real-time.
- DoReMi: “Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment”, arXiv, Jul 2023. [Paper] [Project]
- CoPAL: “Corrective Planning of Robot Actions with Large Language Models”, ICRA 2024. [Paper] [Project]
- Code-as-Monitor: “Code-as-Monitor: Constraint-aware Visual Programming for Failure Detection”, CVPR 2025. [Paper] [Project]
- AHA: “AHA: A Vision-Language-Model for Detecting and Reasoning over Failures”, arXiv, Oct 2024. [Paper]
- PRED: “Pre-emptive Action Revision by Environmental Feedback”, CoRL 2024. [Paper]
Learning Paradigms
Imitation Learning
Behavioral cloning and learning from demonstrations (a minimal behavioral-cloning step follows this list).
- CLIPort: “CLIPort: What and Where Pathways for Robotic Manipulation”, CoRL 2021. [Paper] [Project] [Code]
- Play-LMP: “Learning Latent Plans from Play”, CoRL 2019. [Paper] [Project]
- Learns reusable skills from unstructured “play” data without task labels.
- MimicPlay: “MimicPlay: Long-Horizon Imitation Learning by Watching Human Play”, CoRL 2023. [Paper] [Project]
- RVT: “RVT: Robotic View Transformer for 3D Object Manipulation”, CoRL 2023. [Paper] [Project] [Code]
- RVT-2: “RVT-2: Learning Precise Manipulation from Few Demonstrations”, RSS 2024. [Paper] [Project]
- DIAL: “Robotic Skill Acquisition via Instruction Augmentation”, arXiv, Nov 2022. [Paper] [Project]
- Perceiver-Actor: “A Multi-Task Transformer for Robotic Manipulation”, CoRL 2022. [Paper] [Project] [Code]
- BOSS: “Bootstrap Your Own Skills: Learning to Solve New Tasks with LLM Guidance”, CoRL 2023. [Paper] [Project]
- Phantom: “Phantom: Training Robots Without Robots Using Only Human Videos”, arXiv, Mar 2025. [Paper] [Project]
- ZeroMimic: “ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos”, arXiv, Mar 2025. [Paper] [Project]
- Human2Robot: “Human2Robot: Learning Robot Actions from Paired Human-Robot Videos”, arXiv, Feb 2025. [Paper]
- One-Shot Dual-Arm Imitation Learning: arXiv, Mar 2025. [Paper] [Project]
- DataMIL: “DataMIL: Selecting Data for Robot Imitation Learning with Datamodels”, arXiv, May 2025. [Paper] [Project]
- In-Context Imitation Learning: “In-Context Imitation Learning via Next-Token Prediction”, arXiv, Aug 2024. [Paper] [Project]
- RILe: “RILe: Reinforced Imitation Learning”, arXiv, Mar 2025. [Paper]
- CLASS: “CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation”, CoRL 2025. [Paper] [Project] [Code]
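A minimal behavioral-cloning step, the baseline most of the methods above build on: regress demonstrated actions from observations with an MSE (or similar) loss. The tiny policy network and random tensors are placeholders for a real dataset and visual encoder.

```python
# Hedged sketch: one behavioral-cloning gradient step on (observation, action) pairs.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 64, 7
policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(128, OBS_DIM)            # placeholder demo observations (e.g. image features)
expert_actions = torch.randn(128, ACT_DIM) # placeholder demonstrated actions

loss = nn.functional.mse_loss(policy(obs), expert_actions)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```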
Reinforcement Learning
RL-based methods for optimizing VLA policies.
- CO-RFT: “CO-RFT: Chunked Offline Reinforcement Learning Fine-Tuning for VLAs”, arXiv, 2026. [Paper]
- Two-stage offline RL achieving a 57% improvement over imitation learning.
- HICRA: “HICRA: Hierarchy-Aware Credit Assignment for Reinforcement Learning in VLAs”, arXiv, 2026. [Paper]
- Focuses optimization on “planning tokens” rather than execution tokens.
- FLaRe: “FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale RL Fine-Tuning”, arXiv, Sep 2024. [Paper] [Project]
- Plan-Seq-Learn: “Plan-Seq-Learn: Language Model Guided RL for Long Horizon Tasks”, ICLR 2024. [Paper] [Project]
- GLAM: “Grounding Large Language Models in Interactive Environments with Online RL”, arXiv, Feb 2023. [Paper] [Code]
- ELLM: “Guiding Pretraining in Reinforcement Learning with Large Language Models”, ICML 2023. [Paper]
- RL4VLA: “RL4VLA: What Can RL Bring to VLA Generalization?”, NeurIPS 2025. [Paper]
- TPO: “TPO: Trajectory-wise Preference Optimization for VLAs”, arXiv, 2025. [Paper]
- ReinboT: “ReinboT: Reinforcement Learning for Robotic Manipulation”, arXiv, 2025. [Paper]
- VLA-RL: “VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable RL”, arXiv, May 2025. [Paper] [Code]
- SimpleVLA-RL: “SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning”, arXiv, Sep 2025. [Paper] [Code]
- ConRFT: “ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy”, RSS 2025. [Paper] [Project]
- VLA-RFT: “VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards”, arXiv, Oct 2025. [Paper] [Project] [Code]
- RLinf-VLA: “RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training”, arXiv, Oct 2025. [Paper] [Code]
- RoboMonkey: “RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models”, CoRL 2025. [Paper] [Project] [Code]
- Embodied-R1: “Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation”, arXiv, Aug 2025. [Paper] [Project] [Code]
- TL;DR: Pointing VLM for embodied reasoning trained via two-stage Reinforced Fine-Tuning (RFT) curriculum on Embodied-Points-200K dataset. Uses embodiment-agnostic intermediates: REG (point to referred object), RRG (point to relation-defined place), OFG (point to functional part like handle), VTG (output point sequence as visual trace). Strong embodied benchmark performance with good SIMPLER generalization as planner.
- VLA-Reasoner: “VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search”, arXiv, Sep 2025. [Paper]
- Residual RL for VLAs: “Self-Improving Vision-Language-Action Models with Data Generation via Residual RL”, ICLR 2026 Submission.
- TL;DR: Residual RL method collecting data with a frozen VLA and a small residual policy. Residual interventions provide high-quality data with recovery behavior; the VLA is then fine-tuned on this data with SFT. Achieves 99% on LIBERO (see the residual-composition sketch after this list).
- STA-PPO/TPO: “Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models”, ICLR 2026 Submission.
- TL;DR: Breaks robot tasks into semantic stages (Reach→Grasp→Transport→Place) and assigns rewards to each stage instead of whole trajectory. Uses STA-TPO for offline preference learning and STA-PPO for online RL, both at stage level. Achieves 98% on Bridge SIMPLER.
- Verifier-free Test-Time Sampling: “Verifier-free Test-Time Sampling for Vision Language Action Models”, arXiv, Oct 2025. [Paper]
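A small sketch of the residual-policy composition described in the residual-RL entry above: a frozen base VLA proposes an action, a lightweight residual policy adds a bounded correction, and the sum is executed (and logged for later SFT). Both policies are stubs and the clipping ranges are illustrative.

```python
# Hedged sketch: residual correction on top of a frozen base VLA policy.
import numpy as np

def base_vla_action(obs: np.ndarray) -> np.ndarray:
    """Stub for the frozen base VLA: proposes a 7-DoF action."""
    return np.zeros(7)                                  # placeholder

def residual_action(obs: np.ndarray, base: np.ndarray) -> np.ndarray:
    """Stub for the small residual policy trained with RL; it also sees the base proposal."""
    return 0.05 * np.random.randn(7)                    # placeholder bounded correction

def act(obs: np.ndarray, max_residual: float = 0.1) -> np.ndarray:
    base = base_vla_action(obs)
    delta = np.clip(residual_action(obs, base), -max_residual, max_residual)
    return np.clip(base + delta, -1.0, 1.0)             # executed and logged for later SFT

print(act(np.zeros(16)))
```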
Reward Design
Automated reward function generation using language models.
- Text2Reward: “Text2Reward: Automated Dense Reward Function Generation”, arXiv, Sep 2023. [Paper] [Project]
- Language to Rewards: “Language to Rewards for Robotic Skill Synthesis”, CoRL 2023. [Paper] [Project]
- ExploRLLM: “ExploRLLM: Guiding Exploration in Reinforcement Learning with LLMs”, arXiv, Mar 2024. [Paper]
Scaling & Generalization
Scaling Laws
Mathematical relationships between model/data scale and robotic performance (a power-law fitting sketch follows this list).
- “Neural Scaling Laws for Embodied AI”, arXiv, May 2024. [Paper]
- “Data Scaling Laws in Imitation Learning for Robotic Manipulation”, arXiv, Oct 2024. [Paper] [Project]
- AutoRT: “AutoRT: Embodied Foundation Models for Large Scale Orchestration”, ICRA 2024. [Paper] [Project]
- SARA-RT: “SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention”, arXiv, Dec 2023. [Paper]
- “Scaling Robot Learning with Semantically Imagined Experience”, RSS 2023. [Paper]
Cross-Embodiment Transfer
Single policies controlling diverse robot types.
- RT-X: “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”, ICRA 2024. [Paper] [Project]
- GENBOT-1K: “Towards Embodiment Scaling Laws: Training on ~1000 Robot Bodies”, arXiv, 2025. [Paper]
- Training on ~1,000 robot bodies enables zero-shot transfer to unseen robots.
- Crossformer: “Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion”, CoRL 2024. [Paper] [Project]
- Single policy controlling manipulators, legged robots, and drones.
- HPT: “Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers”, NeurIPS 2024. [Paper] [Project]
- MetaMorph: “MetaMorph: Learning Universal Controllers with Transformers”, ICLR 2022. [Paper] [Project]
- RUMs: “Robot Utility Models: General Policies for Zero-Shot Deployment”, arXiv, Sep 2024. [Paper] [Project]
- URMA: “Unified Robot Morphology Architecture”, arXiv, 2025. [Paper]
- RoboAgent: “RoboAgent: Generalization and Efficiency via Semantic Augmentations”, ICRA 2024. [Paper] [Project]
- X-VLA: “Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”, arXiv, Oct 2025. [Paper] [Project] [Code]
- TL;DR: Tackles cross-action-space learning using soft-prompt tokens for different datasets; the soft prompts act as learnable readout tokens for the VLA. Strong results on LIBERO, CALVIN, SIMPLER, RoboTwin, and VLABench. Includes an insightful scaling analysis ablating pretraining design decisions.
- HiMoE-VLA: “Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies”, arXiv, Dec 2025. [Paper] [Code]
- TL;DR: Substitutes Pi-style action expert with Hierarchical Mixture-of-Experts Transformer for better embodiment adaptation. Interleaves standard blocks with Action-Space MoEs and Heterogeneity Balancing MoEs to handle different action spaces. Improves upon Pi0 across experiments.
- D2E: “Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI”, arXiv, Oct 2025. [Paper] [Project] [Code]
Open-Vocabulary Generalization
Models that generalize to novel visual appearances and semantic concepts.
- MOO: “Open-World Object Manipulation using Pre-trained Vision-Language Models”, CoRL 2023. [Paper] [Project]
- VoxPoser: “VoxPoser: Composable 3D Value Maps for Robotic Manipulation”, CoRL 2023. [Paper] [Project]
- Generates 3D affordance and constraint maps from language for zero-shot manipulation.
- RoboPoint: “RoboPoint: A Vision-Language Model for Spatial Affordance Prediction”, CoRL 2024. [Paper] [Project]
- CLIP-Fields: “CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory”, RSS 2023. [Paper] [Project]
- VLMaps: “Visual Language Maps for Robot Navigation”, ICRA 2023. [Paper] [Project]
- NLMap: “Open-vocabulary Queryable Scene Representations”, ICRA 2023. [Paper] [Project]
- LERF: “LERF: Language Embedded Radiance Fields”, ICCV 2023. [Paper] [Project]
- Any-point Trajectory: “Any-point Trajectory Modeling for Policy Learning”, RSS 2024. [Paper] [Project]
Deployment
Quantization & Compression
Low-bit weight quantization for efficient edge deployment (a per-channel int8 sketch follows this list).
- BitVLA: “BitVLA: 1-bit Vision-Language-Action Models for Robotics”, arXiv, 2025. [Paper]
- DeeR-VLA: “DeeR-VLA: Dynamic Inference of Multimodal LLMs for Efficient Robot Execution”, arXiv, Nov 2024. [Paper] [Code]
- QuaRT-VLA: “Quantized Robotics Transformers for Vision-Language-Action Models”, arXiv, 2025. [Paper]
- PDVLA: “PDVLA: Parallel Decoding for Vision-Language-Action Models”, arXiv, 2025. [Paper]
- HyperVLA: “Efficient Inference in Vision-Language-Action Models via Hypernetworks”, ICLR 2026 Submission.
- TL;DR: Uses hypernetworks to generate small task-specific policies conditioned on language instructions and initial images. Dramatically reduces inference cost by only activating the compact generated policy during execution instead of the full VLA model.
- AutoQVLA: “Not All Channels Are Equal in Vision-Language-Action Model’s Quantization”, ICLR 2026 Submission.
- TL;DR: Analyzes quantization of OpenVLA and proposes improved quantization method that maintains performance with only 30% of the original VRAM requirements through channel-aware optimization.
- SpecPrune-VLA: “Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning”, arXiv, Sep 2025. [Paper]
- RLRC: “Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models”, arXiv, Jun 2025. [Paper] [Project]
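A minimal sketch of symmetric per-channel int8 weight quantization, the kind of channel-aware scheme the AutoQVLA entry above analyzes: each output channel gets its own scale so that channels with large dynamic range do not dominate the error. The weight matrix here is random and the scheme is generic, not AutoQVLA's method.

```python
# Hedged sketch: symmetric per-channel (per-output-row) int8 weight quantization.
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """w: (out_features, in_features) -> int8 weights plus one fp32 scale per output channel."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)              # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = (np.random.randn(512, 1024) * np.linspace(0.1, 2.0, 512)[:, None]).astype(np.float32)
q, scale = quantize_per_channel(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(q.dtype, scale.shape, f"mean abs error = {err:.4f}")  # ~4x smaller than fp32 weights
```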
Real-Time Control
Methods bridging high-latency AI inference and low-latency physical control (an asynchronous chunk-execution sketch follows this list).
- A2C2: “A2C2: Asynchronous Action Chunk Correction for Real-Time Robot Control”, arXiv, 2025. [Paper]
- RTC: “Real-Time Chunking: Asynchronous Execution for Robot Control”, arXiv, 2025. [Paper]
Safety & Alignment
Ethical constraints, safety frameworks, and human-robot alignment.
- Robot Constitution: “Gemini Robotics: Bringing AI into the Physical World”, arXiv, Mar 2025. [Paper]
- Introduces data-driven “Robot Constitution” with natural language rules for safe behavior.
- ASIMOV: “ASIMOV: A Safety Benchmark for Embodied AI”, arXiv, Mar 2025. [Paper]
- RoboPAIR: “Jailbreaking LLM-Controlled Robots”, ICRA 2025. [Paper] [Project]
- RoboGuard: “Safety Guardrails for LLM-Enabled Robots”, arXiv, Apr 2025. [Paper]
- “Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics”, arXiv, Feb 2024. [Paper]
- “Robots Enact Malignant Stereotypes”, FAccT 2022. [Paper] [Project]
- First study showing robots inherit harmful biases from vision-language pretraining.
- “LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions”, arXiv, Jun 2024. [Paper]
- “Safe LLM-Controlled Robots with Formal Guarantees via Reachability Analysis”, arXiv, Mar 2025. [Paper]
Lifelong Learning
Agents that continuously learn and adapt without forgetting prior skills.
- Voyager: “VOYAGER: An Open-Ended Embodied Agent with Large Language Models”, arXiv, May 2023. [Paper] [Project] [Code]
- First LLM-powered agent in Minecraft autonomously building a skill library.
- RoboGen: “RoboGen: A Generative and Self-Guided Robotic Agent”, arXiv, Nov 2023. [Paper] [Project]
- RoboCat: “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation”, arXiv, Jun 2023. [Paper] [Blog]
- LOTUS: “LOTUS: Continual Imitation Learning via Unsupervised Skill Discovery”, arXiv, Dec 2024. [Paper] [Project]
- DEPS: “Describe, Explain, Plan and Select: Interactive Planning with LLMs for Open-World Agents”, NeurIPS 2023. [Paper] [Code]
- JARVIS-1: “JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal LLMs”, arXiv, Nov 2023. [Paper] [Project]
- MP5: “MP5: A Multi-modal Open-ended Embodied System via Active Perception”, CVPR 2024. [Paper] [Project]
- SPRINT: “SPRINT: Semantic Policy Pre-training via Language Instruction Relabeling”, ICRA 2024. [Paper] [Project]
Applications
Humanoid Robots
Foundation models for humanoid robot control.
- GR00T N1: “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots”, arXiv, Mar 2025. [Paper] [Project]
- HumanPlus: “HumanPlus: Humanoid Shadowing and Imitation from Humans”, arXiv, Jun 2024. [Paper] [Project]
- ExBody: “Expressive Whole-Body Control for Humanoid Robots”, RSS 2024. [Paper] [Project]
- H2O: “Learning Human-to-Humanoid Real-Time Whole-Body Teleoperation”, IROS 2024. [Paper] [Project]
- OmniH2O: “OmniH2O: Universal Human-to-Humanoid Teleoperation and Learning”, CoRL 2024. [Paper] [Project]
- “Learning Humanoid Locomotion with Transformers”, arXiv, Mar 2024. [Paper] [Project]
Manipulation
Robot manipulation with foundation models.
- Scaling Up Distilling Down: “Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition”, CoRL 2023. [Paper] [Project]
- LLM3: “LLM3: Large Language Model-based Task and Motion Planning with Failure Reasoning”, IROS 2024. [Paper]
- ManipVQA: “ManipVQA: Injecting Robotic Affordance into Multi-Modal LLMs”, IROS 2024. [Paper]
- UniAff: “UniAff: A Unified Representation of Affordances for Tool Usage and Articulation”, arXiv, Sep 2024. [Paper]
- SKT: “SKT: State-Aware Keypoint Trajectories for Robotic Garment Manipulation”, arXiv, Sep 2024. [Paper]
- Manipulate-Anything: “Manipulate-Anything: Automating Real-World Robots using VLMs”, CoRL 2024. [Paper] [Project]
- A3VLM: “A3VLM: Actionable Articulation-Aware Vision Language Model”, CoRL 2024. [Paper]
- LaN-Grasp: “Language-Driven Grasp Detection”, CVPR 2024. [Paper]
- Grasp Anything: “Pave the Way to Grasp Anything: Transferring Foundation Models”, arXiv, Jun 2023. [Paper]
Navigation
Vision-language models for robot navigation.
- LM-Nav: “Robotic Navigation with Large Pre-Trained Models”, CoRL 2022. [Paper] [Project]
- NaVILA: “NaVILA: Legged Robot Vision-Language-Action Model for Navigation”, arXiv, Dec 2024. [Paper] [Project]
- CoW: “CLIP on Wheels: Zero-Shot Object Navigation”, ICRA 2023. [Paper]
- L3MVN: “L3MVN: Leveraging Large Language Models for Visual Target Navigation”, IROS 2024. [Paper]
- NaVid: “NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation”, RSS 2024. [Paper] [Project]
- OVSG: “Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs”, CoRL 2023. [Paper] [Project]
- CANVAS: “CANVAS: Commonsense-Aware Navigation System”, ICRA 2025. [Paper]
- VLN-BERT: “Improving Vision-and-Language Navigation with Image-Text Pairs from the Web”, ECCV 2020. [Paper]
- ThinkBot: “ThinkBot: Embodied Instruction Following with Thought Chain Reasoning”, arXiv, Dec 2023. [Paper]
- ApexNav: “ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation”, RA-L 2025. [Paper] [Project]
- LOVON: “LOVON: Legged Open-Vocabulary Object Navigator”, arXiv, Jul 2025. [Paper] [Project]
- Multimodal Spatial Language Maps: “Multimodal Spatial Language Maps for Robot Navigation and Manipulation”, IJRR 2025. [Paper] [Project]
- Learned Perceptive Forward Dynamics Model: “Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation”, arXiv, Apr 2025. [Paper] [Code]
- VL-Nav: “VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning”, arXiv, Feb 2025. [Paper]
- TRAVEL: “TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation”, arXiv, Feb 2025. [Paper]
- VR-Robo: “VR-Robo: A Real-to-Sim-to-Real Framework for Visual Robot Navigation and Locomotion”, arXiv, Feb 2025. [Paper]
- NavigateDiff: “NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants”, arXiv, Feb 2025. [Paper]
- MapNav: “MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based VLN”, arXiv, Feb 2025. [Paper]
- OpenFly: “OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation”, arXiv, Feb 2025. [Paper]
- WMNav: “WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation”, arXiv, Mar 2025. [Paper] [Project]
- SmartWay: “SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation”, arXiv, Mar 2025. [Paper]
- UniGoal: “UniGoal: Towards Universal Zero-shot Goal-oriented Navigation”, arXiv, Mar 2025. [Paper] [Project]
- P3Nav: “P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction”, arXiv, Mar 2025. [Paper]
- ForesightNav: “ForesightNav: Learning Scene Imagination for Efficient Exploration”, arXiv, Apr 2025. [Paper] [Code]
- CityNavAgent: “CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning”, arXiv, May 2025. [Paper] [Code]
- NavDP: “NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance”, arXiv, May 2025. [Paper]
- OctoNav: “OctoNav: Towards Generalist Embodied Navigation”, arXiv, Jun 2025. [Paper] [Project]
- BeliefMapNav: “BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation”, arXiv, Jun 2025. [Paper] [Code]
- TopV-Nav: “TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation”, arXiv, Nov 2024. [Paper]
- CorrectNav: “CorrectNav: Self-Correction Flywheel Empowers Vision-Language-Action Navigation Model”, arXiv, Aug 2025. [Paper] [Project]
- GC-VLN: “GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation”, CoRL 2025. [Paper] [Project] [Code]
- NavFoM: “Embodied Navigation Foundation Model”, arXiv, Sep 2025. [Paper] [Project]
- Search-TTA: “Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild”, CoRL 2025. [Paper] [Project] [Code]
- JanusVLN: “JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation”, arXiv, Sep 2025. [Paper] [Project] [Code]
- TrackVLA++: “TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking”, arXiv, Oct 2025. [Paper] [Project]
- InternNav: “Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation”, arXiv, Dec 2025. [Paper] [Project] [Code]
- OmniVLA-Nav: “OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation”, arXiv, Sep 2025. [Paper] [Project] [Code]
Sim-to-Real Transfer
Methods for bridging simulation-trained policies to real-world deployment.
- RE3SIM: “RE3SIM: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation”, arXiv, Feb 2025. [Paper]
- Real-to-Sim-to-Real with VLM-Generated Rewards: “A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards”, arXiv, Feb 2025. [Paper]
- Distributional Real2Sim2Real: “A Distributional Treatment of Real2Sim2Real for Vision-Driven Deformable Linear Object Manipulation”, arXiv, Feb 2025. [Paper]
- Sim-to-Real for Vision-Based Dexterous Manipulation on Humanoids: “Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids”, arXiv, Feb 2025. [Paper] [Project]
- Impact of Static Friction on Sim2Real: “Impact of Static Friction on Sim2Real in Robotic Reinforcement Learning”, arXiv, Mar 2025. [Paper]
- Few-shot Sim2Real: “Few-shot Sim2Real Based on High Fidelity Rendering with Force Feedback Teleop”, arXiv, Mar 2025. [Paper]
- RSR Loop Framework: “An Real-Sim-Real (RSR) Loop Framework for Generalizable Robotic Policy Transfer with Differentiable Simulation”, arXiv, Mar 2025. [Paper] [Code]
- Real2Render2Real: “Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware”, arXiv, May 2025. [Paper] [Project]
Surveys
Comprehensive reviews and taxonomies of VLA, world models, and embodied AI research.
- Foundation Models in Robotics: “Foundation Models in Robotics: Applications, Challenges, and the Future”, IJRR 2025. [Paper] [GitHub]
- Surveys foundation models across perception, decision-making, and control in robotics.
- Learning-based Dynamics Models: “A Review of Learning-based Dynamics Models for Robotic Manipulation”, Science Robotics 2025. [Paper]
- Reviews learned dynamics models for predicting physical interactions in manipulation.
- VLA Anatomy Survey: “An Anatomy of VLA Models: From Modules to Milestones”, arXiv, Dec 2025. [Paper] [Project]
- Structural blueprint of VLA modules, milestones, and five core challenges (representation, execution, generalization, safety, evaluation).
- World Models for Embodied AI: “A Comprehensive Survey on World Models for Embodied AI”, arXiv, Oct 2025. [Paper] [GitHub]
- Taxonomizes world models for embodied agents across prediction, planning, and simulation.
- VLA for Real-World Robotics: “VLA Models for Robotics: Real-World Applications”, arXiv, Oct 2025. [Paper]
- Full-stack review bridging VLA research to practical robotics deployment.
- Pure VLA Survey: “Pure VLA Models: A Comprehensive Survey”, arXiv, Sep 2025. [Paper]
- Taxonomy of VLA paradigms: autoregressive, diffusion, RL, and hybrid methods.
- Large VLM-based VLA: “Large VLM-based VLA for Robotic Manipulation”, arXiv, Aug 2025. [Paper] [GitHub]
- Reviews VLAs built on large pretrained VLMs, comparing monolithic vs hierarchical designs.
- Embodied AI Decision-Making: “Large Model Empowered Embodied AI: Decision-Making”, arXiv, Aug 2025. [Paper]
- Surveys how large models enable embodied decision-making and planning.
- Foundation Model Driven Robotics: “Foundation Model Driven Robotics”, arXiv, Jul 2025. [Paper]
- Overview of foundation models transforming perception, planning, and control in robotics.
- Action Tokenization Survey: “VLA Survey: An Action Tokenization Perspective”, PKU-PsiBot, Jul 2025. [Paper]
- Analyzes VLA design through action token formats (language, code, affordance, trajectory, latent).
- VLA for Autonomous Driving: “VLA Models for Autonomous Driving”, arXiv, Jun 2025. [Paper] [GitHub]
- Surveys VLA applications in end-to-end autonomous driving systems.
- VLA Post-Training: “VLA Post-Training and Human Motor Learning”, arXiv, Jun 2025. [Paper] [GitHub]
- Reviews post-training methods for VLAs including RL finetuning and human feedback.
- Deep RL for Robotics: “Deep RL for Robotics: Real-World Successes”, arXiv, Aug 2024. [Paper]
- Surveys successful real-world deployments of deep RL in robotics.
- Diffusion Policy Survey: “Survey on Diffusion Policy for Robotic Manipulation”, TechRxiv 2025. [Paper] [GitHub]
- Reviews diffusion-based policies for robotic manipulation tasks.
- Industrial Robotics Survey: “Embodied Intelligent Industrial Robotics”, arXiv, May 2025. [Paper] [GitHub]
- Surveys embodied AI for industrial automation and manufacturing.
- Neural Brain Framework: “Neural Brain: Framework for Embodied Agents”, arXiv, May 2025. [Paper] [GitHub]
- Proposes unified framework viewing embodied agents through neural architecture lens.
- VLA Concepts Survey: “VLA Models: Concepts, Progress, Applications”, arXiv, May 2025. [Paper]
- Introductory survey covering VLA fundamentals and application domains.
- Navigation with Simulators: “Robotic Navigation with Physics Simulators”, arXiv, May 2025. [Paper]
- Reviews sim-to-real transfer methods for robotic navigation.
- Multimodal Navigation: “Multimodal Perception for Goal-oriented Navigation”, arXiv, Apr 2025. [Paper]
- Surveys multimodal perception fusion for robot navigation tasks.
- Diffusion for Manipulation: “Diffusion Models for Robotic Manipulation”, arXiv, Apr 2025. [Paper]
- Reviews diffusion model applications in manipulation policy learning.
- Dexterous Manipulation Survey: “Dexterous Manipulation through Imitation Learning”, arXiv, Apr 2025. [Paper]
- Surveys imitation learning methods for dexterous robot hands.
- Multimodal Fusion for Robotics: “Multimodal Fusion and VLMs for Robot Vision”, arXiv, Apr 2025. [Paper] [GitHub]
- Reviews multimodal fusion techniques for robot perception systems.
- SE(3)-Equivariant Learning: “SE(3)-Equivariant Robot Learning: Tutorial Survey”, arXiv, Mar 2025. [Paper]
- Tutorial on incorporating geometric symmetries into robot learning.
- Generative AI for Manipulation: “Generative AI in Robotic Manipulation”, arXiv, Mar 2025. [Paper] [GitHub]
- Surveys generative models (diffusion, LLMs, VLMs) for manipulation tasks.
- VLA Survey 2025: “Survey on Vision-Language-Action Models”, arXiv, Feb 2025. [Paper]
- AI-generated survey demonstrating automated literature review for VLAs.
- Embodied Multimodal Models: “Exploring Embodied Multimodal Large Models”, arXiv, Feb 2025. [Paper]
- Surveys multimodal LLMs adapted for embodied reasoning and control.
- General-Purpose Robots Survey: “General-Purpose Robots via Foundation Models”, arXiv, Dec 2023. [Paper]
- Early survey on using foundation models for general-purpose robotics.
- Robot Learning Foundation Models: “Robot Learning in the Era of Foundation Models”, arXiv, Nov 2023. [Paper]
- Surveys how foundation models are reshaping robot learning paradigms.
- Language-conditioned Manipulation: “Language-conditioned Learning for Manipulation”, arXiv, Dec 2023. [Paper]
- Reviews language-guided policy learning for manipulation.
- LLMs for Navigation: “LLMs for Embodied Navigation”, arXiv, Nov 2023. [Paper]
- Surveys LLM applications in robot navigation and exploration.
- Object-Centric Manipulation: “Embodied Learning for Object-Centric Manipulation”, arXiv, Aug 2024. [Paper]
- Reviews object-centric representations for manipulation learning.
- VLA for Embodied AI: “A Survey on VLA Models for Embodied AI”, arXiv, May 2024. [Paper]
- Taxonomizes VLAs by component (vision, language, action) and control policy type.
- Cyber-Physical Alignment: “Aligning Cyber Space with Physical World”, arXiv, Jul 2024. [Paper]
- Surveys bridging digital AI models with physical robot execution.
- VLN Survey: “Vision-Language Navigation: Survey and Taxonomy”, arXiv, 2024. [Paper]
- Comprehensive taxonomy of vision-language navigation methods and benchmarks.
- State of VLA at ICLR 2026: “State of VLA Research at ICLR 2026”, Blog, Oct 2025. [Blog]
- Analysis of VLA trends from ICLR 2026 submissions covering discrete diffusion, ECoT, and tokenizers.
Resources
Datasets & Benchmarks
| Dataset | Scale / Scope | Focus | Links |
|---|---|---|---|
| Open X-Embodiment | 1M+ trajectories, 22 robots | Cross-embodiment | Paper · Project |
| DROID | 76K trajectories, 564 scenes | In-the-wild manipulation | Paper · Project |
| BridgeData V2 | 60K+ trajectories, 24 environments | Few-shot transfer | Paper · Project |
| ARIO | Unified format | Dataset standardization | Paper · Project |
| LIBERO | 130 tasks | Lifelong learning | Paper · Project |
| RoboMIND | Multi-embodiment | Intelligence benchmark | Paper · Project |
| VLABench | Long-horizon | Reasoning benchmark | Paper · Project |
| SIMPLER | Sim-to-real | Policy evaluation | Paper · Project |
| RoboCasa | Large-scale | Household tasks | Paper · Project |
| CALVIN | Long-horizon | Language-conditioned | Paper · Project |
| RLBench | 100 tasks | Manipulation benchmark | Paper · Project |
| ARNOLD | Realistic 3D | Language-grounded | Paper · Project |
| ALFRED | VLN + manipulation | Instruction following | Paper · Project |
| GenSim / GenSim2 | Procedural | Task generation | Paper · Project |
| MineDojo | Minecraft | Open-world learning | Paper · Project |
| RoboTwin 2.0 | Bimanual manipulation | Domain randomization | Paper · Project |
| RoboArena | Distributed evaluation | Real-world benchmark | Paper · Project |
| RoboCerebra | Long-horizon | Manipulation evaluation | Paper · Project |
| DivScene | Diverse scenes | Object navigation | Paper · Project |
| EWMBench | World model evaluation | Scene, motion, semantic | Paper · Code |
| ManipBench | VLM evaluation | Low-level manipulation | Paper · Project |
| RoboTwin | Generative digital twins | Dual-arm benchmark (CVPR 2025) | Paper · Code |
| RoboVerse | Unified platform | Scalable robot learning | Paper · Code |
| AutoEval | Autonomous evaluation | Real-world manipulation | Paper · Project |
| RoboFactory | Multi-robot collaboration | Compositional tasks | Paper · Project |
| BOSS | Observation shift | Long-horizon tasks | Paper |
| OpenBench | Smart logistics | Semantic navigation | Paper |
| EmbSpatial-Bench | Spatial understanding | Embodied tasks | Paper |
| Diverse Behaviors Benchmark | Human demonstrations | Imitation learning | Paper |
| RobotArena ∞ | Real-to-sim | Automatic evaluation | ICLR 2026 Submission |
| RoboCasa365 | 365 tasks, 2k+ scenes | Kitchen manipulation | ICLR 2026 Submission |
| WorldGym | World model evaluation | Policy evaluation | ICLR 2026 Submission |
| ManipulationNet | Zero-shot benchmark | Fair comparison | Project |
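Many of the larger manipulation datasets above (Open X-Embodiment, DROID, BridgeData V2) are distributed in the RLDS episode format and can be streamed with `tensorflow_datasets`. The sketch below is a minimal, hedged example of iterating episodes this way; the `gs://gresearch/robotics` mirror path, the `bridge/0.1.0` version string, and the `observation`/`action` key names are assumptions based on typical RLDS layouts and should be checked against each dataset's project page.

```python
# Minimal sketch: streaming RLDS-formatted robot episodes with tensorflow_datasets.
# The bucket path, dataset version, and feature keys below are assumptions --
# verify them against the dataset's own documentation before relying on this.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0"  # hypothetical dataset/version path
)
ds = builder.as_dataset(split="train[:10]")  # a handful of episodes for inspection

for episode in ds:
    for step in episode["steps"]:        # RLDS: an episode is a nested dataset of steps
        obs = step["observation"]        # dict of camera images, proprioception, etc.
        action = step["action"]          # per-dataset action convention (e.g. EEF deltas)
        # ... hand (obs, action) pairs to your imitation-learning or VLA training pipeline
```

Frameworks such as Octo and OpenVLA ship their own RLDS data loaders with normalization and dataset-mixture weighting; the loop above is only meant to show the raw episode structure.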
Simulation Platforms
| Platform | Focus | Links |
|---|---|---|
| ManiSkill3 | GPU-parallelized robotics | Paper · Project |
| Genesis | Differentiable physics | Project |
| Isaac Lab / Isaac Sim | NVIDIA robotics simulation | Project |
| MuJoCo Playground | GPU-accelerated training environments on MuJoCo MJX | Paper · Project |
| OmniGibson | High-fidelity home simulation | Paper · Project |
| Habitat 2.0 | Navigation & rearrangement | Paper · Project |
| BEHAVIOR-1K | 1,000 everyday activities | Paper · Project |
| iGibson | Interactive environments | Paper · Project |
| RoboSuite | Modular manipulation | Paper · Project |
| PyBullet | Lightweight physics for RL | Project |
| DexGarmentLab | Garment manipulation | Paper · Project |
| MuBlE | Task planning benchmark | Paper · Code |
| LocoMuJoCo | Locomotion benchmark | Docs · Code |
| BEHAVIOR Robot Suite | Whole-body manipulation | Paper · Project |
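Despite their different physics backends, most of the platforms above either expose or can be wrapped in a Gymnasium-style `reset`/`step` interface, a convention that policy-evaluation suites such as SIMPLER also follow. The loop below is a minimal sketch against the standard Gymnasium API; the environment id and the random-action policy are placeholders, since each platform (ManiSkill3, Isaac Lab, RoboSuite, ...) registers its own task ids and observation spaces.

```python
# Minimal rollout loop against the standard Gymnasium API.
# The environment id and the random policy are placeholders -- substitute the
# simulator's own registered task id and a trained policy for real evaluation.
import gymnasium as gym

env = gym.make("CartPole-v1")            # placeholder id; robotics platforms register their own
obs, info = env.reset(seed=0)

episode_return = 0.0
for _ in range(500):
    action = env.action_space.sample()   # stand-in for policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        print(f"episode return: {episode_return:.2f}")
        episode_return = 0.0
        obs, info = env.reset()

env.close()
```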
Companies & Projects
Companies
| Company | Focus | Key Products | Links |
|---|---|---|---|
| Physical Intelligence (π) | General-purpose robot foundation models | π₀, π₀.5, π₀.6, FAST | Web · Blog |
| Google DeepMind | Robotics research | RT-1/2, Gemini Robotics, Genie, PaLM-E | Web · Blog |
| OpenAI | AI research | CLIP, GPT-4V, Sora | Web · Blog |
| Meta AI (FAIR) | JEPA, embodied AI | I-JEPA, V-JEPA, R3M, DINOv2, SAM | Web · Blog |
| World Labs | Spatial intelligence & world models | Marble, RTFM | Web |
| NVIDIA | Simulation & foundation models | GR00T, Isaac Sim, Cosmos | Web · Blog |
| Microsoft Research | Multimodal agents | Magma | Web · Blog |
| Hugging Face | Open-source VLAs | LeRobot, SmolVLA | Web · Blog |
| ByteDance | Vision-language-action models | GR-1, GR-2, CogACT | Web |
| Shanghai AI Lab | Embodied AI research | LEO, InternVL | Web |
| Covariant | Industrial robotics AI | RFM-1 | Web · Blog |
| Skild AI | General-purpose robot brain | Skild Brain | Web |
| RLWRLD | Industrial robotics foundation models | RFM | Web · Blog |
| Intrinsic (Alphabet) | Industrial robotics software | Flowstate | Web · Blog |
| Wayve | Embodied AI for driving | GAIA-1, LINGO | Web · Blog |
| Cortex AI | Real-world data for embodied AI | Egocentric + robot datasets | Web |
| Verne Robotics | Mobile manipulation | Nemo | Web |
| Figure AI | Humanoid robots | Figure 01, Figure 02 | Web |
| 1X Technologies | Humanoid robots | NEO, EVE | Web · Blog |
| Boston Dynamics | Advanced robotics | Atlas, Spot, Stretch | Web · Blog |
| Tesla | Humanoid robots | Optimus | Web |
| Agility Robotics | Bipedal robots | Digit | Web · Blog |
| Unitree | Quadruped & humanoid robots | H1, G1, Go2 | Web |
| Sanctuary AI | Humanoid robots | Phoenix | Web · Blog |
| Apptronik | Humanoid robots | Apollo | Web |
| Fourier Intelligence | Humanoid & rehab robots | GR-1, GR-2 | Web |
| Hello Robot | Mobile manipulation | Stretch | Web · Blog |
| Franka Robotics | Research robot arms | Panda | Web |
| Universal Robots | Collaborative robot arms | UR3, UR5, UR10 | Web · Blog |
| UFACTORY | Affordable robot arms | xArm | Web |
| Trossen Robotics | Research platforms | ViperX, WidowX, ALOHA | Web |
| Flexiv | Adaptive robot arms | Rizon | Web |
Research Labs & Initiatives
| Organization | Notable Contributions | Links |
|---|---|---|
| Stanford IRIS Lab | Diffusion Policy, MimicPlay | Web |
| Stanford SVL | BEHAVIOR, OmniGibson, VoxPoser | Web |
| Stanford ILIAD | ACT, ALOHA, Mobile ALOHA | Web |
| Berkeley RAIL | Octo, BridgeData, R3M | Web |
| Berkeley BAIR | RT-X contributions, RoboAgent | Web · Blog |
| CMU Robotics Institute | HomeRobot, OK-Robot | Web |
| MIT CSAIL | LLM planning, manipulation | Web |
| NYU CILVR | OPEN TEACH, DynaMo, World Models | Web |
| Princeton REAL Lab | Manipulation research | Web |
| Columbia Robotics | Diffusion Policy, CLIPort | Web |
| Georgia Tech RIPL | LLM-Robotics survey | Web |
| UW RSE Lab | CLIPort, VLAs | Web |
| Toyota Research Institute | Prismatic VLMs, OpenVLA | Web · Blog |
| Tsinghua MARS Lab | LEO, CogACT | Web |
| Peking University | NaVid, various VLAs | Web |
| Open X-Embodiment | OXE dataset, RT-X | Web |
| DROID Collaboration | DROID dataset | Web |
Related Works
Other awesome lists and resources for robotics and embodied AI.
- Awesome World Models: [GitHub]
- Awesome-VLA-Robotics: [GitHub]
- Awesome-VLA-RL: [GitHub]
- Awesome-Robotics-Foundation-Models: [GitHub]
- Awesome-Generalist-Agents: [GitHub]
- Awesome-LLM-Robotics: [GitHub]
- Awesome World Models for Robotics: [GitHub]
- Awesome-VLA-Post-Training: [GitHub]
- Awesome-BFM-Papers: [GitHub]
- Awesome Embodied VLA/VA/VLN: [GitHub]
Citation
If you find this repository useful, please consider citing this list:
@misc{awesome-physical-ai,
  title        = {Awesome Physical AI},
  author       = {Keon Kim},
  howpublished = {GitHub repository},
  url          = {https://github.com/keon/awesome-physical-ai},
  year         = {2026},
}
Contributing
We welcome contributions! Please submit a pull request to add relevant papers, correct errors, or improve organization.
Guidelines
- Focus on Physical AI papers (robotics, embodied agents, world models, VLAs)
- Each paper should appear in only one category
- Include proper citations with links to papers, projects, and code
- Verify all links are working