Foundation Models Meet Embodied Agents

An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of foundation models, which have shown remarkable success in supporting embodied agents across abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects).

We categorize the foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we comprehensively review existing paradigms of foundation models for embodied agents, focusing on their different formulations under the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP). We also design a structured view for investigating the robot's decision-making process.
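For readers unfamiliar with the MDP framework mentioned above, the following is a minimal sketch of the standard textbook definition. The symbols are conventional notation chosen here for illustration and are an assumption; the tutorial itself may use different notation or a variant of this formulation.

```latex
% Standard (textbook) MDP definition; notation is assumed, not taken from the tutorial.
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \gamma)
\]
% S: state space, A: action space,
% T(s' | s, a): transition probability, R(s, a): reward, gamma in [0, 1): discount factor.
\[
  \pi^{*} = \arg\max_{\pi} \;
  \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right],
  \qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim T(\cdot \mid s_t, a_t)
\]
```

Under this reading, transition modeling roughly corresponds to estimating $T$, while action sequencing corresponds to choosing the actions $a_t$. This mapping is an illustrative interpretation rather than a claim about how the tutorial organizes the material.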

This tutorial will present a systematic overview of recent advances in foundation models for embodied agents. We compare these models and explore their design space to guide future developments, focusing on two areas: Lower-Level Environment Encoding and Interaction, and Longer-Horizon Decision Making.

🔗 More details on the ICCV 2025 tutorial page.

Schedule

Session	Duration	Time (HST)	Presenter	Slides/Video
Motivation and Overview	15min	1:00-1:15 PM	Manling Li	Slides, Video (Upcoming)
Foundation Models meet Virtual Agents	45min	1:15-2:00 PM	Manling Li	Slides, Video (Upcoming)
Foundation Models meet Physical Agents: Overview & Perception	25min	2:00-2:25 PM	Jiayuan Mao	Slides, Video (Upcoming)
Foundation Models meet Physical Agents: High-Level and Low-Level Decision Making	50min	2:25-3:15 PM	Wenlong Huang	Slides, Video (Upcoming)
Break	30min	3:15-3:45 PM
Robotic Foundation Models	30min	3:45-4:15 PM	Yunzhu Li	Slides, Video (Upcoming)
Remaining Challenges	15min	4:15-4:30 PM	Yunzhu Li	Slides, Video (Upcoming)
QA	30min	4:30-5:00 PM