Foundation Models Meet Embodied Agents

@ ICCV 2025 Tutorial

4:00 PM – 8:00 PM PDT, October 20, 2025

Hawaii Convention Center, Honolulu, Hawaii


An embodied agent is a generalist agent that takes natural language instructions from humans and performs a wide range of tasks in diverse environments. Recent years have seen the emergence of foundation models, which have proven remarkably effective at supporting embodied agents across abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects).

We categorize foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we will comprehensively review existing paradigms of foundation models for embodied agents, focusing on how their formulations differ under the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP). We will also present a structured view for investigating the robot's decision-making process.
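
For readers unfamiliar with the MDP framing mentioned above, the following is a minimal sketch of a deterministic toy MDP and a discounted rollout. It is an illustration only and is not taken from the tutorial materials: the `MDP` class, the `rollout` helper, the three-state grasping task, the reward, and the scripted policy are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    """Hypothetical container for an MDP (S, A, T, R, gamma)."""
    states: List[State]
    actions: List[Action]
    # Deterministic transitions for simplicity: (state, action) -> next state.
    transition: Dict[Tuple[State, Action], State]
    reward: Callable[[State, Action, State], float]
    gamma: float = 0.9


def rollout(mdp: MDP, policy: Callable[[State], Action],
            start: State, horizon: int) -> float:
    """Follow `policy` for `horizon` steps and return the discounted return."""
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        next_state = mdp.transition[(state, action)]
        total += discount * mdp.reward(state, action, next_state)
        discount *= mdp.gamma
        state = next_state
    return total


# Toy task (an assumption for illustration): reach an object, then grasp it.
toy = MDP(
    states=["far", "near", "holding"],
    actions=["move", "grasp"],
    transition={
        ("far", "move"): "near",
        ("far", "grasp"): "far",
        ("near", "move"): "near",
        ("near", "grasp"): "holding",
        ("holding", "move"): "holding",
        ("holding", "grasp"): "holding",
    },
    # Reward 1 only on the step where the object is first picked up.
    reward=lambda s, a, s2: 1.0 if (s != "holding" and s2 == "holding") else 0.0,
    gamma=0.9,
)


def scripted_policy(state: State) -> Action:
    """Hand-written stand-in for a learned or model-generated policy."""
    return "move" if state == "far" else "grasp"


discounted_return = rollout(toy, scripted_policy, start="far", horizon=3)
```

In this framing, the abilities listed earlier map loosely onto MDP components: the goal is expressed through the reward, action sequencing corresponds to the policy, and transition modeling corresponds to the transition function. This mapping is an interpretive gloss for the sketch, not a claim about how the tutorial defines it.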

This tutorial will present a systematic overview of recent advances in foundation models for embodied agents. We compare these models and explore their design space to guide future developments, focusing on two themes: Lower-Level Environment Encoding and Interaction, and Longer-Horizon Decision Making.

🔗 More details on the ICCV 2025 tutorial page.

Presenters

Manling Li

Northwestern University

Yunzhu Li

Columbia University

Wenlong Huang

Stanford University

Contact

Please email manling.li@u.northwestern.edu if you have any questions.