An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of Large Language Models as powerful tools for building Large Agent Models, which have shown remarkable success in supporting embodied agents in abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects).
However, moving from Foundation Models to Embodied Agents poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for reliable embodied decision-making. We will cover advances in foundation models across Large Language Models, Vision-Language Models, and Vision-Language-Action Models. In this tutorial, we will comprehensively review existing paradigms of foundation models for embodied agents, focus on their different formulations under the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view of the robot’s decision-making process.
Session | Duration | Time | Presenter | Slides/Video |
---|---|---|---|---|
Motivation and Overview | 15min | 08:30-08:45 | Manling Li | Slides, Video (Upcoming) |
Foundation Models meet Virtual Agents | 45min | 08:45-09:30 | Manling Li | Slides, Video (Upcoming) |
Foundation Models meet Physical Agents: Overview & High-level Decision Making | 25min | 09:30-09:55 | Jiayuan Mao | Slides, Video (Upcoming) |
Foundation Models meet Physical Agents: Low-level Decision Making | 50min | 09:55-10:45 | Wenlong Huang | Slides, Video (Upcoming) |
Break | 30min | 10:45-11:15 | | |
Robotic Foundation Models | 30min | 11:15-11:45 | Yunzhu Li | Slides, Video (Upcoming) |
Remaining Challenges | 15min | 11:45-12:00 | Yunzhu Li | Slides, Video (Upcoming) |
Q&A | 30min | 12:00-12:30 | | |