Foundation Models Meet Embodied Agents
@ CVPR 2025 Workshop
Wed June 11th, 2025, Room 214
at the Music City Center, Nashville, TN
Submission Topics
An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of Large Language Models as powerful tools for building Large Agent Models, which have shown remarkable success in equipping embodied agents with abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects).
However, moving from Foundation Models to Embodied Agents poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for reliable embodied decision-making. We will cover advances in foundation models, spanning Large Language Models, Vision-Language Models, and Vision-Language-Action Models. In this workshop, we will comprehensively review existing paradigms of foundation models for embodied agents, examine their different formulations within the Markov Decision Process (MDP), the fundamental mathematical framework of robot learning, and present a structured view of the robot’s decision-making process.
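For readers less familiar with this framework, the minimal sketch below recalls the standard MDP definition that the structured view above builds on; the notation is the usual textbook one and is offered for orientation only, not as the formulation of any particular workshop paper.

```latex
% Minimal sketch of the standard MDP tuple referenced above
% (textbook notation; not taken from any specific workshop paper).
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
\[
  \mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle,
  \qquad P(s_{t+1} \mid s_t, a_t), % transition model: preconditions to post-effects
  \qquad R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}, % reward / goal specification
  \qquad \gamma \in [0, 1)
\]
The agent seeks a policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return
$\mathbb{E}\bigl[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t)\bigr]$; goal interpretation and
subgoal decomposition shape $R$, action sequencing corresponds to $\pi$, and transition
modeling corresponds to $P$.
\end{document}
```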
We welcome submissions on all topics related to Foundation Models and their interactions with Embodied Agents. We will also announce a Best Paper Award at our workshop.
Submission Instructions
We welcome submissions covering:
All formats allow unlimited references and appendices.
Contributions are non-archival but will be hosted on the workshop website, so dual submission is allowed where permitted by third parties. We welcome papers that are under review at, or have already been accepted by, other conferences; if this applies to your paper, please state it in the last sentence of the abstract. Paper awards will give preference to original submissions.
Submissions should follow the CVPR two-column style and be anonymized; see the CVPR-25 author kit for details. Please submit through the OpenReview submission portal.
We are looking for program committee members. Please sign up via this form.
All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).
Event | Date |
---|---|
Submission Deadline | |
Call for Program Committee Members | |
Decision Notifications | May 25th, 2025 (23:59 AoE) |
Camera-Ready Deadline (Non-Archival) | May 31st, 2025 (23:59 AoE) |
Workshop Date | June 11th, 2025 |
Time | Program |
---|---|
09:00-09:10 | Opening Remarks |
09:10-09:55 | Keynote Speech - Ranjay Krishna: Behaviors & bodies: how they shape one another [In person] Existing robot datasets contain a few embodiments but very few demonstrations per embodiment, resulting in no cross-embodiment generalization. In this talk, I will explore the possibility of using synthetic data to develop a foundation model for embodied navigation. I will first introduce our work on creating diverse synthetic environments from schools, to offices, to museums, to houses. Then, I will introduce a series of training algorithms that scale up navigation agents with synthetic training data, demonstrating sim-to-real transfer, and cross-embodiment generalization. Finally, I will highlight how embodiment-agnostic policies can enable tractable search algorithms, allowing us to explore the space of possible morphologies to identify effective ones that generalize to new unseen tasks. |
09:55-10:30 | Oral Presentations: Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks; 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model (Best Paper); One Demo Is All It Takes: Planning Domain Derivation with LLMs from a Single Demonstration |
10:30-11:00 | Coffee Break & Poster Session at ExHall D |
11:00-11:45 | Keynote Speech - Jitendra Malik [Virtual] |
11:45-12:20 | Oral Presentations: Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations; SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation; VAGEN: Reinforcing Multi-Turn Visual State Reasoning for VLM Agents |
12:20-13:30 | Lunch Break + Student Mentoring Session |
13:30-14:15 | Keynote Speech - Katerina Fragkiadaki: Goal-Driven Reasoning for Multimodal AI Agents [In person] While large language models (LLMs) have shown impressive capabilities in text-based tasks, they remain limited in their ability to operate as autonomous agents capable of performing complex, real-world activities. Realizing applications such as web-based agents, augmented reality assistants, and personal robots requires moving beyond static knowledge retrieval toward systems with persistent memory, multimodal reasoning, and interactive learning. In this talk, I will present methods that address these key challenges: (1) scalable memory architectures that enable agents to plan and reason over extended time horizons; (2) human-in-the-loop learning frameworks that enhance internal reasoning and decision-making through exploration and feedback; (3) a reinforcement learning approach for grounded visual reasoning, where textual thoughts are explicitly grounded in image evidence. Together, these contributions move us closer to building agents that can perceive, plan, adapt, and improve continuously in dynamic environments, bridging the gap between today’s static LLMs and tomorrow’s general-purpose autonomous systems. |
14:15-15:00 | Keynote Speech - Shuang Li: How Vision and Language Models Are Changing Decision-Making [Virtual] Recent advances in vision and language models are transforming decision-making processes in robotics and automation. In this talk, we present several innovative frameworks that leverage language models as high-level planners to break down complex tasks into manageable subtasks, while employing video generation models as low-level controllers to execute actions in dynamic environments. We also introduce a novel unified video-and-action method that overcomes the limitations of directly applying video generation techniques to robotics and enables rapid policy inference. This integrated approach significantly enhances both the interpretability and efficiency of robotic systems. |
15:00-15:10 | Abaka AI: Leading multimodal, text, and robotics data collection and annotation - Tom Tang |
15:10-15:30 | Coffee Break |
15:30-16:15 | Keynote Speech - Yilun Du: Building Flexible Embodied Agents through Compositional World Modeling [In person] A major bottleneck in constructing intelligent embodied agents is the lack of available data for all the settings the agent might find itself in. I’ll illustrate how we can operate well in such scenarios by building a “world model” and then using inference/planning to solve new tasks the agent encounters. I’ll present a particular instantiation of such a “world model”, using compositional energy functions, which enables models to generalize to areas where we do not have data. I’ll illustrate a set of results using this approach across perception, reasoning, and decision making. |
16:15-17:00 | Keynote Speech - Michael Black: Towards the 3D Human Foundation Agent [In person] This talk will describe current progress on building a 3D Human Foundation Agent (HFA) that can perceive the world and the humans in it. The HFA is a digitally embodied agent that understands human behavior and responds to it using its “motor system” to translate its goals into 3D actions. The Human Foundation Agent must (1) perceive human movement in 3D, (2) understand the goals, implications, and emotions inherent in that movement, and (3) plan and generate natural motor activity to (4) drive a digital or physical embodiment that interacts with real or virtual humans in real or virtual 3D worlds. This talk will focus on current progress and the path to building HFAs through 3D human motion capture from video, synthetic training data, generative behavior modeling, AI-driven graphics, and large vision-language models that are fine-tuned to understand 3D humans. HFAs will radically change how people interact with machines. So much so that a child born today will have trouble imagining a world in which technology doesn’t understand their motions and behaviors. |
17:00-17:05 | Best Paper Announcement |
17:05-17:15 | QA & Closing Remarks |
Accepted Papers
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks (oral)
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model (oral)
Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang
One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration (oral)
Jinbang Huang, Yixin Xiao, Zhanguang Zhang, Mark Coates, Jianye Hao, Yingxue Zhang
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations (oral)
Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation (oral)
Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, Jiafei Duan
VAGEN: Reinforcing Multi-Turn Visual State Reasoning for VLM Agents (oral)
Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li
Embodied AI with Knowledge Graphs: Material-Aware Obstacle Handling for Autonomous Agents
Ayush Bheemaiah, Seungyong Yang
Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
Bosung Kim, Prithviraj Ammanabrolu
Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
Slot-Level Robotic Placement via Visual Imitation from Single Human Video
Dandan Shan, Kaichun Mo, Wei Yang, Yu-Wei Chao, David Fouhey, Dieter Fox, Arsalan Mousavian
Interactive Post-Training for Vision-Language-Action Models
Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Kraehenbuehl
Human-like Navigation in a World Built for Humans
Bhargav Chandaka, Gloria X. Wang, Haozhe Chen, Henry Che, Albert J. Zhai, Shenlong Wang
AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives
Aniruddh Sikdar, Aditya Gandhamal, Suresh Sundaram
Episodic Memory Banks for Lifelong Robot Learning: A Case Study Focusing on Household Navigation and Manipulation
Zichao Li
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
Navid Rajabi, Jana Kosecka
Uncertainty Modeling in Autonomous Vehicle Trajectory Prediction: A Comprehensive Survey
Siddharth Raina, Jeshwanth Challagundla, Mantek Singh
Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation
Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, Haoping Xu, Guowei Huang, Zhanpeng Zhang, Tongtong Cao, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang, Yingxue Zhang
SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models
Arnab Debnath, Gregory J. Stein, Jana Kosecka
ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
Junyao Shi, Zhuolun Zhao, Tianyou Wang, Ian Pedroza, Amy Luo, Jie Wang, Yecheng Jason Ma, Dinesh Jayaraman
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
Baiting Luo, Abhishek Dubey, Ayan Mukhopadhyay
Please email cvpr2025-foundationmodel-embodied@googlegroups.com if you have any questions.