Foundation Models Meet Embodied Agents

@ CVPR 2025 Workshop

Wed June 11th, 2025, Room 214

at the Music City Center, Nashville, TN


Call for Papers

Submission Topics

An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of Large Language Models as powerful tools for building Large Agent Models, which have shown remarkable success in equipping embodied agents with abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects).

However, moving from Foundation Models to Embodied Agents poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for reliable embodied decision-making. We will cover the evolution of foundation models into Large Language Models, Vision-Language Models, and Vision-Language-Action Models. In this workshop, we will comprehensively review existing paradigms of foundation models for embodied agents, examine their different formulations within the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view of the robot's decision-making process.
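For readers less familiar with this framing, here is a minimal sketch of the MDP formulation commonly used in robot learning (the notation below is illustrative only and is not prescribed by the workshop):

\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \gamma), \qquad
  s_{t+1} \sim T(\cdot \mid s_t, a_t), \qquad
  \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
\]

where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) the action space, \(T\) the transition model, \(R\) the reward, and \(\gamma \in [0,1)\) the discount factor. In the embodied setting, the natural-language instruction is typically folded into the goal or reward specification, and the abilities listed above (goal interpretation, subgoal decomposition, action sequencing, and transition modeling) can roughly be read as different components of this tuple.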

We welcome submissions on all topics related to Foundation Models and their interactions with Embodied Agents. We will also announce a Best Paper Award at our workshop.

Submission Instructions

We welcome submissions covering:

  • Research papers: Long papers (8 pages) showcasing novel findings, methods, or theoretical advancements.
  • Short/Abstract papers: Exploratory work (4 pages for short papers or 2 pages for abstracts, excluding references) that may be preliminary but presents innovative concepts, early results, or thought-provoking viewpoints that stimulate discussion and future work.
  • Position papers: Critical perspectives on trends and challenges within the field (at least 8 pages).
  • Survey papers: Thorough reviews of specific topics that map the current research landscape and suggest directions for future exploration (at least 8 pages).

All formats allow unlimited references and appendices.

Contributions are non-archival but will be hosted on our workshop website; dual submission is therefore allowed where permitted by the other venue. We welcome submissions that are under review at or have been accepted by other conferences; please note this in the last sentence of the paper abstract. Paper awards will give preference to original submissions.

Submissions should follow the CVPR two-column style and be anonymized; see the CVPR 2025 author kit for details. Please submit through the OpenReview submission portal.

We are looking for program committee members; please sign up via this form.

Important Dates

All deadlines are 11:59 pm UTC-12h (“Anywhere on Earth”).

Submission Deadline: May 17th, 2025 (extended from May 1st, 2025)
Call for Program Committee Members: May 17th, 2025 (extended from May 1st, 2025)
Decision Notifications: May 25th, 2025
Camera-Ready Deadline (Non-Archival): May 31st, 2025
Workshop Date: June 11th, 2025

Speakers

Jitendra Malik

UC Berkeley

Yilun Du

Google DeepMind; Harvard

Shuang Li

Stanford University

Ranjay Krishna

University of Washington; AI2

Katerina Fragkiadaki

Carnegie Mellon University

Schedule

Time Program
09:00-09:10 Opening Remarks
09:10-09:55 Keynote Speech - Ranjay Krishna: Behaviors & bodies: how they shape one another [In person] Existing robot datasets contain a few embodiments but very few demonstrations per embodiment, resulting in no cross-embodiment generalization. In this talk, I will explore the possibility of using synthetic data to develop a foundation model for embodied navigation. I will first introduce our work on creating diverse synthetic environments from schools, to offices, to museums, to houses. Then, I will introduce a series of training algorithms that scale up navigation agents with synthetic training data, demonstrating sim-to-real transfer, and cross-embodiment generalization. Finally, I will highlight how embodiment-agnostic policies can enable tractable search algorithms, allowing us to explore the space of possible morphologies to identify effective ones that generalize to new unseen tasks.
09:55-10:30 Oral Presentation
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model (Best Paper)
One Demo Is All It Takes: Planning Domain Derivation with LLMs from a Single Demonstration
10:30-11:00 Coffee Break & Poster Session at ExHall D
11:00-11:45 Keynote Speech - Jitendra Malik [Virtual]
11:45-12:20 Oral Presentation
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
VAGEN: Reinforcing Multi-Turn Visual State Reasoning for VLM Agents
12:20-13:30 Lunch Break + Student Mentoring Session
13:30-14:15 Keynote Speech - Katerina Fragkiadaki: Goal-Driven Reasoning for Multimodal AI Agents [In person] While large language models (LLMs) have shown impressive capabilities in text-based tasks, they remain limited in their ability to operate as autonomous agents capable of performing complex, real-world activities. Realizing applications such as web-based agents, augmented reality assistants, and personal robots requires moving beyond static knowledge retrieval toward systems with persistent memory, multimodal reasoning, and interactive learning. In this talk, I will present methods that address these key challenges: (1) scalable memory architectures that enable agents to plan and reason over extended time horizons; (2) human-in-the-loop learning frameworks that enhance internal reasoning and decision-making through exploration and feedback; (3) a reinforcement learning approach for grounded visual reasoning, where text thoughts are explicitly grounded on image evidence. Together, these contributions move us closer to building agents that can perceive, plan, adapt, and improve continuously in dynamic environments—bridging the gap between today’s static LLMs and tomorrow’s general-purpose autonomous systems.
14:15-15:00 Keynote Speech - Shuang Li: How Vision and Language Models Are Changing Decision-Making [Virtual] Recent advances in vision and language models are transforming decision-making processes in robotics and automation. In this talk, we present several innovative frameworks that leverage language models as high-level planners to break down complex tasks into manageable subtasks, while employing video generation models as low-level controllers to execute actions in dynamic environments. We also introduce a novel unified video-and-action method that overcomes the limitations of directly applying video generation techniques to robotics and enables rapid policy inference. This integrated approach significantly enhances both the interpretability and efficiency of robotic systems.
15:00-15:10 Abaka AI: Leading multimodal, text, and robotics data collection and annotation - Tom Tang
15:10-15:30 Coffee Break
15:30-16:15 Keynote Speech - Yilun Du: Building Flexible Embodied Agents through Compositional World Modeling [In person] A major bottleneck towards constructing intelligent embodied agents is the lack of available data for all the settings the agent might find itself in. I’ll illustrate how we can operate well in such scenarios by building a “world model” and then using inference/planning to solve new tasks the agent encounters. I’ll present a particular instantiation of such a “world model”, using compositional energy functions, which enables models to generalize in areas where we do not have data. I’ll illustrate a set of results using such an approach across perception, reasoning, and decision making.
16:15-17:00 Keynote Speech - Michael Black: Towards the 3D Human Foundation Agent [In person] This talk will describe current progress on building a 3D Human Foundation Agent (HFA) that can perceive the world and the humans in it. The HFA is a digitally embodied agent that understands human behavior and responds to it using its “motor system” to translate its goals into 3D actions. The Human Foundation Agent must (1) perceive human movement in 3D, (2) understand the goals, implications, and emotions inherent in that movement, and (3) plan and generate natural motor activity to (4) drive a digital or physical embodiment that interacts with real or virtual humans in real or virtual 3D worlds. This talk will focus on current progress and the path to building HFAs through 3D human motion capture from video, synthetic training data, generative behavior modeling, AI-driven graphics, and large vision-language models that are fine-tuned to understand 3D humans. HFAs will radically change how people interact with machines. So much so that a child born today will have trouble imagining a world in which technology doesn’t understand their motions and behaviors.
17:00-17:05 Best Paper Announcement
17:05-17:15 Q&A & Closing Remarks

Accepted Papers

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks (oral)
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model (oral)
Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang

One Demo Is All It Takes: Planning Domain Derivation with LLMs from a Single Demonstration (oral)
Jinbang Huang, Yixin Xiao, Zhanguang Zhang, Mark Coates, Jianye Hao, Yingxue Zhang

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations (oral)
Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation (oral)
Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, Jiafei Duan

VAGEN: Reinforcing Multi-Turn Visual State Reasoning for VLM Agents (oral)
Kangrui Wang*, Pingyue Zhang*, Zihan Wang*, Yaning Gao*, Linjie Li*, Qineng Wang, Hanyang Chen, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li

Embodied AI with Knowledge Graphs: Material-Aware Obstacle Handling for Autonomous Agents
Ayush Bheemaiah, Seungyong Yang

Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
Bosung Kim, Prithviraj Ammanabrolu

Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao

Slot-Level Robotic Placement via Visual Imitation from Single Human Video
Dandan Shan, Kaichun Mo, Wei Yang, Yu-Wei Chao, David Fouhey, Dieter Fox, Arsalan Mousavian

Interactive Post-Training for Vision-Language-Action Models
Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Kraehenbuehl

Human-like Navigation in a World Built for Humans
Bhargav Chandaka, Gloria X. Wang, Haozhe Chen, Henry Che, Albert J. Zhai, Shenlong Wang

AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives
Aniruddh Sikdar, Aditya Gandhamal, Suresh Sundaram

Episodic Memory Banks for Lifelong Robot Learning: A Case Study Focusing on Household Navigation and Manipulation
Zichao Li

TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
Navid Rajabi, Jana Kosecka

Uncertainty Modeling in Autonomous Vehicle Trajectory Prediction: A Comprehensive Survey
Siddharth Raina, Jeshwanth Challagundla, Mantek Singh

Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation
Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, Haoping Xu, Guowei Huang, Zhanpeng Zhang, Tongtong Cao, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang, Yingxue Zhang

SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models
Arnab Debnath, Gregory J. Stein, Jana Kosecka

ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
Junyao Shi, Zhuolun Zhao, Tianyou Wang, Ian Pedroza, Amy Luo, Jie Wang, Yecheng Jason Ma, Dinesh Jayaraman

Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
Baiting Luo, Abhishek Dubey, Ayan Mukhopadhyay

Organizers

Organizing Committee @ CVPR25

Manling Li

Northwestern University

Weiyu Liu

Stanford University

Ruohan Zhang

Stanford University

Yunzhu Li

Columbia University

Zihan Wang

Northwestern University

Qineng Wang

Northwestern University

Wenlong Huang

Stanford University

Xiaohan Zhang

Boston Dynamics AI Institute

Jiajun Wu

Stanford University

Steering Committee @ CVPR25

Yejin Choi

NVIDIA, Stanford University

Fei-Fei Li

Stanford University

Sponsors

Contact

Please email cvpr2025-foundationmodel-embodied@googlegroups.com if you have any questions.