Foundation Models Meet Embodied Agents
@ 2025 BEHAVIOR Challenge
December 6-7, 2025
San Diego Convention Center, San Diego, California
Robots in the BEHAVIOR simulator perform everyday activities (like preparing food) in virtual home environments. BEHAVIOR (Benchmark for Everyday Household Activities in Virtual, Interactive, and Realistic environments) is a large-scale embodied AI benchmark with 1,000 defined household tasks grounded in real human needs. These tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications.
Even state-of-the-art robot learning models still struggle with the complexity and extended duration of BEHAVIOR's activities, which is why we are thrilled to announce the 1st BEHAVIOR Challenge at NeurIPS 2025. This competition invites the community to tackle 50 full-length tasks in a realistic simulator, pushing the frontiers of both high-level planning and low-level control in house-scale environments.
Participants will need to make progress on hierarchical planning, robust perception under realistic visual conditions, and reliable manipulation across long-horizon episodes. By focusing on full-length, human-scale household tasks, the challenge aims to surface the practical limitations of current methods and drive advances that matter for real-world robot deployments.
More information is available on the official 2025 BEHAVIOR Challenge website.
The benchmark includes 1,000 everyday household activities covering diverse behaviors across:
50 fully interactive scenes with house-scale layouts
10,000+ richly annotated objects
The simulation environment supports:
The benchmark includes 10,000 human-demonstrated trajectories with diverse behaviors across all task categories. Each demonstration contains:
Participants have access to training and evaluation pipelines for these baseline methods:
We have two tracks for the 2025 BEHAVIOR challenge:
Participants use only provided state observations:
Participants can query for any privileged information:
Task success rate averaged across 50 tasks.
Partial successes are counted as:
satisfied BDDL predicates ÷ total goal predicates
* Secondary metrics normalized using human averages from 200 demonstrations per task
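The partial-success formula above can be sketched as a short function. This is a minimal illustration only, not the official evaluation toolkit; the function name and the boolean-list representation of goal predicates are assumptions for the example.

```python
def partial_success(predicates_satisfied: list[bool]) -> float:
    """Fraction of BDDL goal predicates satisfied at episode end.

    Each entry marks whether one goal predicate holds. A fully
    successful episode scores 1.0; an empty goal set scores 0.0.
    (Hypothetical helper; not the official BEHAVIOR evaluator.)
    """
    if not predicates_satisfied:
        return 0.0
    return sum(predicates_satisfied) / len(predicates_satisfied)


# Example: 3 of 4 goal predicates hold when the episode ends.
score = partial_success([True, True, True, False])
print(score)  # 0.75
```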
Training instances and human demonstrations (200 per task) released publicly
20 validation instances provided
Evaluate 5 times per instance, submit scores via Google Form
20 held-out instances for final evaluation
Top-5 solutions evaluated after November 15th, 2025 freeze
Instance Variations: Each instance differs in initial object states and initial robot poses.
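The self-evaluation protocol above (5 runs on each of the 20 validation instances) reduces to a simple mean-of-means. The sketch below shows one way to aggregate per-run scores before self-reporting; the function name, dictionary layout, and score values are illustrative assumptions, not part of the official submission format.

```python
def aggregate(scores_per_instance: dict[str, list[float]]) -> float:
    """Average the 5 runs of each instance, then average across instances.

    (Hypothetical aggregation helper; the official scoring is done by
    the challenge organizers on held-out instances.)
    """
    instance_means = [
        sum(runs) / len(runs) for runs in scores_per_instance.values()
    ]
    return sum(instance_means) / len(instance_means)


# Made-up scores for two validation instances, 5 runs each.
scores = {
    "instance_00": [1.0, 1.0, 0.75, 1.0, 0.5],
    "instance_01": [0.0, 0.25, 0.25, 0.0, 0.25],
}
print(aggregate(scores))  # 0.5
```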
Every Monday and Thursday, 4:30pm-6:00pm PST
Join us over Zoom for support and Q&A
Submit your results and models via our official Google Form:
To self-report your performance:
💡 We encourage submitting intermediate results to be showcased on our leaderboard!
Hardware Requirements:
Note: The same model with different checkpoints from the same team will be considered a single entry.
Model Serving Options: