Foundation Models Meet Embodied Agents
@ 2025 BEHAVIOR Challenge
December 6-7, 2025
San Diego Convention Center, San Diego, California
Robots in the BEHAVIOR simulator perform everyday activities (like preparing food) in virtual home environments. BEHAVIOR (Benchmark for Everyday Household Activities in Virtual, Interactive, and Realistic environments) is a large-scale embodied AI benchmark with 1,000 defined household tasks grounded in real human needs. These tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications.
Even state-of-the-art robot learning models still struggle with the complexity and extended duration of BEHAVIOR's activities, which is why we are thrilled to announce the 1st BEHAVIOR Challenge at NeurIPS 2025. This competition invites the community to tackle 50 full-length tasks in a realistic simulator, pushing the frontiers of both high-level planning and low-level control in house-scale environments.
Participants will need to make progress on hierarchical planning, robust perception under realistic visual conditions, and reliable manipulation across long-horizon episodes. By focusing on full-length, human-scale household tasks, the challenge aims to surface the practical limitations of current methods and drive advances that matter for real-world robot deployments.
More information is available on the official 2025 BEHAVIOR Challenge website.
The benchmark includes 1,000 everyday household activities covering diverse behaviors across:
50 fully interactive scenes with house-scale layouts
10,000+ richly annotated objects
The simulation environment supports:
The benchmark includes 10,000 human-demonstrated trajectories with diverse behaviors across all task categories. Each demonstration contains:
Participants have access to training and evaluation pipelines for these baseline methods:
We have two tracks for the 2025 BEHAVIOR challenge:
Participants use only provided state observations:
Participants can query for any privileged information:
Task success rate averaged across 50 tasks.
Partial successes are counted as:
satisfied BDDL predicates ÷ total goal predicates
* Secondary metrics normalized using human averages from 200 demonstrations per task
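The partial-success formula above can be sketched as a short function. This is a minimal illustration only, not the official evaluation toolkit; the function name and the boolean-list representation of goal predicates are assumptions for the example.

```python
def partial_success(predicates_satisfied: list[bool]) -> float:
    """Fraction of BDDL goal predicates satisfied at episode end.

    Each entry marks whether one goal predicate holds. A fully
    successful episode scores 1.0; an empty goal set scores 0.0.
    (Hypothetical helper; not the official BEHAVIOR evaluator.)
    """
    if not predicates_satisfied:
        return 0.0
    return sum(predicates_satisfied) / len(predicates_satisfied)


# Example: 3 of 4 goal predicates hold when the episode ends.
score = partial_success([True, True, True, False])
print(score)  # 0.75
```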
Training instances and human demonstrations (200 per task) released publicly
20 validation instances provided
Evaluate 5 times per instance, submit scores via Google Form
20 held-out instances for final evaluation
Top-5 solutions evaluated after November 15th, 2025 freeze
Instance Variations: Each instance differs in initial object states and initial robot poses.
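The self-evaluation protocol above (5 runs on each of the 20 validation instances) reduces to a simple mean-of-means. The sketch below shows one way to aggregate per-run scores before self-reporting; the function name, dictionary layout, and score values are illustrative assumptions, not part of the official submission format.

```python
def aggregate(scores_per_instance: dict[str, list[float]]) -> float:
    """Average the 5 runs of each instance, then average across instances.

    (Hypothetical aggregation helper; the official scoring is done by
    the challenge organizers on held-out instances.)
    """
    instance_means = [
        sum(runs) / len(runs) for runs in scores_per_instance.values()
    ]
    return sum(instance_means) / len(instance_means)


# Made-up scores for two validation instances, 5 runs each.
scores = {
    "instance_00": [1.0, 1.0, 0.75, 1.0, 0.5],
    "instance_01": [0.0, 0.25, 0.25, 0.0, 0.25],
}
print(aggregate(scores))  # 0.5
```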
Every Monday and Thursday, 4:30pm-6:00pm PST
Join us over Zoom for support and Q&A
Submit your results and models via our official Google Form:
To self-report your performance:
💡 We encourage submitting intermediate results to be showcased on our leaderboard!
Hardware Requirements:
Note: The same model with different checkpoints from the same team will be considered a single entry.
Model Serving Options: