Foundation Models Meet Embodied Agents

@ 2025 BEHAVIOR Challenge

December 6-7, 2025

San Diego Convention Center, San Diego, California


2025 BEHAVIOR Challenge

Robots in the BEHAVIOR simulator perform everyday activities (like preparing food) in virtual home environments. BEHAVIOR (Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments) is a large-scale embodied AI benchmark with 1,000 defined household tasks grounded in real human needs. These tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications.

Even state-of-the-art robot learning models still struggle with the complexity and extended duration of BEHAVIOR's activities, which is why we are thrilled to announce the 1st BEHAVIOR Challenge at NeurIPS 2025. This competition invites the community to tackle 50 full-length tasks in a realistic simulator, pushing the frontiers of both high-level planning and low-level control in house-scale environments.

Participants will need to make progress on hierarchical planning, robust perception under realistic visual conditions, and reliable manipulation across long-horizon episodes. By focusing on full-length, human-scale household tasks, the challenge aims to surface the practical limitations of current methods and drive advances that matter for real-world robot deployments.

🔗 More information is available on the official 2025 BEHAVIOR Challenge website.

🧩 Challenge Components

📋 Task Definitions

The benchmark includes 1,000 everyday household activities covering diverse behaviors across:

  • Rearrangement - organizing and placing objects
  • Cleaning/Wiping - maintaining cleanliness
  • Cooking/Freezing - food preparation and storage
  • Painting/Spraying - surface treatment tasks
  • Hanging/Installing - mounting and assembly
  • Slicing/Dicing - precise cutting operations
  • Baking - complex cooking procedures
  • Doing Laundry - textile care activities

๐Ÿ  Interactive Environments

50 fully interactive scenes with house-scale layouts

10,000+ richly annotated objects

🎮 OmniGibson Simulator

The simulation environment supports:

  • Rigid body physics - realistic object interactions
  • Deformable objects (cloth, fabric) - soft body dynamics
  • Fluid interactions (water, oils) - liquid simulation
  • Object semantic states (e.g., open, filled, on-top, inside) - rich state representation
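
These semantic states can be queried directly at runtime. Below is a minimal sketch of checking object states through OmniGibson's object-states API; the scene config, object names, and registry call are illustrative assumptions, so defer to the OmniGibson documentation for the exact interface.

```python
# Hedged sketch: querying semantic object states in OmniGibson.
# The config values and object names below are illustrative assumptions.
import omnigibson as og
from omnigibson.object_states import Open, OnTop

cfg = {"scene": {"type": "InteractiveTraversableScene", "scene_model": "Rs_int"}}
env = og.Environment(configs=cfg)

fridge = env.scene.object_registry("name", "fridge")  # hypothetical object name
apple = env.scene.object_registry("name", "apple")    # hypothetical object name

print(fridge.states[Open].get_value())        # is the fridge open?
print(apple.states[OnTop].get_value(fridge))  # is the apple on top of the fridge?
```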

โš–๏ธ Data and Baselines

📚 Dataset

The benchmark includes 10,000 human-demonstrated trajectories with diverse behaviors across all task categories. Each demonstration contains:

  • Synchronized RGBD observations - multi-modal visual data
  • Object and part-level segmentation masks - precise object identification
  • Ground-truth object states - semantic state annotations
  • Robot proprioception - internal sensor data
  • Robot actions - complete action sequences
  • Skill and subtask annotations - hierarchical task decomposition
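
For a concrete picture of what one trajectory might look like on disk, here is a hypothetical sketch of loading a demonstration with h5py; the file name and HDF5 layout are our assumptions, not the challenge's published schema.

```python
# Hypothetical demonstration layout; field names and shapes are assumed.
import h5py

with h5py.File("demo_0001.hdf5", "r") as f:   # hypothetical file name
    rgb = f["obs/rgb"][:]             # (T, H, W, 3) synchronized RGB frames
    depth = f["obs/depth"][:]         # (T, H, W) depth maps
    seg = f["obs/seg_instance"][:]    # (T, H, W) instance segmentation masks
    proprio = f["obs/proprio"][:]     # (T, D) robot proprioception
    actions = f["actions"][:]         # (T, A) robot action sequence
    print(f"episode length: {actions.shape[0]} steps")
```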

🤖 Available Baseline Methods

Participants have access to training and evaluation pipelines for these baseline methods:

  • ACT - Action Chunking Transformer
  • Diffusion Policy - Diffusion-based control
  • BC-RNN - Behavioral cloning with RNNs
  • WB-VIMA - Multimodal imitation learning
  • OpenVLA - Vision-language-action models
  • π0 - Foundation policy models
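
Most of these baselines share a supervised imitation core: regress demonstrated actions from observations. The PyTorch sketch below shows that shared training step in its simplest form; the network and the 128/12 dimensions are toy stand-ins, not the provided pipelines.

```python
# Toy behavioral-cloning update, the common core of the baselines above.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 12))
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

def train_step(obs: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One supervised update: match the demonstrated action for each observation."""
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```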

📊 Evaluation

๐Ÿ Challenge Tracks

We have two tracks for the 2025 BEHAVIOR Challenge:

Standard Track

Participants use only the provided onboard sensor observations:

  • RGB + depth + instance segmentation + proprioception
  • No object state information
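
For concreteness, a standard-track observation at a single timestep might look like the following; the key names and array shapes are illustrative assumptions, not the challenge's exact specification.

```python
# Assumed per-step observation structure for the Standard Track.
import numpy as np

obs = {
    "rgb": np.zeros((480, 640, 3), dtype=np.uint8),        # camera image
    "depth": np.zeros((480, 640), dtype=np.float32),       # depth in meters
    "seg_instance": np.zeros((480, 640), dtype=np.int32),  # instance IDs
    "proprio": np.zeros(32, dtype=np.float32),             # joint/gripper state
}
```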

Privileged Information Track

Participants can query for any privileged information:

  • Target object poses
  • Scene point cloud
  • Any other simulator information

๐Ÿ† Prizes for Each Track

🥇 First Place: $1,000
🥈 Second Place: $500
🥉 Third Place: $300

📈 Performance Metrics

🎯 Primary Metric (Ranking)

Task success rate averaged across 50 tasks.

Partial successes are counted as:
satisfied BDDL goal predicates ÷ total goal predicates

⚡ Secondary Metrics (Efficiency)

  • Simulated time: Total simulation steps
  • Distance navigated: Accumulated distance traveled by agent base
  • Hand displacement: Accumulated displacement of agent hands

* Secondary metrics are normalized using human averages computed from the 200 demonstrations per task
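
Putting the pieces together, here is a minimal sketch of the scoring arithmetic described above; the function names are ours, and the official evaluator may differ in details.

```python
# Sketch of the challenge scoring scheme (names are ours, not official).

def task_score(satisfied_predicates: int, total_goal_predicates: int) -> float:
    """Partial credit: fraction of BDDL goal predicates satisfied (1.0 = success)."""
    return satisfied_predicates / total_goal_predicates

def primary_metric(per_task_scores: list[float]) -> float:
    """Ranking metric: mean task score across the 50 challenge tasks."""
    return sum(per_task_scores) / len(per_task_scores)

def normalized_efficiency(agent_value: float, human_average: float) -> float:
    """Secondary metric (sim time, distance, displacement) vs. human average."""
    return agent_value / human_average
```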

📋 Evaluation Protocol

1๏ธโƒฃ

Training Phase

Training instances and human demonstrations (200 per task) released publicly

2๏ธโƒฃ

Self-Evaluation

20 validation instances provided

Evaluate 5 times per instance and submit scores via the Google Form

3๏ธโƒฃ

Final Evaluation

20 held-out instances for final evaluation

Top-5 solutions evaluated after the November 15, 2025 submission freeze

๐Ÿ“ Instance Variations: Each instance differs in initial object states and initial robot poses

๐Ÿ• Challenge Office Hours

Every Monday and Thursday, 4:30pm-6:00pm PST

Join us over Zoom for support and Q&A

🚀 How to Participate

📋 Submission Details

๐Ÿ“ Submit Your Results

Submit your results and models via our official Google Form:

📊 Self-Evaluation Resources

To self-report your performance:

💡 We encourage submitting intermediate results to be showcased on our leaderboard!

🎯 Final Model Submission and Evaluation

💻 Submitted Models & Compute Specs

Hardware Requirements:

  • ๐Ÿ–ฅ๏ธ Model should run on a single 24GB VRAM GPU
  • โš™๏ธ Final evaluation will use: RTX 3090, A5000, TitanRTX

Note: Different checkpoints of the same model from the same team are counted as a single entry.
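
To sanity-check the memory budget before submitting, one rough approach is to measure peak CUDA memory during a warm-up inference; a minimal PyTorch sketch with a toy stand-in policy is shown below.

```python
# Rough VRAM self-check; replace the toy policy with your own model.
import torch
import torch.nn as nn

policy = nn.Linear(128, 12).cuda()        # stand-in for your policy network
obs = torch.zeros(1, 128, device="cuda")  # stand-in observation batch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = policy(obs)                       # warm-up inference pass
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM during inference: {peak_gb:.2f} GB")
assert peak_gb < 24, "model must fit on a single 24GB GPU"
```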

๐ŸŒ IP Address-Based Evaluation

Model Serving Options:

  • ๐Ÿ”— Serve your models and provide IP addresses for evaluation queries
  • ๐Ÿ“š Recommended serving libraries:
    • TorchServe
    • LitServe
    • vLLM
    • NVIDIA Triton
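
As one concrete option, here is a minimal sketch of exposing a policy with LitServe, one of the libraries recommended above; the request/response schema and the stand-in policy are assumptions, so follow the challenge's actual query format.

```python
# Hedged sketch: serving a policy behind an IP endpoint with LitServe.
import litserve as ls

class PolicyAPI(ls.LitAPI):
    def setup(self, device):
        # Stand-in policy; load your trained model here instead.
        self.policy = lambda obs: [0.0] * 12

    def decode_request(self, request):
        return request["observation"]   # assumed request field

    def predict(self, obs):
        return self.policy(obs)         # map observation -> action

    def encode_response(self, action):
        return {"action": action}       # assumed response field

if __name__ == "__main__":
    server = ls.LitServer(PolicyAPI())
    server.run(port=8000)               # evaluator queries this IP and port
```
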
Contact

Please email behavior-contact@googlegroups.com if you have any questions.