LeRobot documentation

Adding a New Benchmark


This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.

A benchmark in LeRobot is a set of Gymnasium environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard gym.Env interface. The lerobot-eval CLI then runs evaluation uniformly across all benchmarks.

Existing benchmarks at a glance

Before diving in, here is what is already integrated:

| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| --- | --- | --- | --- | --- | --- |
| LIBERO | envs/libero.py | LiberoEnv | 130 across 5 suites | 7 | LiberoProcessorStep |
| Meta-World | envs/metaworld.py | MetaworldEnv | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | IsaaclabArenaEnv | Configurable | Configurable | IsaaclabArenaProcessorStep |

Use src/lerobot/envs/libero.py and src/lerobot/envs/metaworld.py as reference implementations.

How it all fits together

Data flow

During evaluation, data moves through four stages:

1. gym.Env  ──→  raw observations (numpy dicts)

2. Preprocessing  ──→  standard LeRobot keys + task description
   (preprocess_observation in envs/utils.py, env.call("task_description"))

3. Processors  ──→  env-specific then policy-specific transforms
   (env_preprocessor, policy_preprocessor)

4. Policy  ──→  select_action()  ──→  action tensor
   then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()

Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
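
The order of the stages can be sketched with plain functions. Every callable below is a hypothetical stand-in named after the stage it represents, not LeRobot's implementation; only the call order is taken from the description above.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_one_step(
    raw_obs: dict[str, Any],
    preprocess: Step,            # stage 2: raw keys -> standard LeRobot keys
    env_preprocessor: Step,      # stage 3: env-specific transforms
    policy_preprocessor: Step,   # stage 3: policy-specific transforms
    select_action: Step,         # stage 4: policy forward pass
    policy_postprocessor: Step,  # reverse path, policy side first
    env_postprocessor: Step,     # reverse path, env side last
) -> Any:
    """Apply the stages in the documented order and return the action for env.step()."""
    obs = preprocess(raw_obs)
    obs = env_preprocessor(obs)
    obs = policy_preprocessor(obs)
    action = select_action(obs)
    action = policy_postprocessor(action)
    return env_postprocessor(action)


# Toy stages that record the order in which they run.
trace: list[str] = []


def tagged(name: str, result: Any = None) -> Step:
    """Return a pass-through stage that logs its name; `result` overrides the output."""
    def stage(x: Any) -> Any:
        trace.append(name)
        return x if result is None else result
    return stage


action = run_one_step(
    {"agent_pos": [0.0, 0.0]},
    preprocess=tagged("preprocess"),
    env_preprocessor=tagged("env_preprocessor"),
    policy_preprocessor=tagged("policy_preprocessor"),
    select_action=tagged("select_action", result=[0.1, -0.2]),
    policy_postprocessor=tagged("policy_postprocessor"),
    env_postprocessor=tagged("env_postprocessor"),
)
```

After the call, `trace` lists the six stages in the order they were applied.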

Environment structure

make_env() returns a nested dict of vectorized environments:

dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite       ^task_id

A single-task env (e.g. PushT) looks like {"pusht": {0: vec_env}}. A multi-task benchmark (e.g. LIBERO) looks like {"libero_spatial": {0: vec0, 1: vec1, ...}, ...}.
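
A minimal sketch of walking this structure, with strings standing in for real gym.vector.VectorEnv objects. The helper name iter_tasks and the sorted iteration order are assumptions made for illustration; the guide does not specify the order in which the eval loop visits tasks.

```python
from typing import Any, Iterator

# Strings replace real vectorized envs so the sketch runs without a simulator.
envs: dict[str, dict[int, Any]] = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "libero_object": {0: "vec2"},
}


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) triples in a stable, sorted order."""
    for suite in sorted(envs):
        for task_id in sorted(envs[suite]):
            yield suite, task_id, envs[suite][task_id]


flat = list(iter_tasks(envs))
n_tasks = len(flat)
```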

How evaluation runs

All benchmarks are evaluated the same way by lerobot-eval:

  1. make_env() builds the nested {suite: {task_id: VectorEnv}} dict.
  2. eval_policy_all() iterates over every suite and task.
  3. For each task, it runs n_episodes rollouts via rollout().
  4. Results are aggregated hierarchically: episode, task, suite, overall.
  5. Metrics include pc_success (success rate), avg_sum_reward, and avg_max_reward.
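
The hierarchical aggregation can be illustrated with a few made-up episode records. This is a sketch, not the code in eval_policy_all(): it assumes pc_success is expressed as a percentage and that suite-level and overall numbers pool all episodes rather than averaging per-task means.

```python
from collections import defaultdict
from statistics import mean

# One record per episode: (suite, task_id, sum_reward, max_reward, is_success).
episodes = [
    ("suite_a", 0, 1.0, 1.0, True),
    ("suite_a", 0, 0.0, 0.0, False),
    ("suite_a", 1, 2.0, 1.0, True),
    ("suite_b", 0, 0.5, 0.5, False),
]


def summarize(records):
    """Compute the three metrics over a list of episode records."""
    return {
        "pc_success": 100.0 * mean(1.0 if r[4] else 0.0 for r in records),
        "avg_sum_reward": mean(r[2] for r in records),
        "avg_max_reward": mean(r[3] for r in records),
    }


# Group episodes by task and by suite, then summarize each level.
by_task = defaultdict(list)
by_suite = defaultdict(list)
for rec in episodes:
    by_task[(rec[0], rec[1])].append(rec)
    by_suite[rec[0]].append(rec)

per_task = {key: summarize(recs) for key, recs in by_task.items()}
per_suite = {key: summarize(recs) for key, recs in by_suite.items()}
overall = summarize(episodes)
```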

The critical piece: your env must return info["is_success"] on every step() call. This is how the eval loop knows whether a task was completed.

What your environment must provide

LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

Env attributes

Your gym.Env must set these attributes:

| Attribute | Type | Why |
| --- | --- | --- |
| _max_episode_steps | int | rollout() uses this to cap episode length |
| task_description | str | Passed to VLA policies as a language instruction |
| task | str | Fallback identifier if task_description is not set |

Success reporting

Your step() and reset() must include "is_success" in the info dict:

info = {"is_success": True}   # or False
return observation, reward, terminated, truncated, info
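
To make the contract concrete, here is a toy environment and a capped rollout loop. ToyReachEnv is a duck-typed stand-in invented for this sketch (a real wrapper would subclass gym.Env), and the rollout function is a simplification, not LeRobot's rollout().

```python
class ToyReachEnv:
    """Counter that must reach a goal value; mimics the gym.Env reset/step signatures."""

    def __init__(self, goal: int = 3, max_steps: int = 10):
        self.goal = goal
        self._max_episode_steps = max_steps
        self.task = "toy_reach"
        self.task_description = "move the counter to the goal"
        self._pos = 0

    def reset(self, seed=None, **kwargs):
        self._pos = 0
        return {"agent_pos": [float(self._pos)]}, {"is_success": False}

    def step(self, action: int):
        self._pos += action
        success = self._pos >= self.goal
        obs = {"agent_pos": [float(self._pos)]}
        reward = 1.0 if success else 0.0
        # The episode terminates on success; the step cap is enforced by the caller.
        return obs, reward, success, False, {"is_success": success}


def rollout(env, policy):
    """Run one episode capped at env._max_episode_steps; return (success, n_steps)."""
    obs, info = env.reset()
    for t in range(env._max_episode_steps):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        if terminated or truncated:
            return bool(info["is_success"]), t + 1
    return bool(info["is_success"]), env._max_episode_steps


solved = rollout(ToyReachEnv(), policy=lambda obs: 1)  # always moves forward
stuck = rollout(ToyReachEnv(), policy=lambda obs: 0)   # never moves
```

Because every step reports "is_success", the loop can read the flag from the last info dict whether the episode ended early or hit the step cap.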

Observations

The simplest approach is to map your simulator’s outputs to the standard keys that preprocess_observation() already understands. Do this inside your gym.Env (e.g. in a _format_raw_obs() helper):

| Your env should output | LeRobot maps it to | What it is |
| --- | --- | --- |
| "pixels" (single array) | observation.image | Single camera image, HWC uint8 |
| "pixels" (dict) | observation.images.<cam> | Multiple cameras, each HWC uint8 |
| "agent_pos" | observation.state | Proprioceptive state vector |
| "environment_state" | observation.env_state | Full environment state (e.g. PushT) |
| "robot_state" | observation.robot_state | Nested robot state dict (e.g. LIBERO) |

If your simulator uses different key names, you have two options:

  1. Recommended: Rename them to the standard keys inside your gym.Env wrapper.
  2. Alternative: Write an env processor to transform observations after preprocess_observation() runs (see step 3 below).
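
A minimal sketch of the recommended option, written as a standalone function for brevity. The simulator key names (cam_front_rgb, cam_wrist_rgb, joint_positions, contact_forces) are hypothetical, and the string image values are placeholders for HWC uint8 arrays.

```python
from typing import Any

# Hypothetical simulator camera keys -> camera names used under "pixels".
CAMERA_KEYS = {"cam_front_rgb": "front", "cam_wrist_rgb": "wrist"}
# Hypothetical simulator key holding the proprioceptive state.
STATE_KEY = "joint_positions"


def format_raw_obs(raw: dict[str, Any]) -> dict[str, Any]:
    """Re-key simulator output to the standard "pixels" / "agent_pos" layout."""
    pixels = {name: raw[key] for key, name in CAMERA_KEYS.items() if key in raw}
    # Keys that are not mapped (e.g. debug channels) are dropped on purpose.
    return {"pixels": pixels, "agent_pos": raw[STATE_KEY]}


obs = format_raw_obs({
    "cam_front_rgb": "front-image",
    "cam_wrist_rgb": "wrist-image",
    "joint_positions": [0.1, 0.2, 0.3],
    "contact_forces": [0.0],
})
```

Returning "pixels" as a dict corresponds to the multi-camera row of the table, so each camera would end up under observation.images.<cam>.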

Actions

Actions are continuous numpy arrays in a gym.spaces.Box. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their input_features / output_features config.

Feature declaration

Each EnvConfig subclass declares two dicts that tell the policy what to expect:

  • features — maps feature names to PolicyFeature(type, shape) (e.g. action dim, image shape).
  • features_map — maps raw observation keys to LeRobot convention keys (e.g. "agent_pos" to "observation.state").
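
The relationship between the two dicts can be sketched as a simple re-keying. Feature is a hypothetical stand-in for PolicyFeature, the shapes are example values, and the real conversion in LeRobot may do more than this (for instance, adjust image shapes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Stand-in for PolicyFeature(type, shape); not the LeRobot class."""
    type: str
    shape: tuple[int, ...]


# Declared under raw observation keys, as an EnvConfig would do.
features = {
    "action": Feature("ACTION", (4,)),
    "agent_pos": Feature("STATE", (4,)),
    "pixels": Feature("VISUAL", (480, 480, 3)),
}
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}

# Every declared feature needs a mapping entry, otherwise the lookup below fails.
missing = set(features) - set(features_map)

# Re-key each declared feature under its LeRobot convention name.
policy_features = {features_map[raw_key]: ft for raw_key, ft in features.items()}
```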

Step by step

At minimum, you need two files: a gym.Env wrapper and an EnvConfig subclass with a create_envs() override. The remaining checklist entries cover dependencies, documentation, and optional hooks. No changes to factory.py are needed.

Checklist

| File | Required | Why |
| --- | --- | --- |
| src/lerobot/envs/<benchmark>.py | Yes | Wraps the simulator as a standard gym.Env |
| src/lerobot/envs/configs.py | Yes | Registers your benchmark and its create_envs() for the CLI |
| src/lerobot/processor/env_processor.py | Optional | Custom observation/action transforms |
| src/lerobot/envs/utils.py | Optional | Only if you need new raw observation keys |
| pyproject.toml | Yes | Declares benchmark-specific dependencies |
| docs/source/<benchmark>.mdx | Yes | User-facing documentation page |
| docs/source/_toctree.yml | Yes | Adds your page to the docs sidebar |

1. The gym.Env wrapper (src/lerobot/envs/<benchmark>.py)

Create a gym.Env subclass that wraps the third-party simulator:

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": <fps>}

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task = <task_name_string>
        self.task_description = <natural_language_instruction>
        self._max_episode_steps = <max_steps>
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        ...  # return (observation, info) — info must contain {"is_success": False}

    def step(self, action: np.ndarray):
        ...  # return (obs, reward, terminated, truncated, info) — info must contain {"is_success": <bool>}

    def render(self):
        ...  # return RGB image as numpy array

    def close(self):
        ...

GPU-based simulators (e.g. MuJoCo with EGL rendering): If your simulator allocates GPU/EGL contexts during __init__, defer that allocation to a _ensure_env() helper called on first reset()/step(). This avoids inheriting stale GPU handles when AsyncVectorEnv spawns worker processes. See LiberoEnv._ensure_env() for the pattern.
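
The deferred-allocation idea can be sketched without a real simulator. FakeSimulator and LazyEnv are hypothetical classes written for this illustration and are not taken from LiberoEnv; a real wrapper would subclass gym.Env and create the actual simulator inside _ensure_env().

```python
import os


class FakeSimulator:
    """Stand-in for a simulator that grabs GPU/EGL resources when constructed."""

    def __init__(self):
        self.owner_pid = os.getpid()  # records which process created the handle

    def reset(self):
        return {"agent_pos": [0.0]}


class LazyEnv:
    """Duck-typed sketch of a wrapper that allocates the simulator on first use."""

    def __init__(self, task_id: int):
        self.task_id = task_id
        self._env = None  # nothing heavy is allocated in the constructor

    def _ensure_env(self):
        # Runs in whichever process first calls reset()/step(), e.g. a worker process.
        if self._env is None:
            self._env = FakeSimulator()
        return self._env

    def reset(self, seed=None, **kwargs):
        sim = self._ensure_env()
        return sim.reset(), {"is_success": False}

    def close(self):
        self._env = None


env = LazyEnv(task_id=0)
allocated_before_reset = env._env is not None
obs, info = env.reset()
allocated_after_reset = env._env is not None
```

Constructing LazyEnv is cheap and holds no simulator handle, so nothing GPU-related exists yet when the object is handed to a worker.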

Also provide a factory function that returns the nested dict structure:

from typing import Any


def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...

See create_libero_envs() (multi-suite, multi-task) and create_metaworld_envs() (difficulty-grouped tasks) for reference.
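
As a rough illustration of the return shape only, the sketch below builds the nested dict from a made-up suite table. The SUITES registry, the comma-separated task string, and the make_vec_env callback are all assumptions for this example and do not reflect how create_libero_envs() or create_metaworld_envs() are written.

```python
from typing import Any, Callable

# Hypothetical registry: suite name -> number of tasks in that suite.
SUITES = {"suite_a": 2, "suite_b": 1}


def create_toy_envs(
    task: str,
    n_envs: int,
    make_vec_env: Callable[[str, int, int], Any],
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vec_env}} for a comma-separated list of suites."""
    out: dict[str, dict[int, Any]] = {}
    for suite in (s.strip() for s in task.split(",")):
        if suite not in SUITES:
            raise ValueError(f"Unknown suite {suite!r}; expected one of {sorted(SUITES)}")
        out[suite] = {
            task_id: make_vec_env(suite, task_id, n_envs)
            for task_id in range(SUITES[suite])
        }
    return out


# A tuple stands in for each vectorized env so the sketch runs without a simulator.
envs = create_toy_envs(
    "suite_a, suite_b",
    n_envs=2,
    make_vec_env=lambda suite, task_id, n: (suite, task_id, n),
)
```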

2. The config (src/lerobot/envs/configs.py)

Register a config dataclass so users can select your benchmark with --env.type=<name>. Each config owns its environment creation and processor logic via two methods:

  • create_envs(n_envs, use_async_envs) — Returns {suite: {task_id: VectorEnv}}. The base class default uses gym.make() for single-task envs. Multi-task benchmarks override this.
  • get_env_processors() — Returns (preprocessor, postprocessor). The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

@EnvConfig.register_subclass("<benchmark_name>")
@dataclass
class MyBenchmarkEnvConfig(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = True):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs.<benchmark> import create_<benchmark>_envs
        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )

Key points:

  • The register_subclass name is what users pass on the CLI (--env.type=<name>).
  • features tells the policy what the environment produces.
  • features_map maps raw observation keys to LeRobot convention keys.
  • No changes to factory.py needed — the factory delegates to cfg.create_envs() and cfg.get_env_processors() automatically.

3. Env processor (optional: src/lerobot/processor/env_processor.py)

Only needed if your benchmark requires observation transforms beyond what preprocess_observation() handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from get_env_processors() in your config (see step 2):

@dataclass
@ProcessorStepRegistry.register(name="<benchmark>_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # your transforms here
        return processed

    def transform_features(self, features):
        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)

See LiberoProcessorStep for a full example (image rotation, quaternion-to-axis-angle conversion).
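
For readers unfamiliar with the quaternion-to-axis-angle conversion mentioned above, here is the standard formula in plain Python. It assumes a unit quaternion in (x, y, z, w) order and returns a rotation vector (axis scaled by angle in radians); the component order and output convention used by LiberoProcessorStep may differ.

```python
import math


def quat_to_axis_angle(x: float, y: float, z: float, w: float) -> tuple[float, float, float]:
    """Convert a unit quaternion (x, y, z, w) to a rotation vector (axis * angle)."""
    norm_v = math.sqrt(x * x + y * y + z * z)
    if norm_v < 1e-12:
        return (0.0, 0.0, 0.0)  # identity rotation: the axis is undefined
    angle = 2.0 * math.atan2(norm_v, w)
    scale = angle / norm_v
    return (x * scale, y * scale, z * scale)


# 90 degrees about the z axis: q = (0, 0, sin(pi/4), cos(pi/4)).
rotvec = quat_to_axis_angle(0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
identity = quat_to_axis_angle(0.0, 0.0, 0.0, 1.0)
```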

4. Dependencies (pyproject.toml)

Add a new optional-dependency group:

mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]

Pinning rules:

  • Always pin benchmark packages to exact versions for reproducibility (e.g. metaworld==3.0.0).
  • Add platform markers when needed (e.g. ; sys_platform == 'linux').
  • Pin fragile transitive deps if known (e.g. gymnasium==1.1.0 for Meta-World).
  • Document constraints in your benchmark doc page.
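
If you want a quick check that a dependency group follows the exact-pin rule, a small script along these lines can help. It is a rough sketch using a hand-written regular expression rather than a full requirement parser, and the unpinned helper and its allow parameter are names invented for this example.

```python
import re

# name, optional [extras], "==", a version, then an optional "; marker" suffix.
_EXACT_PIN = re.compile(r"^[A-Za-z0-9_.\-]+(\[[^\]]+\])?==[^,;\s]+(\s*;.*)?$")


def unpinned(requirements, allow=()):
    """Return the requirement strings that are neither exactly pinned nor allow-listed."""
    return [
        req for req in requirements
        if not _EXACT_PIN.match(req.strip()) and req not in allow
    ]


group = [
    "my-benchmark-pkg==1.2.3",
    "metaworld==3.0.0 ; sys_platform == 'linux'",
    "lerobot[scipy-dep]",   # self-referencing extra, deliberately unpinned
    "gymnasium>=1.0",       # range, not an exact pin
]

offenders = unpinned(group, allow=("lerobot[scipy-dep]",))
```

Here only the gymnasium entry ends up in offenders, since it specifies a range instead of an exact version.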

Users install with:

pip install -e ".[mybenchmark]"

5. Documentation (docs/source/<benchmark>.mdx)

Write a user-facing page following the template in the next section. See docs/source/libero.mdx and docs/source/metaworld.mdx for full examples.

6. Table of contents (docs/source/_toctree.yml)

Add your benchmark to the “Benchmarks” section:

- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <your_benchmark>
      title: <Your Benchmark Name>
  title: "Benchmarks"

Verifying your integration

After completing the steps above, confirm that everything works:

  1. Install with pip install -e ".[mybenchmark]" and verify that the dependency group installs cleanly.
  2. Smoke test env creation — call make_env() with your config in Python, check that the returned dict has the expected {suite: {task_id: VectorEnv}} shape, and that reset() returns observations with the right keys.
  3. Run a full eval with lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<any_compatible_policy> to exercise the full pipeline end-to-end. (batch_size defaults to auto-tuning based on CPU cores; pass --eval.batch_size=1 to force a single environment.)
  4. Check success detection — verify that info["is_success"] flips to True when the task is actually completed. This is what the eval loop uses to compute success rates.
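
The shape check from the smoke test can be automated with a small helper. check_env_dict and StubVecEnv are hypothetical names for this sketch; the helper only checks the dict nesting and that each leaf exposes reset(), step() and close(), which is a weaker test than isinstance against gym.vector.VectorEnv.

```python
def check_env_dict(envs) -> list[str]:
    """Return a list of problems with the nested dict shape; empty means it looks right."""
    if not isinstance(envs, dict) or not envs:
        return ["top level must be a non-empty dict keyed by suite name"]
    problems = []
    for suite, tasks in envs.items():
        if not isinstance(suite, str):
            problems.append(f"suite key {suite!r} is not a str")
        if not isinstance(tasks, dict) or not tasks:
            problems.append(f"suite {suite!r} must map to a non-empty dict")
            continue
        for task_id, vec_env in tasks.items():
            if not isinstance(task_id, int):
                problems.append(f"task id {task_id!r} in {suite!r} is not an int")
            for method in ("reset", "step", "close"):
                if not callable(getattr(vec_env, method, None)):
                    problems.append(f"{suite}/{task_id} has no {method}()")
    return problems


class StubVecEnv:
    """Minimal object exposing the three methods the check looks for."""

    def reset(self): ...
    def step(self, actions): ...
    def close(self): ...


good = check_env_dict({"suite_a": {0: StubVecEnv(), 1: StubVecEnv()}})
bad = check_env_dict({"suite_a": {"0": object()}})
```

In a real check you would pass the dict returned by make_env() instead of the stubs.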

Writing a benchmark doc page

Each benchmark .mdx page should include:

  • Title and description — 1-2 paragraphs on what the benchmark tests and why it matters.
  • Links — paper, GitHub repo, project website (if available).
  • Overview image or GIF.
  • Available tasks — table of task suites with counts and brief descriptions.
  • Installation: pip install -e ".[<benchmark>]" plus any extra steps (env vars, system packages).
  • Evaluation — recommended lerobot-eval command with n_episodes for reproducible results. batch_size defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
  • Policy inputs and outputs — observation keys with shapes, action space description.
  • Recommended evaluation episodes — how many episodes per task is standard.
  • Training — example lerobot-train command.
  • Reproducing published results — link to pretrained model, eval command, results table (if available).

See docs/source/libero.mdx and docs/source/metaworld.mdx for complete examples.
