LeRobot documentation

Using Dataset Tools

This guide covers the dataset tools available in LeRobot for modifying and editing existing datasets.

Overview

LeRobot provides several utilities for manipulating datasets:

  1. Delete Episodes - Remove specific episodes from a dataset
  2. Split Dataset - Divide a dataset into multiple smaller datasets
  3. Merge Datasets - Combine multiple datasets into one. The datasets must have identical features, and episodes are concatenated in the order specified in repo_ids
  4. Add Features - Add new features to a dataset
  5. Remove Features - Remove features from a dataset

The core implementation is in lerobot.datasets.dataset_tools. An example script detailing how to use the tools API is available in examples/dataset/use_dataset_tools.py.

Command-Line Tool: lerobot-edit-dataset

lerobot-edit-dataset is a command-line script for editing datasets. It can delete episodes, split datasets, merge datasets, and remove features.

Run lerobot-edit-dataset --help for more information on the configuration of each operation.

Usage Examples

Delete Episodes

Remove specific episodes from a dataset. This is useful for filtering out undesired data.

# Delete episodes 0, 2, and 5 (modifies original dataset)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]"

# Delete episodes and save to a new dataset (preserves original dataset)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --new_repo_id lerobot/pusht_after_deletion \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]"
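
When the same deletion has to be triggered from a script, the command above can be assembled programmatically. The following is a minimal sketch and not part of LeRobot: build_delete_command is a hypothetical helper that uses only the flags shown above and returns an argument list suitable for subprocess.run.

```python
import shlex


def build_delete_command(repo_id, episode_indices, new_repo_id=None):
    """Assemble lerobot-edit-dataset arguments for a delete_episodes run.

    Hypothetical helper for illustration; it only uses flags shown in this guide.
    """
    if not episode_indices:
        raise ValueError("episode_indices must not be empty")
    # Sort and de-duplicate so the generated command is deterministic.
    indices = sorted(set(episode_indices))
    cmd = ["lerobot-edit-dataset", "--repo_id", repo_id]
    if new_repo_id is not None:
        # Writing to a new repo id preserves the original dataset.
        cmd += ["--new_repo_id", new_repo_id]
    cmd += [
        "--operation.type", "delete_episodes",
        "--operation.episode_indices", str(indices),
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_delete_command(
        "lerobot/pusht", [5, 0, 2], new_repo_id="lerobot/pusht_after_deletion"
    )
    # Print a shell-quoted version for inspection before running anything.
    print(shlex.join(cmd))
    # To execute it (requires LeRobot to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell quoting problems with the bracketed index list.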

Split Dataset

Divide a dataset into multiple subsets.

# Split by fractions (e.g. 80% train, 10% test, 10% val)
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type split \
    --operation.splits '{"train": 0.8, "test": 0.1, "val": 0.1}'

# Split by specific episode indices
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type split \
    --operation.splits '{"task1": [0, 1, 2, 3], "task2": [4, 5]}'

Split names are not constrained and can be chosen freely. Each resulting dataset is saved under the original repo id with the split name appended, e.g. lerobot/pusht_train, lerobot/pusht_task1, lerobot/pusht_task2.
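
Before running a split, the specification can be sanity-checked in a few lines of Python. This is a sketch under stated assumptions rather than LeRobot's own validation: it assumes that fractional splits should sum to at most 1, that index-based splits should not share episodes, and that the two styles are not mixed. Both function names are hypothetical; the output names follow the repo id plus split name convention described above.

```python
import math


def validate_splits(splits):
    """Check a split spec given as fractions or as lists of episode indices.

    Assumptions (not confirmed by LeRobot): fractions are positive and sum to
    at most 1.0, index lists do not overlap, and the two styles are not mixed.
    """
    if not splits:
        raise ValueError("at least one split is required")
    values = list(splits.values())
    if all(isinstance(v, float) for v in values):
        total = math.fsum(values)
        # Small tolerance guards against floating-point rounding.
        if any(v <= 0 for v in values) or total > 1.0 + 1e-9:
            raise ValueError(
                f"fractions must be positive and sum to at most 1, got {total}"
            )
    elif all(isinstance(v, list) for v in values):
        seen = set()
        for name, indices in splits.items():
            overlap = seen.intersection(indices)
            if overlap:
                raise ValueError(f"split {name!r} reuses episodes {sorted(overlap)}")
            seen.update(indices)
    else:
        raise ValueError("use either fractions or episode index lists, not both")


def split_repo_ids(repo_id, splits):
    """Return the repo id each split is saved under: <repo_id>_<split_name>."""
    return {name: f"{repo_id}_{name}" for name in splits}


if __name__ == "__main__":
    spec = {"train": 0.8, "val": 0.2}
    validate_splits(spec)
    print(split_repo_ids("lerobot/pusht", spec))
```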

Merge Datasets

Combine multiple datasets into a single dataset.

# Merge train and validation splits back into one dataset
lerobot-edit-dataset \
    --repo_id lerobot/pusht_merged \
    --operation.type merge \
    --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']"
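
Because merging requires the datasets to have identical features, it can help to compare feature definitions before invoking the tool. The sketch below illustrates such a check on plain dictionaries. The feature dictionaries and the function name are hypothetical stand-ins, not LeRobot's metadata format or API.

```python
_MISSING = object()  # sentinel for a feature absent from one dataset


def find_feature_mismatches(features_by_repo):
    """Compare every dataset's features against the first one listed.

    `features_by_repo` maps repo id -> {feature name: description}.
    Returns repo id -> sorted list of feature names that differ.
    """
    repo_ids = list(features_by_repo)
    reference = features_by_repo[repo_ids[0]]
    mismatches = {}
    for repo_id in repo_ids[1:]:
        features = features_by_repo[repo_id]
        differing = {
            name
            for name in reference.keys() | features.keys()
            if reference.get(name, _MISSING) != features.get(name, _MISSING)
        }
        if differing:
            mismatches[repo_id] = sorted(differing)
    return mismatches


if __name__ == "__main__":
    # Illustrative feature descriptions, not real dataset metadata.
    train = {
        "action": {"dtype": "float32", "shape": (2,)},
        "observation.state": {"dtype": "float32", "shape": (2,)},
    }
    val = {"action": {"dtype": "float32", "shape": (2,)}}
    print(find_feature_mismatches({"lerobot/pusht_train": train, "lerobot/pusht_val": val}))
```

An empty result means the listed datasets agree on every feature in this simplified comparison.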

Remove Features

Remove features from a dataset.

# Remove a camera feature
lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --operation.type remove_feature \
    --operation.feature_names "['observation.images.top']"

Push to Hub

Add the --push_to_hub flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:

lerobot-edit-dataset \
    --repo_id lerobot/pusht \
    --new_repo_id lerobot/pusht_after_deletion \
    --operation.type delete_episodes \
    --operation.episode_indices "[0, 2, 5]" \
    --push_to_hub

Adding features is not yet supported by lerobot-edit-dataset. A separate tool for it is available in lerobot.datasets.dataset_tools.
