ACT (Action Chunking with Transformers)
ACT is a lightweight and efficient policy for imitation learning, especially well-suited for fine-grained manipulation tasks. It’s the first model we recommend when you’re starting out with LeRobot due to its fast training time, low computational requirements, and strong performance.
Watch this tutorial from the LeRobot team to learn how ACT works: LeRobot ACT Tutorial
Model Overview
Action Chunking with Transformers (ACT) was introduced in the paper Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware by Zhao et al. The policy was designed to enable precise, contact-rich manipulation tasks using affordable hardware and minimal demonstration data.
Why ACT is Great for Beginners
ACT stands out as an excellent starting point for several reasons:
- Fast Training: Trains in a few hours on a single GPU
- Lightweight: Only ~80M parameters, making it efficient and easy to work with
- Data Efficient: Often achieves high success rates with just 50 demonstrations
Architecture
ACT uses a transformer-based architecture with three main components:
- Vision Backbone: ResNet-18 processes images from multiple camera viewpoints
- Transformer Encoder: Synthesizes information from camera features, joint positions, and a learned latent variable
- Transformer Decoder: Generates coherent action sequences using cross-attention
The policy takes as input:
- Multiple RGB images (e.g., from wrist cameras, front/top cameras)
- Current robot joint positions
- A latent style variable z (learned during training, set to zero during inference)
It outputs a chunk of k future actions.
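To make the data flow concrete, here is a minimal PyTorch sketch of how the three components fit together. It is illustrative only: the layer counts and dimensions follow the paper's defaults, but the single camera, the omitted positional embeddings, and the omitted CVAE encoder that infers z during training are all simplifications, and none of the names below come from the LeRobot source.

import torch
import torch.nn as nn
import torchvision

class ACTSketch(nn.Module):
    """Illustrative wiring of ACT's components; not the LeRobot implementation."""

    def __init__(self, state_dim=14, action_dim=14, latent_dim=32,
                 hidden_dim=512, chunk_size=100):
        super().__init__()
        # 1. Vision backbone: ResNet-18 with the classification head removed,
        #    so it yields a spatial feature map instead of class logits.
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.img_proj = nn.Conv2d(512, hidden_dim, kernel_size=1)
        # Tokens for the robot's joint positions and the latent style variable z.
        self.state_proj = nn.Linear(state_dim, hidden_dim)
        self.latent_proj = nn.Linear(latent_dim, hidden_dim)
        # 2. + 3. Transformer encoder/decoder (4 encoder and 7 decoder layers,
        #    as in the paper). Positional embeddings are omitted for brevity.
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=8,
            num_encoder_layers=4, num_decoder_layers=7, batch_first=True)
        # One learned query per action in the chunk; the decoder cross-attends
        # from these queries into the encoded observation tokens.
        self.action_queries = nn.Parameter(torch.randn(chunk_size, hidden_dim))
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, image, joint_pos, z):
        # image: (B, 3, H, W); a real policy concatenates features from all cameras.
        feats = self.img_proj(self.backbone(image))          # (B, D, h, w)
        feats = feats.flatten(2).transpose(1, 2)             # (B, h*w, D)
        state_tok = self.state_proj(joint_pos).unsqueeze(1)  # (B, 1, D)
        latent_tok = self.latent_proj(z).unsqueeze(1)        # (B, 1, D)
        tokens = torch.cat([latent_tok, state_tok, feats], dim=1)
        queries = self.action_queries.expand(image.shape[0], -1, -1)
        decoded = self.transformer(tokens, queries)          # (B, chunk_size, D)
        return self.action_head(decoded)                     # (B, chunk_size, action_dim)

A single forward pass yields the whole chunk; at control time the policy executes these k actions (or a temporally smoothed version of them) before querying the network again, which is what makes inference cheap.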
Installation Requirements
- Install LeRobot by following our Installation Guide.
- ACT is included in the base LeRobot installation, so no additional dependencies are needed!
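For a quick start, the base package can typically be installed straight from PyPI (the Installation Guide covers source installs and optional extras):

pip install lerobot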
Training ACT
ACT works seamlessly with the standard LeRobot training pipeline. Here’s a complete example for training ACT on your dataset:
lerobot-train \
--dataset.repo_id=${HF_USER}/your_dataset \
--policy.type=act \
--output_dir=outputs/train/act_your_dataset \
--job_name=act_your_dataset \
--policy.device=cuda \
--wandb.enable=true \
--policy.repo_id=${HF_USER}/act_policy
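If a run is interrupted, it can be resumed from the last checkpoint. The checkpoint path below assumes the default layout under output_dir; adjust it to match your run:

lerobot-train \
--config_path=outputs/train/act_your_dataset/checkpoints/last/pretrained_model/train_config.json \
--resume=true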
Training Tips
- Start with defaults: ACT’s default hyperparameters work well for most tasks
- Training duration: Expect a few hours for 100k training steps on a single GPU
- Batch size: Start with batch size 8 and adjust based on your GPU memory (see the override example after this list)
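For example, the step count and batch size from the tips above can be overridden directly on the command line (flag names follow the standard lerobot-train configuration; run lerobot-train --help to confirm them for your version):

lerobot-train \
--dataset.repo_id=${HF_USER}/your_dataset \
--policy.type=act \
--output_dir=outputs/train/act_your_dataset \
--policy.device=cuda \
--batch_size=8 \
--steps=100000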
Train using Google Colab
If your local computer doesn’t have a powerful GPU, you can use Google Colab to train your model by following the ACT training notebook.
Evaluating ACT
Once training is complete, you can evaluate your ACT policy by passing it to the lerobot-record command. This will run inference and record evaluation episodes:
lerobot-record \
--robot.type=so100_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_robot \
--robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--display_data=true \
--dataset.repo_id=${HF_USER}/eval_act_your_dataset \
--dataset.num_episodes=10 \
--dataset.single_task="Your task description" \
--policy.path=${HF_USER}/act_policy
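If you prefer to run inference programmatically, a trained checkpoint can also be loaded directly in Python. This is a minimal sketch: the import path changes across LeRobot versions, and the observation keys and shapes below are placeholders that must match your dataset's features.

import torch
from lerobot.policies.act.modeling_act import ACTPolicy  # path varies by LeRobot version

policy = ACTPolicy.from_pretrained("your-hf-user/act_policy")
policy.eval()

# Build one observation; keys and shapes are placeholders for your robot's features.
batch = {
    "observation.images.front": torch.rand(1, 3, 480, 640),  # dummy camera frame in [0, 1]
    "observation.state": torch.zeros(1, 6),                  # dummy joint positions
}
with torch.no_grad():
    # select_action returns one action at a time, popping from the current
    # chunk and predicting a new chunk when it is exhausted.
    action = policy.select_action(batch)
print(action.shape)  # (1, action_dim)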