ACT — ALOHA Single-Arm (Left) — Mask REMOVAL via Reversed Data — 40k steps

Action Chunking Transformer (ACT) policy for mask removal, trained on a synthetic dataset derived by time-reversing the placement dataset. Each placement episode, played in reverse, becomes a removal episode (gripper opens → closes, mask on face → in arm).
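The reversal idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's conversion script: the helper name `reverse_episode`, the array layout, and the direction of the one-step action shift are all assumptions. It assumes each recorded action is the commanded joint position at roughly the same frame as its state, so that after reversal the target for reversed step k is the action one index later in the reversed sequence.

```python
import numpy as np


def reverse_episode(states, actions, shift=1):
    """Time-reverse one episode (hypothetical helper, not the project's script).

    states, actions: frame-aligned arrays of shape (T, dim).
    shift: number of steps to advance the reversed actions (assumed to be 1).
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    if not 0 <= shift < len(actions):
        raise ValueError("shift must satisfy 0 <= shift < episode length")

    # Play both streams backwards.
    rev_states = states[::-1].copy()
    rev_actions = actions[::-1].copy()

    # Advance the actions so that step k targets the pose of step k + shift.
    # The last `shift` steps have no successor, so the final action is repeated.
    if shift > 0:
        pad = np.repeat(rev_actions[-1:], shift, axis=0)
        rev_actions = np.concatenate([rev_actions[shift:], pad], axis=0)
    return rev_states, rev_actions


# Toy episode: 5 frames, 1-D state, action = state + 10 so the two are easy to tell apart.
toy_states = np.arange(5, dtype=float).reshape(-1, 1)
toy_actions = toy_states + 10.0
rev_s, rev_a = reverse_episode(toy_states, toy_actions)
```

Camera frames would be reversed the same way as the states. Whether a shift of one step is the right alignment depends on how the original actions were logged relative to the observations.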

This is the 40k-step retrain (S006), matching S003's step count for a direct comparison against the shipped placement baseline. The 13.4k-step removal baseline (S005) lives at JHeisler/aloha_solo_left_act_removal_reversed_13k.

Training Config

Field	Value
Architecture	ACT (ResNet18 backbone + 4-layer Transformer encoder + VAE chunking head)
Dataset	JHeisler/aloha_solo_left_4_6_26_reversed — 50 ep, 29,735 samples, 30 fps, time-reversed with 1-step action shift
State / action dim	9 / 9
Cameras	`cam_high`, `cam_left_wrist` (3×480×640 each)
Steps	40,000
Batch size	48
Learning rate	6e-5 (linear warmup 500 → cosine)
Total samples seen	~1.92M (~64.6 epochs over the dataset)
AMP	enabled
torch.compile	enabled
Save freq	every 10,000 steps (10k / 20k / 30k / 40k checkpoints)
Final loss	0.016–0.020
Final grad norm	0.23–0.32
Wall clock	~6h 10min on RTX A4500 (matches placement S003's ~6h 7min)
LeRobot pin	`96c7052777aca85d4e55dfba8f81586103ba8f61`
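The sample and epoch counts in the table follow directly from the step count, batch size, and dataset size, and the learning-rate row describes a warmup-then-cosine schedule. The sketch below reproduces the arithmetic and one plausible form of that schedule. The function name `lr_at`, a warmup that starts from zero, and a cosine decay that reaches zero exactly at the final step are assumptions; the training run may have used a different floor or horizon.

```python
import math

# Values taken from the training config table.
STEPS = 40_000
BATCH_SIZE = 48
DATASET_SAMPLES = 29_735
PEAK_LR = 6e-5
WARMUP_STEPS = 500

samples_seen = STEPS * BATCH_SIZE        # 1,920,000 samples
epochs = samples_seen / DATASET_SAMPLES  # about 64.6 passes over the dataset


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay (assumed to end at 0 at STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # clamp in case step exceeds STEPS
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 500 and falls to half its peak midway between the end of warmup and step 40,000.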

Project Lineage

Workstream	Task	Steps	Final loss	HF repo (short name)
S001	placement	13,400	0.029	act_left
S005	removal (reversed)	13,400	0.035	act_removal_reversed_13k
S003	placement (shipped)	40,000	0.015	act_left_40k
S006	removal (reversed)	40,000	0.018	this repo

S003 vs S006 is the direct comparison: same architecture, same step count, placement dataset vs reversed-placement dataset. Final losses differ by only 0.003 (0.015 vs 0.018), suggesting the reversed-data policy fits the per-timestep imitation objective about as well as the forward-data policy. Training loss alone does not establish task performance; a real verdict requires offline action-L1 evaluation on held-out data or robot rollouts.
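An offline action-L1 evaluation of the kind mentioned above could look roughly like the following. This is a sketch under stated assumptions, not an evaluation that was run for this model: the helper names `holdout_episodes` and `action_l1` are hypothetical, and predictions are assumed to be already collected into an array aligned with the ground-truth actions. Splitting by episode rather than by frame avoids leaking near-duplicate neighbouring frames into the held-out set.

```python
import numpy as np


def holdout_episodes(num_episodes, num_holdout, seed=0):
    """Split episode indices into (train, held_out) lists (hypothetical helper)."""
    if not 0 < num_holdout < num_episodes:
        raise ValueError("num_holdout must be between 1 and num_episodes - 1")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_episodes)
    held_out = sorted(perm[:num_holdout].tolist())
    train = sorted(perm[num_holdout:].tolist())
    return train, held_out


def action_l1(pred, target):
    """Mean absolute action error, overall and per action dimension.

    pred, target: arrays of shape (..., action_dim) with identical shapes.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    abs_err = np.abs(pred - target).reshape(-1, pred.shape[-1])
    return float(abs_err.mean()), abs_err.mean(axis=0)


# Example with the 50-episode dataset size and 9-dim actions from the config table.
train_eps, val_eps = holdout_episodes(num_episodes=50, num_holdout=5)
toy_pred = np.zeros((4, 9))
toy_target = np.zeros((4, 9))
toy_target[:, 0] = 1.0  # error only in the first action dimension
overall, per_dim = action_l1(toy_pred, toy_target)
```

The per-dimension breakdown is useful here because the gripper dimension may behave differently from the arm joints under time reversal. A meaningful held-out score would require either retraining without the held-out episodes or evaluating on episodes that were never part of the training set.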

Caveats

Synthetic data. Trained on time-reversed placement, not native removal. A policy trained on real removal data will likely outperform this one.
Visual transitions are physically backwards (mask materializes on face). This should not affect ACT's per-timestep predictions, because the policy sees a single observation per step (n_obs_steps=1, no temporal context input).
Use as a lower-bound baseline until native removal data is available.

Usage

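# Depending on the LeRobot version, this module may instead live at
# lerobot.policies.act.modeling_act.
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Download the checkpoint from the Hugging Face Hub and load its weights.
policy = ACTPolicy.from_pretrained("JHeisler/aloha_solo_left_act_removal_reversed_40k")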
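Before calling the policy, observations need to be arranged into a batch. The sketch below shows one way to do that with NumPy, using the state dimension, camera names, and image size from the config table. The helper name `build_observation` is hypothetical, and the dictionary keys (`observation.state`, `observation.images.<camera>`) and the channel-first float layout in [0, 1] are assumptions based on common LeRobot conventions; check them against the checkpoint's config before relying on them.

```python
import numpy as np

CAMERAS = ("cam_high", "cam_left_wrist")  # camera names from the config table
STATE_DIM = 9
IMAGE_SHAPE = (480, 640, 3)  # height, width, channels as delivered by the camera


def build_observation(state, images):
    """Arrange one timestep into a batch-of-one dict (hypothetical helper).

    state: sequence of 9 joint values.
    images: dict mapping camera name to a uint8 HxWx3 array.
    """
    state = np.asarray(state, dtype=np.float32)
    if state.shape != (STATE_DIM,):
        raise ValueError(f"expected state of shape ({STATE_DIM},), got {state.shape}")

    obs = {"observation.state": state[None, :]}  # add batch dimension
    for cam in CAMERAS:
        img = np.asarray(images[cam])
        if img.shape != IMAGE_SHAPE or img.dtype != np.uint8:
            raise ValueError(f"{cam}: expected uint8 image of shape {IMAGE_SHAPE}")
        # HWC uint8 in [0, 255] -> CHW float32 in [0, 1], plus batch dimension.
        chw = img.transpose(2, 0, 1).astype(np.float32) / 255.0
        obs[f"observation.images.{cam}"] = chw[None, ...]
    return obs


# Toy input: zero state and mid-gray frames for both cameras.
toy_images = {cam: np.full(IMAGE_SHAPE, 128, dtype=np.uint8) for cam in CAMERAS}
toy_obs = build_observation(np.zeros(STATE_DIM), toy_images)
```

Each array would then be converted to a tensor on the policy's device before being passed to the policy's action-selection method. Since the card states n_obs_steps=1, a single timestep per call should be sufficient under these assumptions.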

Citation / Course

EN.525.681 school project — JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla.

Code reference: HuggingFace LeRobot at commit 96c7052.
