dk1-pretrain-40d-30hz-joints
Fast-WAM world-action model for the DK-1 bimanual robot, pretrained on a
diverse blend of real teleop + synthetic data. It jointly predicts future video
and a chunk of future robot actions via flow matching, sharing self-attention
between the two modalities (MoT). Action output is a single joint-space
relative head (multistep_rel), 14-D actions over a 50-step horizon.
The backbone is a half-size slice (15 of 30 layers) of Wan2.2-TI2V-5B (~3.3B trainable parameters), trained in bf16.
Checkpoints
| path | step | weights |
|---|---|---|
| step_100000/model.pt | 100k (final) | raw model |
| step_100000/ema.pt | 100k (final) | EMA (recommended for inference) |
| step_80000/model.pt | 80k | raw model |
| step_80000/ema.pt | 80k | EMA |
| step_50000/model.pt | 50k | raw model |
| step_50000/ema.pt | 50k | EMA |
All checkpoints are bf16. Each .pt is a dict with model_state_dict,
step, config, norm_stats, and norm_config — self-contained for
inference. Prefer the ema.pt weights.
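As a minimal sketch of how such a checkpoint might be opened, the snippet below loads the dict on CPU and checks that the five documented keys are present. The helper names `validate_checkpoint` and `load_checkpoint` are hypothetical, and `weights_only=False` is an assumption made because `config` and `norm_stats` may contain non-tensor Python objects. Constructing the model itself is not shown, since the model class is not described here.

```python
from typing import Any, Mapping

# Keys the model card lists for every .pt file.
EXPECTED_KEYS = ("model_state_dict", "step", "config", "norm_stats", "norm_config")


def validate_checkpoint(ckpt: Mapping[str, Any]) -> Mapping[str, Any]:
    """Raise KeyError if a documented key is missing; otherwise return ckpt."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return ckpt


def load_checkpoint(path: str) -> Mapping[str, Any]:
    """Load a checkpoint dict onto the CPU and validate its keys."""
    import torch  # imported lazily so validate_checkpoint works without PyTorch

    # Assumption: config / norm_stats may hold non-tensor Python objects, so
    # weights_only=False is used. This unpickles arbitrary objects, so only
    # load files you trust.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    return validate_checkpoint(ckpt)
```

A typical call would be `ckpt = load_checkpoint("step_100000/ema.pt")`, after which `ckpt["model_state_dict"]` and `ckpt["norm_stats"]` are available together.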
Training
Training run (W&B): loss / eval plots
- Parallelism: DDP (full model per GPU, gradients averaged), bf16 autocast, 8-bit AdamW, gradient checkpointing, torch.compile.
- Inputs: 3 cameras (head, left/right wrist), 384×320, 5 observation + 8 future frames; proprio = pos/vel/torque (40-D), history length 3.
- Rate / horizon: ~30 Hz, action horizon 50, RTC training enabled.
- Data: DK-1 teleop (swan, cutlery-basket, duplo) + dk1-merge + RoboTwin synthetic (stack-blocks); 14-D pos-only sets zero-padded to 40-D.
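The zero-padding of position-only data can be illustrated with a short sketch. The card does not say which of the 40 slots hold positions, so the code below assumes they occupy the leading 14 entries; that layout and the helper name `pad_pos_only` are both hypothetical.

```python
from typing import Sequence

STATE_DIM = 40  # proprio width stated in the card (pos/vel/torque)
POS_DIM = 14    # width of the position-only datasets


def pad_pos_only(pos: Sequence[float], state_dim: int = STATE_DIM) -> list[float]:
    """Zero-pad a position-only vector to the full proprio width.

    Assumption: positions sit in the leading slots and the remaining
    (velocity/torque) slots are filled with zeros.
    """
    if len(pos) > state_dim:
        raise ValueError(f"expected at most {state_dim} values, got {len(pos)}")
    return list(pos) + [0.0] * (state_dim - len(pos))
```

For example, `pad_pos_only([0.1] * POS_DIM)` returns a 40-element list whose last 26 entries are zero.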
Notes
Research checkpoint. Inherits the license and usage terms of the
Wan2.2-TI2V-5B base model.
Action/state normalization is robot-specific — use the bundled norm_stats.
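To make the role of the bundled statistics concrete, here is a hedged sketch of per-dimension normalization and its inverse. The actual layout of `norm_stats` is not documented here: the code assumes plain per-dimension mean and standard deviation lists, which is a hypothetical choice. The real checkpoint may use a different scheme (for example min/max or quantiles), so check `norm_config` before relying on this.

```python
from typing import Sequence

EPS = 1e-8  # guards against division by zero for constant dimensions


def normalize(x: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> list[float]:
    """Map raw values to normalized space: (x - mean) / (std + EPS)."""
    if not (len(x) == len(mean) == len(std)):
        raise ValueError("x, mean and std must have the same length")
    return [(v - m) / (s + EPS) for v, m, s in zip(x, mean, std)]


def denormalize(z: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> list[float]:
    """Inverse of normalize: z * (std + EPS) + mean."""
    if not (len(z) == len(mean) == len(std)):
        raise ValueError("z, mean and std must have the same length")
    return [v * (s + EPS) + m for v, m, s in zip(z, mean, std)]
```

Under this assumed scheme, model outputs would be passed through `denormalize` with the action statistics before being sent to the robot, and observations through `normalize` with the state statistics before inference.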
Model tree for andreaskoepf/dk1-pretrain-40d-30hz-joints
Base model: Wan-AI/Wan2.2-TI2V-5B