dk1-pretrain-40d-30hz-joints

Fast-WAM world-action model for the DK-1 bimanual robot, pretrained on a mix of real teleoperation and synthetic data. It jointly predicts future video and a chunk of future robot actions via flow matching, with self-attention shared between the two modalities (MoT). The action output is a single joint-space relative head (multistep_rel) that produces 14-D actions over a 50-step horizon.
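As a rough illustration of what sampling with flow matching looks like, the sketch below integrates a velocity field from noise (t=0) to a sample (t=1) with plain Euler steps. It is not this model's sampler: `euler_sample`, `toy_velocity`, `TARGET`, the step count, and the time convention are all assumptions made for the example, and the toy velocity stands in for the learned network.

```python
import random

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 14-D target, standing in for one action in a predicted chunk.
TARGET = [0.1 * i for i in range(14)]

def toy_velocity(x, t):
    """Conditional velocity of a straight noise-to-target path (toy, not learned)."""
    return [(xt - xi) / (1.0 - t) for xt, xi in zip(TARGET, x)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(14)]
action = euler_sample(toy_velocity, noise, num_steps=10)
```

In the real model the velocity would come from the network conditioned on observations, and a whole 50-step chunk would be integrated at once rather than a single 14-D vector.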

The backbone is a half-size slice (15 of 30 layers) of Wan2.2-TI2V-5B (~3.3B trainable parameters), trained in bf16.

Checkpoints

path	step	weights
`step_100000/model.pt`	100k (final)	raw model
`step_100000/ema.pt`	100k (final)	EMA (recommended for inference)
`step_80000/model.pt`	80k	raw model
`step_80000/ema.pt`	80k	EMA
`step_50000/model.pt`	50k	raw model
`step_50000/ema.pt`	50k	EMA

All checkpoints are bf16. Each .pt file is a dict with the keys model_state_dict, step, config, norm_stats, and norm_config, so a single file is self-contained for inference. Prefer the ema.pt weights.
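A minimal sketch of loading one of these files and checking its top-level layout might look as follows. The helper names `missing_keys` and `load_checkpoint` are hypothetical, and passing `weights_only=False` is an assumption (the config entry may hold non-tensor Python objects); only use that setting on files you trust.

```python
EXPECTED_KEYS = ("model_state_dict", "step", "config", "norm_stats", "norm_config")

def missing_keys(ckpt):
    """Return the expected top-level keys that are absent from a checkpoint dict."""
    return [key for key in EXPECTED_KEYS if key not in ckpt]

def load_checkpoint(path):
    """Load a checkpoint onto the CPU and verify its top-level keys."""
    import torch  # imported lazily so missing_keys() works without PyTorch

    # Assumption: full unpickling is needed for the non-tensor entries.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    absent = missing_keys(ckpt)
    if absent:
        raise KeyError(f"checkpoint is missing keys: {absent}")
    return ckpt
```

For example, `ckpt = load_checkpoint("step_100000/ema.pt")` would return the dict for the recommended EMA weights, after which `ckpt["model_state_dict"]` can be passed to the model's `load_state_dict`.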

Training

Training run (W&B): loss / eval plots

Parallelism: DDP (full model per GPU, gradients averaged), bf16 autocast, 8-bit AdamW, gradient checkpointing, torch.compile.
Inputs: 3 cameras (head, left/right wrist), 384×320, 5 observation + 8 future frames; proprio = pos/vel/torque (40-D), history length 3.
Rate / horizon: ~30 Hz, action horizon 50, RTC training enabled.
Data: DK-1 teleop (swan, cutlery-basket, duplo) + dk1-merge + RoboTwin synthetic (stack-blocks), 14-D pos-only sets zero-padded to 40-D.
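The zero-padding of position-only data mentioned above can be sketched as follows. This assumes the 14 position values occupy the first slots of the 40-D vector and the remaining slots are filled with zeros; the actual slot layout is not stated here, and `pad_to_width` is a hypothetical helper.

```python
def pad_to_width(vec, width=40):
    """Right-pad a feature vector with zeros up to `width` entries."""
    if len(vec) > width:
        raise ValueError(f"vector has {len(vec)} entries, more than width {width}")
    return list(vec) + [0.0] * (width - len(vec))

# Hypothetical 14-D joint positions from a position-only dataset.
positions = [0.05 * i for i in range(14)]
padded = pad_to_width(positions)  # 40 entries: 14 positions, then 26 zeros
```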

Notes

Research checkpoint. Inherits the license and usage terms of the Wan2.2-TI2V-5B base model. Action/state normalization is robot-specific, so use the norm_stats bundled in each checkpoint.
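To make the role of the normalization statistics concrete, here is a sketch of per-dimension standardization and its inverse. It assumes mean/std statistics; the real structure of norm_stats and norm_config is not documented here and may differ (for example min/max or quantile scaling), so inspect the checkpoint before relying on this. The function names and the `eps` term are illustrative.

```python
def normalize(x, mean, std, eps=1e-8):
    """Map raw values to normalized space, one dimension at a time."""
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]

def unnormalize(z, mean, std, eps=1e-8):
    """Inverse of normalize(): map model outputs back to raw units."""
    return [zi * (s + eps) + m for zi, m, s in zip(z, mean, std)]

# Hypothetical statistics for a 3-D slice of the state vector.
mean = [0.0, 1.0, -0.5]
std = [1.0, 2.0, 0.5]
raw = [0.5, 3.0, -1.0]
restored = unnormalize(normalize(raw, mean, std), mean, std)
```

Predicted actions would go through the inverse step before being sent to the robot, using the statistics stored alongside the weights rather than values computed on a different robot.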

Model tree for andreaskoepf/dk1-pretrain-40d-30hz-joints: fine-tuned from the base model Wan-AI/Wan2.2-TI2V-5B.