LeWM PushT on Franka FR3 (v2 — 304-ep decisive teleop)

An action-conditioned latent world model (LeWM) trained on real Franka FR3 PushT teleoperation data. It predicts the next visual-latent state from the current state and a 2-D Δ-EE XY action, which enables MPPI planning from a goal image. No PRISM prior is bundled in this release; planning uses vanilla MPPI only. See "Plan-worthiness diagnostics" below for what to expect.

Model summary


Architecture	ViT-tiny visual encoder + 6-layer AR-Transformer predictor + Embedder action encoder (LeWM standard)
Parameters	18.03 M total (10.79 M predictor + 5.7 M ViT-tiny + 1.5 M projectors/embedders)
Input	(224, 224, 3) RGB obs + (224, 224, 3) RGB goal
Output	2-D Δ-EE XY (meters) per env-tick at 10 Hz
Latent dim	192
Training data	`Rongxuan-Zhou/pusht_lewm_fr3` — 304 episodes, 72,307 frames, decisive-style teleop
Training config	`frameskip=5`, `num_steps=4`, `num_preds=1`, `history_size=3`, 100 epochs
Optimizer	LightningAdamW + LinearWarmupCosineAnnealing
Loss	MSE + SIGReg anti-collapse regularizer
Final val_pred_loss	0.0045
Wall-clock	~3 h 50 min on RTX 5090

Plan-worthiness diagnostics

Measured on the training distribution (pusht_lewm_fr3_2d_v2.h5, 120 eps × 3 seeds = 360 samples, block=5, K=512). See PRISM-JEPA docs/33 for full discussion.

Metric	Value	Interpretation
CV @ H=5	0.180 ± 0.010	Borderline — below the 0.30 plan-worthy threshold; vanilla MPPI will work but may struggle to converge on the best plan
GT_rank @ H=5	36.6 % ± 2.1	Direction is correct: the expert action ranks better than ~63 % of random candidates (a rank percentile under the 50 % chance line, where lower is better; "weak-align" tier)
pred/id @ H=1	0.465 ± 0.017	Single-step rollout is meaningfully better than the "do-nothing" baseline (good action-conditioning)
pred/id @ H=5	0.151 ± 0.002	5-step rollout is highly accurate (good for MPPI's planning horizon)
pred/id @ H=25	0.280 ± 0.006	Long-horizon rollout degrades; keep the planning horizon at H ≤ 10, and preferably H = 5
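
The two rollout metrics in the table can be read as follows. This is a minimal sketch under assumptions, not the PRISM-JEPA evaluation code: it assumes pred/id is the squared error of the predicted latent divided by the squared error of an identity ("do-nothing") prediction, and that GT_rank is the share of random candidates whose cost is lower than the expert action's cost. The function names and toy numbers are hypothetical.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pred_over_id(z_now, z_pred, z_true):
    """Assumed definition: rollout error relative to predicting 'no change'.

    Values below 1.0 mean the rollout beats the do-nothing baseline.
    """
    return sq_dist(z_pred, z_true) / sq_dist(z_now, z_true)


def gt_rank(expert_cost, candidate_costs):
    """Assumed definition: share of random candidates that beat the expert.

    0.0 means the expert action is the best candidate; 0.5 is chance level.
    """
    better = sum(1 for c in candidate_costs if c < expert_cost)
    return better / len(candidate_costs)


# Toy latents (hypothetical numbers, 2-D instead of 192-D).
z_now, z_true, z_pred = [0.0, 0.0], [1.0, 0.0], [0.8, 0.1]
ratio = pred_over_id(z_now, z_pred, z_true)
rank = gt_rank(0.3, [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
print(f"pred/id = {ratio:.3f}, GT_rank = {rank:.1%}")
```

Under these assumed definitions, both metrics are "lower is better", which matches how the Interpretation column reads them.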

Recommended MPPI horizon at deployment: H = 5 plan-steps (5 × 5 = 25 env-ticks ≈ 2.5 s at 10 Hz). Longer horizons accumulate rollout error, which drowns out the cost differences MPPI uses to select among candidates.
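
The horizon arithmetic above can be checked with a few lines. This sketch assumes one plan-step spans `frameskip=5` env-ticks, as stated in the training config and the architecture notes; the helper name `horizon_seconds` is hypothetical.

```python
FRAMESKIP = 5        # env-ticks per plan-step (training config `frameskip=5`)
CONTROL_HZ = 10.0    # env-tick rate


def horizon_seconds(plan_steps, frameskip=FRAMESKIP, control_hz=CONTROL_HZ):
    """Return (env_ticks, seconds) covered by a horizon of `plan_steps`."""
    env_ticks = plan_steps * frameskip
    return env_ticks, env_ticks / control_hz


for h in (1, 5, 10):
    ticks, seconds = horizon_seconds(h)
    print(f"H={h:>2}: {ticks:>2} env-ticks, {seconds:.1f} s")
```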

Quick start

pip install torch torchvision numpy einops transformers huggingface_hub

import time

import numpy as np
from huggingface_hub import snapshot_download

# Download the bundle
local = snapshot_download("YuhaiW/lewm-pusht-fr3-v2")

# Add the bundle to your path so `jepa.py` and `module.py` are importable
import sys; sys.path.insert(0, local)

from pusht_lewm_inference import PushtLewmInference

planner = PushtLewmInference(
    lewm_ckpt     = f"{local}/lewm_pusht_fr3_v2.ckpt",
    action_scaler = f"{local}/action_scaler.json",
    device        = "cuda",
)

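# In your robot control loop: replan every 5 ticks, send actions at 10 Hz.
# `done`, `camera_rgb`, `goal_rgb`, and `robot` are placeholders for your own stack.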
while not done:
    obs_uint8  = camera_rgb()              # (224, 224, 3) uint8
    goal_uint8 = goal_rgb()                # (224, 224, 3) uint8
    actions    = planner.plan(obs_uint8, goal_uint8)
                                           # (5, 2) float32 — meters Δxy
    for a in actions:                      # 5 actions for next 0.5 s
        robot.send_delta_target(a)         # operator-frame Δxy
        time.sleep(0.1)                    # 10 Hz tick
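
A fixed `time.sleep(0.1)` after every send makes each tick last 0.1 s plus the time spent sending, so the loop runs slightly slower than 10 Hz. One possible way to hold the rate is to schedule against a monotonic clock. This is a sketch, not part of the bundle: `run_at_rate` is a hypothetical helper and `send` stands in for `robot.send_delta_target`.

```python
import time


def run_at_rate(actions, send, hz=10.0, clock=time.monotonic, sleep=time.sleep):
    """Send each action on a fixed schedule so per-tick work does not add drift."""
    period = 1.0 / hz
    next_tick = clock()
    for action in actions:
        send(action)
        next_tick += period
        remaining = next_tick - clock()
        if remaining > 0:
            sleep(remaining)
        else:
            # Overran the tick: resynchronise instead of bursting to catch up.
            next_tick = clock()


sent = []
start = time.monotonic()
run_at_rate([(0.001, 0.0)] * 5, sent.append, hz=100.0)  # 100 Hz keeps the demo short
elapsed = time.monotonic() - start
print(f"sent {len(sent)} actions in {elapsed * 1000:.0f} ms")
```

In the control loop, the inner `for a in actions:` body would become a single call such as `run_at_rate(actions, robot.send_delta_target)`.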

Robot expectations


Robot	Franka FR3 (or compatible) with Cartesian impedance control in operator frame
Action interpretation	Δ-target XY in meters, applied as a small step toward target position
Control frequency	10 Hz (per-tick action represents ~0.1 s of motion)
Camera	Top-down RGB at 224 × 224 (matches training-time `camera_top` view)
Goal	Single RGB still showing the desired final scene
Z, rotation, gripper	NOT controlled by this model (XY-only by design; lock these in your controller)
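
Since the model only outputs Δxy, the controller side has to hold Z fixed and may want to bound each step. A minimal sketch of that glue, assuming a hypothetical hover height and a hypothetical per-tick step limit (neither value comes from this release); rotation and gripper locking are left to the robot stack.

```python
import math

MAX_STEP_M = 0.02   # hypothetical per-tick safety limit
HOVER_Z_M = 0.12    # hypothetical fixed end-effector height


def next_target(current_xy, delta_xy, max_step=MAX_STEP_M, z=HOVER_Z_M):
    """Turn a planned Δxy into an (x, y, z) target with Z held constant."""
    dx, dy = delta_xy
    norm = math.hypot(dx, dy)
    if norm > max_step:
        # Shrink the step but keep its direction.
        scale = max_step / norm
        dx, dy = dx * scale, dy * scale
    return (current_xy[0] + dx, current_xy[1] + dy, z)


print(next_target((0.40, 0.00), (0.008, -0.006)))   # within limit, passed through
print(next_target((0.40, 0.00), (0.060, 0.080)))    # clamped to a 2 cm step
```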

What's in the bundle

lewm_pusht_fr3_v2.ckpt       # 72 MB — the world model (pickled JEPA object)
action_scaler.json           # StandardScaler statistics (Δxy meters, std ≈ 8 mm)
pusht_lewm_inference.py      # standalone vanilla-MPPI planner (self-contained)
jepa.py, module.py           # required for ckpt deserialization
requirements.txt             # minimal deps
README.md                    # this file
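
For orientation, `action_scaler.json` holds StandardScaler statistics, so converting between meters and the model's normalized action space is a subtract-and-divide. The sketch below is illustrative only: the key names `mean` and `scale` and the example numbers are assumptions, and the real file layout may differ (the bundled `pusht_lewm_inference.py` is the reference for how the file is actually read).

```python
import json

# Hypothetical file contents; the real key names in action_scaler.json may differ.
EXAMPLE_JSON = '{"mean": [0.0005, -0.0002], "scale": [0.008, 0.008]}'


def load_scaler(text):
    """Parse scaler statistics from a JSON string (assumed keys: mean, scale)."""
    stats = json.loads(text)
    return stats["mean"], stats["scale"]


def to_normalized(delta_xy_m, mean, scale):
    """Meters -> zero-mean, unit-variance action (StandardScaler transform)."""
    return [(a - m) / s for a, m, s in zip(delta_xy_m, mean, scale)]


def to_meters(normalized, mean, scale):
    """Inverse transform: normalized action -> Δxy in meters."""
    return [n * s + m for n, m, s in zip(normalized, mean, scale)]


mean, scale = load_scaler(EXAMPLE_JSON)
z = to_normalized([0.0085, -0.0082], mean, scale)
print(z, to_meters(z, mean, scale))
```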

Caveats and limitations

Borderline cost surface. CV @ H=5 = 0.180, below the 0.30 "plan-worthy" threshold. Vanilla MPPI may not always converge to a clearly best action. For more reliable deployment, layer a PRISM action prior on top (not included in this release).
Trained on top-down RGB only. Side-view or first-person inputs are out of distribution; expect degraded behavior.
2-D XY action space. This model does NOT control Z, rotation, or gripper. Your robot stack must lock those (e.g., constant Z hover height).
10 Hz tick. The action scaler is fit on per-tick Δxy at 10 Hz, so at a faster or slower control rate the same action corresponds to a different end-effector speed than in training.
Decisive teleop training distribution. The model expects pushes that look like "operator picks T target, pushes there in one smooth motion." Hesitant / re-correction-heavy teleop is OOD.

Architecture details

The world model follows the standard LeWM recipe:

RGB image (224²)                Goal image (224²)
        │                              │
        ▼                              ▼
   ViT-tiny encoder              ViT-tiny encoder      ←  shared weights
        │                              │
        ▼                              ▼
   z_t ∈ ℝ^192                 z_g ∈ ℝ^192

Per plan-step (= 5 env-ticks):
   z_t, action_block (5 ticks of Δxy, concatenated → ℝ^10)
        │
        ▼
   action_encoder (Conv1d + MLP) → action_emb ∈ ℝ^192
        │
        ▼
   ARPredictor (6-layer Transformer, AdaLN action conditioning)
        │
        ▼
   z_{t+1} ∈ ℝ^192  →  iterate H = 5 plan-steps  →  z_end

Cost = ||z_end − z_g||₂²  →  MPPI re-weights candidates
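
The planning loop in the diagram (sample candidates, roll out, score against the goal latent, re-weight) can be sketched in a few dozen lines. This is a toy illustration of vanilla MPPI, not the code in `pusht_lewm_inference.py`: `toy_predictor` is a hypothetical stand-in for the learned predictor, the latent is 2-D instead of 192-D, each plan-step takes a single 2-D action rather than a 5-tick block, and all hyperparameters are made up.

```python
import math
import random


def toy_predictor(z, action):
    """Stand-in for the learned predictor: the latent just integrates the action."""
    return [zi + ai for zi, ai in zip(z, action)]


def rollout_cost(z0, plan, z_goal, predictor):
    """Roll the predictor over the plan and return ||z_end - z_goal||^2."""
    z = z0
    for action in plan:
        z = predictor(z, action)
    return sum((a - b) ** 2 for a, b in zip(z, z_goal))


def mppi(z0, z_goal, predictor, horizon=5, samples=256, iters=5,
         sigma=0.5, temperature=0.1, seed=0):
    """Return a nominal plan (horizon x dim) refined by exponential re-weighting."""
    rng = random.Random(seed)
    dim = len(z0)
    nominal = [[0.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate plans as Gaussian perturbations of the nominal plan.
        plans = [[[m + rng.gauss(0.0, sigma) for m in step] for step in nominal]
                 for _ in range(samples)]
        costs = [rollout_cost(z0, p, z_goal, predictor) for p in plans]
        best = min(costs)
        # Low-cost candidates dominate the weighted average.
        weights = [math.exp(-(c - best) / temperature) for c in costs]
        total = sum(weights)
        nominal = [[sum(w * p[t][d] for w, p in zip(weights, plans)) / total
                    for d in range(dim)] for t in range(horizon)]
    return nominal


z0, z_goal = [0.0, 0.0], [1.0, -0.5]
plan = mppi(z0, z_goal, toy_predictor)
zero_cost = rollout_cost(z0, [[0.0, 0.0]] * 5, z_goal, toy_predictor)
final_cost = rollout_cost(z0, plan, z_goal, toy_predictor)
print(f"cost of zero plan: {zero_cost:.3f}")
print(f"cost of MPPI plan: {final_cost:.3f}")
```

The re-weighting step is also where a flat cost surface hurts: if candidate costs barely differ, the weights become nearly uniform and the averaged plan stays close to the nominal one.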

Provenance

Trained: 2026-06-01 (RTX 5090, ~3 h 50 min)
Project: PRISM-JEPA (research code)
Companion: YuhaiW/prism-jepa-red-cube-arx — same architecture on ARX cube task, includes prior_head

Citation

If you use this model in research, please cite the PRISM paper (in preparation):

@misc{prism-jepa-pusht-fr3-v2,
  title  = {LeWM PushT on Franka FR3 (v2 — 304-ep decisive teleop)},
  author = {Wang, Yuhai and Zhou, Rongxuan and collaborators},
  year   = {2026},
  url    = {https://huggingface.co/YuhaiW/lewm-pusht-fr3-v2}
}

License

Apache 2.0.
