LeWM PushT on Franka FR3 (v2, 304-ep decisive teleop)
An action-conditioned latent world model (LeWM) trained on real Franka FR3 PushT teleoperation data. It predicts the next visual-latent state given the current state and a 2-D Δ-EE XY action, enabling MPPI planning from a goal image. No PRISM prior is bundled in this release; the planner is vanilla MPPI only. See § "Plan-worthiness diagnostics" below for what to expect.
Model summary
| Property | Value |
|---|---|
| Architecture | ViT-tiny visual encoder + 6-layer AR-Transformer predictor + Embedder action encoder (LeWM standard) |
| Parameters | 18.03 M total (10.79 M predictor + 5.7 M ViT-tiny + 1.5 M projectors/embedders) |
| Input | (224, 224, 3) RGB obs + (224, 224, 3) RGB goal |
| Output | 2-D Δ-EE XY (meters) per env-tick at 10 Hz |
| Latent dim | 192 |
| Training data | Rongxuan-Zhou/pusht_lewm_fr3: 304 episodes, 72,307 frames, decisive-style teleop |
| Training config | frameskip=5, num_steps=4, num_preds=1, history_size=3, 100 epochs |
| Optimizer | LightningAdamW + LinearWarmupCosineAnnealing |
| Loss | MSE + SIGReg anti-collapse regularizer |
| Final val_pred_loss | 0.0045 |
| Wall-clock | ~3 h 50 min on RTX 5090 |
Plan-worthiness diagnostics
Measured on the training distribution (pusht_lewm_fr3_2d_v2.h5, 120 eps × 3 seeds = 360 samples, block=5, K=512). See PRISM-JEPA docs/33 for the full discussion.
| Metric | Value | Interpretation |
|---|---|---|
| CV @ H=5 | 0.180 ± 0.010 | Borderline: below the 0.30 plan-worthy threshold; vanilla MPPI will work but may struggle to converge on the best plan |
| GT_rank @ H=5 | 36.6 % ± 2.1 % | Direction is correct: the expert action ranks better than ~63 % of random candidates (its rank is below the 50 % chance line, where lower is better; "weak-align" tier) |
| pred/id @ H=1 | 0.465 ± 0.017 | Single-step rollout is meaningfully better than the "do-nothing" baseline (good action-conditioning) |
| pred/id @ H=5 | 0.151 ± 0.002 | 5-step rollout is highly accurate (good for MPPI's planning horizon) |
| pred/id @ H=25 | 0.280 ± 0.006 | Long-horizon rollout degrades; use H ≤ 5–10 for planning |
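This README does not define the three metrics formally. The sketch below shows one plausible reading, under these assumptions: CV is the coefficient of variation (std / mean) of the MPPI candidate costs, GT_rank is the percentage of random candidates whose cost is lower than the expert action's, and pred/id is the prediction error divided by the error of an identity ("do-nothing") predictor. The function names are hypothetical and are not part of the bundle.

```python
import numpy as np


def cost_cv(costs):
    """Coefficient of variation of candidate costs (std / mean). Assumed definition."""
    costs = np.asarray(costs, dtype=np.float64)
    return float(costs.std() / costs.mean())


def gt_rank_percent(expert_cost, candidate_costs):
    """Percentage of random candidates that beat the expert (lower is better). Assumed definition."""
    candidate_costs = np.asarray(candidate_costs, dtype=np.float64)
    return float(100.0 * np.mean(candidate_costs < expert_cost))


def pred_over_identity(z_pred, z_true, z_start):
    """Prediction MSE divided by the MSE of predicting 'no change'. Assumed definition."""
    z_pred = np.asarray(z_pred, dtype=np.float64)
    z_true = np.asarray(z_true, dtype=np.float64)
    z_start = np.asarray(z_start, dtype=np.float64)
    pred_err = np.mean((z_pred - z_true) ** 2)
    id_err = np.mean((z_start - z_true) ** 2)
    return float(pred_err / id_err)
```

Under this reading, a pred/id value below 1 means the model beats the do-nothing baseline, and a GT_rank below 50 % means the expert action scores better than a random candidate would on average.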
Recommended MPPI horizon at deploy: H = 5 (= 25 env-ticks ≈ 2.5 s at 10 Hz). Longer horizons accumulate too much rollout error, which drowns out MPPI's selection signal.
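The horizon arithmetic can be checked with a few lines. This is a small worked example using frameskip=5 from the training config and the 10 Hz tick; the helper name is hypothetical.

```python
FRAMESKIP = 5     # env-ticks per plan-step (from the training config)
CONTROL_HZ = 10   # env-ticks per second


def horizon_span(plan_steps, frameskip=FRAMESKIP, control_hz=CONTROL_HZ):
    """Return (env_ticks, seconds) covered by a horizon of `plan_steps` plan-steps."""
    ticks = plan_steps * frameskip
    return ticks, ticks / control_hz


print(horizon_span(5))  # (25, 2.5)
print(horizon_span(1))  # (5, 0.5)
```

One plan-step therefore spans 5 env-ticks (0.5 s), and the recommended H = 5 looks 2.5 s ahead.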
Quick start
pip install torch torchvision numpy einops transformers huggingface_hub
import sys
import time

import numpy as np
from huggingface_hub import snapshot_download

# Download the bundle
local = snapshot_download("YuhaiW/lewm-pusht-fr3-v2")

# Add the bundle to your path so `jepa.py` and `module.py` are importable
sys.path.insert(0, local)
from pusht_lewm_inference import PushtLewmInference

planner = PushtLewmInference(
    lewm_ckpt=f"{local}/lewm_pusht_fr3_v2.ckpt",
    action_scaler=f"{local}/action_scaler.json",
    device="cuda",
)

# In your robot control loop (10 Hz). `done`, `camera_rgb`, `goal_rgb`, and
# `robot` are placeholders for your own robot stack.
while not done:
    obs_uint8 = camera_rgb()     # (224, 224, 3) uint8
    goal_uint8 = goal_rgb()      # (224, 224, 3) uint8
    actions = planner.plan(obs_uint8, goal_uint8)  # (5, 2) float32, Δxy in meters
    for a in actions:                 # 5 actions for the next 0.5 s
        robot.send_delta_target(a)    # operator-frame Δxy
        time.sleep(0.1)               # 10 Hz tick
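Because the planner expects (224, 224, 3) uint8 frames, it can help to validate camera output before calling `planner.plan`. The helper below is a minimal sketch; `check_frame` is a hypothetical name and is not part of the bundle.

```python
import numpy as np

EXPECTED_SHAPE = (224, 224, 3)


def check_frame(frame, name="obs"):
    """Return the frame unchanged, or raise ValueError if it is not (224, 224, 3) uint8."""
    frame = np.asarray(frame)
    if frame.shape != EXPECTED_SHAPE:
        raise ValueError(f"{name}: expected shape {EXPECTED_SHAPE}, got {frame.shape}")
    if frame.dtype != np.uint8:
        raise ValueError(f"{name}: expected dtype uint8, got {frame.dtype}")
    return frame
```

Calling `check_frame(camera_rgb(), "obs")` and `check_frame(goal_rgb(), "goal")` at the top of the loop surfaces a mis-sized or float-valued frame as an explicit error rather than as a silently degraded plan.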
Robot expectations
| Item | Expectation |
|---|---|
| Robot | Franka FR3 (or compatible) with Cartesian impedance control in operator frame |
| Action interpretation | Δ-target XY in meters, applied as a small step toward target position |
| Control frequency | 10 Hz (per-tick action represents ~0.1 s of motion) |
| Camera | Top-down RGB at 224 × 224 (matches training-time camera_top view) |
| Goal | Single RGB still showing the desired final scene |
| Z, rotation, gripper | NOT controlled by this model (XY-only by design; lock these in your controller) |
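One way to honor these expectations in a controller is to clamp each Δxy step and overwrite Z with a fixed hover height before sending the target. The sketch below is an illustration only: `next_target`, the 2 cm per-tick limit, and the example hover height are assumptions, not values from this release.

```python
import numpy as np

MAX_STEP_M = 0.02  # assumed per-tick safety limit in meters; tune for your setup


def next_target(current_xyz, delta_xy, z_lock, max_step=MAX_STEP_M):
    """Apply a clamped XY delta to the current position and lock Z."""
    delta = np.asarray(delta_xy, dtype=np.float64)
    norm = np.linalg.norm(delta)
    if norm > max_step:
        # Shrink the step to the limit while keeping its direction
        delta = delta * (max_step / norm)
    x, y, _ = current_xyz
    return np.array([x + delta[0], y + delta[1], z_lock])
```

For example, `next_target((0.4, 0.0, 0.3), (0.03, 0.04), z_lock=0.12)` shrinks the 5 cm request to 2 cm along the same direction and returns a target at the locked height.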
What's in the bundle
lewm_pusht_fr3_v2.ckpt # 72 MB, the world model (pickled JEPA object)
action_scaler.json # StandardScaler statistics (Δxy in meters, std ≈ 8 mm)
pusht_lewm_inference.py # standalone vanilla-MPPI planner (self-contained)
jepa.py, module.py # required for ckpt deserialization
requirements.txt # minimal deps
README.md # this file
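The bundled planner handles action scaling internally. For reference, a StandardScaler-style transform works as sketched below. The JSON key names `mean` and `scale` are assumptions made for illustration; inspect `action_scaler.json` for the actual layout.

```python
import json

import numpy as np


def load_scaler(path):
    """Load (mean, scale) arrays. The key names are hypothetical."""
    with open(path) as f:
        stats = json.load(f)
    mean = np.asarray(stats["mean"], dtype=np.float64)
    scale = np.asarray(stats["scale"], dtype=np.float64)
    return mean, scale


def normalize(action_m, mean, scale):
    """Meters -> normalized action space: (a - mean) / scale."""
    return (np.asarray(action_m, dtype=np.float64) - mean) / scale


def denormalize(action_norm, mean, scale):
    """Normalized action space -> meters: a * scale + mean."""
    return np.asarray(action_norm, dtype=np.float64) * scale + mean
```

With a standard deviation of about 8 mm, a normalized action of 1.0 corresponds to a step of roughly 8 mm per tick.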
Caveats and limitations
- Borderline cost surface. CV @ H=5 = 0.180, below the 0.30 "plan-worthy" threshold. Vanilla MPPI may not always converge to a clearly best action. For more reliable deployment, layer a PRISM action prior on top (not included in this release).
- Trained on top-down RGB only. Side-view or first-person inputs are out of distribution; expect degraded behavior.
- 2-D XY action space. This model does NOT control Z, rotation, or gripper. Your robot stack must lock those (e.g., constant Z hover height).
- 10 Hz tick. The action scaler was fit at 10 Hz, so each action is a per-tick step sized for that rate. At a faster or slower control rate, the same Δxy produces a different velocity than in training.
- Decisive teleop training distribution. The model expects pushes that look like "operator picks T target, pushes there in one smooth motion." Hesitant / re-correction-heavy teleop is OOD.
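To keep the control loop at the 10 Hz tick noted above, a deadline-based loop avoids the drift that a plain `sleep(0.1)` accumulates when each step takes non-zero time. This is a generic sketch; `run_at_fixed_rate` is a hypothetical helper, not part of the bundle.

```python
import time


def run_at_fixed_rate(step_fn, n_ticks, hz=10.0, clock=time.monotonic, sleep=time.sleep):
    """Call step_fn(i) n_ticks times, spaced 1/hz seconds apart by absolute deadlines."""
    period = 1.0 / hz
    next_deadline = clock()
    for i in range(n_ticks):
        step_fn(i)
        # Schedule against the previous deadline, not against 'now',
        # so time spent inside step_fn does not stretch the tick.
        next_deadline += period
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; on a robot the defaults are what you want.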
Architecture details
The world model follows the standard LeWM recipe:
RGB image (224²)          Goal image (224²)
      │                         │
      ▼                         ▼
ViT-tiny encoder          ViT-tiny encoder   ← shared weights
      │                         │
      ▼                         ▼
z_t ∈ ℝ^192               z_g ∈ ℝ^192

Per plan-step (= 5 env-ticks):
z_t, action_block (5 ticks of Δxy, concatenated ∈ ℝ^10)
      │
      ▼
action_encoder (Conv1d + MLP) → action_emb ∈ ℝ^192
      │
      ▼
ARPredictor (6-layer Transformer, AdaLN action conditioning)
      │
      ▼
z_{t+1} ∈ ℝ^192 → iterate H = 5 plan-steps → z_end

Cost = ||z_end − z_g||₂² → MPPI re-weights candidates
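The planning step in the diagram can be sketched as a single vanilla-MPPI update. This is a minimal illustration, not the bundled planner's implementation: `toy_rollout` is a hypothetical stand-in for the learned predictor, and `sigma`, `temperature`, and the single-iteration structure are assumptions.

```python
import numpy as np


def mppi_plan(z0, z_goal, rollout_fn, horizon=5, block_dim=10,
              num_samples=512, sigma=1.0, temperature=0.1, seed=0):
    """One MPPI update: sample action sequences, roll out, score, softmax-average."""
    rng = np.random.default_rng(seed)
    # Candidate sequences in normalized action space: (K, H, block_dim)
    candidates = rng.normal(0.0, sigma, size=(num_samples, horizon, block_dim))
    costs = np.empty(num_samples)
    for k in range(num_samples):
        z = z0
        for h in range(horizon):
            z = rollout_fn(z, candidates[k, h])  # stand-in for the ARPredictor
        costs[k] = np.sum((z - z_goal) ** 2)     # squared L2 distance to the goal latent
    # Softmax over negative cost; subtracting the minimum keeps exp() stable
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    plan = np.tensordot(weights, candidates, axes=1)  # weighted mean, shape (H, block_dim)
    return plan, costs


def toy_rollout(z, action_block):
    """Hypothetical linear dynamics: two action dims nudge two latent dims."""
    z_next = z.copy()
    z_next[:2] += 0.1 * action_block[:2]
    return z_next


z0 = np.zeros(192)
z_goal = np.zeros(192)
z_goal[:2] = [0.5, -0.5]
plan, costs = mppi_plan(z0, z_goal, toy_rollout)
print(plan.shape, costs.shape)
```

In deployment only the first action block of the plan would be executed before replanning, which matches the (5, 2) output of `planner.plan` in the quick start.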
Provenance
- Trained: 2026-06-01 (RTX 5090, ~3 h 50 min)
- Project: PRISM-JEPA (research code)
- Companion: YuhaiW/prism-jepa-red-cube-arx (same architecture on the ARX cube task, includes prior_head)
Citation
If you use this model in research, please cite the PRISM paper (in preparation):
@misc{prism-jepa-pusht-fr3-v2,
title = {LeWM PushT on Franka FR3 (v2, 304-ep decisive teleop)},
author = {Wang, Yuhai and Zhou, Rongxuan and collaborators},
year = {2026},
url = {https://huggingface.co/YuhaiW/lewm-pusht-fr3-v2}
}
License
Apache 2.0.