LeWM PushT on Franka FR3 (v2, 304-ep decisive teleop)

An action-conditioned latent world model (LeWM) trained on real Franka FR3 PushT teleoperation data. It predicts the next visual-latent state from the current state and a 2-D Δ-EE XY action, which enables MPPI planning from a goal image. No PRISM prior is bundled in this release; planning is vanilla MPPI only. See "Plan-worthiness diagnostics" below for what to expect.

Model summary

| Field | Value |
|---|---|
| Architecture | ViT-tiny visual encoder + 6-layer AR-Transformer predictor + Embedder action encoder (LeWM standard) |
| Parameters | 18.03 M total (10.79 M predictor + 5.7 M ViT-tiny + 1.5 M projectors/embedders) |
| Input | (224, 224, 3) RGB obs + (224, 224, 3) RGB goal |
| Output | 2-D Δ-EE XY (meters) per env-tick at 10 Hz |
| Latent dim | 192 |
| Training data | Rongxuan-Zhou/pusht_lewm_fr3: 304 episodes, 72,307 frames, decisive-style teleop |
| Training config | frameskip=5, num_steps=4, num_preds=1, history_size=3, 100 epochs |
| Optimizer | LightningAdamW + LinearWarmupCosineAnnealing |
| Loss | MSE + SIGReg anti-collapse regularizer |
| Final val_pred_loss | 0.0045 |
| Wall-clock | ~3 h 50 min on RTX 5090 |
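
As a rough illustration of how the training config values relate to each other, the sketch below assumes that each training window holds history_size + num_preds = num_steps frames spaced frameskip raw ticks apart, and that the raw actions between two consecutive frames form one 5-tick block. That reading is an assumption and is not taken from the training code; `window_indices` is a hypothetical helper.

```python
def window_indices(start, frameskip=5, history_size=3, num_preds=1):
    """Return (context frames, target frames, action blocks) as raw tick indices.

    Assumed layout: num_steps = history_size + num_preds frames,
    each `frameskip` raw ticks after the previous one.
    """
    num_steps = history_size + num_preds
    frames = [start + i * frameskip for i in range(num_steps)]
    context = frames[:history_size]      # frames the predictor conditions on
    targets = frames[history_size:]      # frames it is asked to predict
    # One action block per transition: the raw ticks between frame i and i+1.
    action_blocks = [list(range(f, f + frameskip)) for f in frames[:-1]]
    return context, targets, action_blocks


context, targets, action_blocks = window_indices(start=0)
print("context frames:", context)
print("target frames: ", targets)
print("action blocks: ", action_blocks)
```

Under this reading, one window spans 15 raw ticks (1.5 s at 10 Hz), and each action block carries 5 × 2 = 10 numbers.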

Plan-worthiness diagnostics

Measured on the training distribution (pusht_lewm_fr3_2d_v2.h5, 120 eps × 3 seeds = 360 samples, block=5, K=512). See PRISM-JEPA docs/33 for full discussion.
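
The metric definitions live in PRISM-JEPA docs/33 and are not reproduced here. The sketch below shows one plausible reading, stated as assumptions: CV as the coefficient of variation (std / mean) of the K candidate costs, GT_rank as the percentage of random candidates that score better (lower cost) than the expert action, and pred/id as the prediction error divided by the error of a "do-nothing" baseline that predicts no change in the latent. The function names are hypothetical.

```python
import numpy as np


def cost_cv(costs):
    """Coefficient of variation of candidate costs (assumed meaning of CV)."""
    costs = np.asarray(costs, dtype=np.float64)
    return float(np.std(costs) / np.mean(costs))


def gt_rank_percent(expert_cost, candidate_costs):
    """Percent of random candidates with strictly lower cost than the expert."""
    candidate_costs = np.asarray(candidate_costs, dtype=np.float64)
    return float(100.0 * np.mean(candidate_costs < expert_cost))


def pred_over_id(z_pred, z_true, z_start):
    """Squared prediction error relative to the identity ("do-nothing") baseline.

    Values below 1 mean the rollout beats predicting "nothing changes".
    """
    pred_err = np.mean(np.sum((z_pred - z_true) ** 2, axis=-1))
    id_err = np.mean(np.sum((z_start - z_true) ** 2, axis=-1))
    return float(pred_err / id_err)


# Toy data only: K = 512 random costs and 192-dim latents.
rng = np.random.default_rng(0)
toy_costs = rng.uniform(1.0, 2.0, size=512)
z_start = rng.normal(size=(8, 192))
z_true = z_start + rng.normal(scale=0.5, size=(8, 192))
z_pred = z_true + rng.normal(scale=0.1, size=(8, 192))

print("CV:", cost_cv(toy_costs))
print("GT_rank (%):", gt_rank_percent(1.2, toy_costs))
print("pred/id:", pred_over_id(z_pred, z_true, z_start))
```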

| Metric | Value | Interpretation |
|---|---|---|
| CV @ H=5 | 0.180 ± 0.010 | Borderline: below the 0.30 plan-worthy threshold; vanilla MPPI will work but may struggle to converge on the best plan |
| GT_rank @ H=5 | 36.6 ± 2.1 % | Direction is correct: the expert action ranks better than ~63 % of random candidates (below the 50 % chance line, where lower is better; "weak-align" tier) |
| pred/id @ H=1 | 0.465 ± 0.017 | Single-step rollout is meaningfully better than the "do-nothing" baseline (good action-conditioning) |
| pred/id @ H=5 | 0.151 ± 0.002 | 5-step rollout is highly accurate (good for MPPI's planning horizon) |
| pred/id @ H=25 | 0.280 ± 0.006 | Long-horizon rollout degrades; use H ≤ 5–10 for planning |

Recommended MPPI horizon at deployment: H = 5 (= 25 env-ticks = 2.5 s at 10 Hz). Longer horizons accumulate too much rollout error, which drowns out MPPI's selection signal.
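
The horizon arithmetic can be checked with a few lines. This is only a convenience sketch; `plan_duration_s` is a hypothetical helper, and the defaults (5 env-ticks per plan-step, 10 Hz) are the values stated in this README.

```python
def plan_duration_s(horizon_steps, ticks_per_step=5, control_hz=10.0):
    """Wall-clock time covered by a plan of `horizon_steps` plan-steps."""
    env_ticks = horizon_steps * ticks_per_step
    return env_ticks / control_hz


for horizon in (1, 5, 25):
    ticks = horizon * 5
    print(f"H={horizon:>2}: {ticks:>3} env-ticks, {plan_duration_s(horizon):.1f} s")
```

At H = 25 the rollout covers 125 env-ticks (12.5 s), which is the regime where the pred/id ratio in the table above degrades.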

Quick start

pip install torch torchvision numpy einops transformers huggingface_hub

import sys
import time

from huggingface_hub import snapshot_download

# Download the bundle
local = snapshot_download("YuhaiW/lewm-pusht-fr3-v2")

# Add the bundle to your path so `jepa.py` and `module.py` are importable
sys.path.insert(0, local)

from pusht_lewm_inference import PushtLewmInference

planner = PushtLewmInference(
    lewm_ckpt     = f"{local}/lewm_pusht_fr3_v2.ckpt",
    action_scaler = f"{local}/action_scaler.json",
    device        = "cuda",
)

# In your robot control loop (10 Hz).
# `done`, `camera_rgb`, `goal_rgb`, and `robot` are placeholders for your own stack.
while not done:
    obs_uint8  = camera_rgb()              # (224, 224, 3) uint8
    goal_uint8 = goal_rgb()                # (224, 224, 3) uint8
    actions    = planner.plan(obs_uint8, goal_uint8)
                                           # (5, 2) float32, Δxy in meters
    for a in actions:                      # 5 actions for the next 0.5 s
        robot.send_delta_target(a)         # operator-frame Δxy
        time.sleep(0.1)                    # 10 Hz tick
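
A fixed `time.sleep(0.1)` does not account for the time spent planning or sending commands, so the effective rate drifts below 10 Hz. One common alternative is to sleep until a running deadline. The sketch below is not part of the bundle; `run_at_rate` is a hypothetical helper, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def run_at_rate(step_fn, n_ticks, hz=10.0, clock=time.monotonic, sleep=time.sleep):
    """Call step_fn(i) n_ticks times, holding an average rate of `hz`.

    Each tick sleeps only for the time left until its deadline, so the
    duration of step_fn itself is absorbed instead of added to the period.
    """
    period = 1.0 / hz
    next_deadline = clock()
    for i in range(n_ticks):
        step_fn(i)
        next_deadline += period
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
        # If remaining <= 0 the tick overran; continue without sleeping.


if __name__ == "__main__":
    start = time.monotonic()
    run_at_rate(lambda i: None, n_ticks=5, hz=10.0)
    print(f"5 ticks took {time.monotonic() - start:.3f} s")
```

In the control loop above, `step_fn` would wrap the `robot.send_delta_target(a)` call for one action.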

Robot expectations

| Item | Expectation |
|---|---|
| Robot | Franka FR3 (or compatible) with Cartesian impedance control in operator frame |
| Action interpretation | Δ-target XY in meters, applied as a small step toward target position |
| Control frequency | 10 Hz (per-tick action represents ~0.1 s of motion) |
| Camera | Top-down RGB at 224 × 224 (matches training-time camera_top view) |
| Goal | Single RGB still showing the desired final scene |
| Z, rotation, gripper | NOT controlled by this model (XY-only by design; lock these in your controller) |

What's in the bundle

lewm_pusht_fr3_v2.ckpt       # 72 MB, the world model (pickled JEPA object)
action_scaler.json           # StandardScaler statistics (Δxy meters, std ≈ 8 mm)
pusht_lewm_inference.py      # standalone vanilla-MPPI planner (self-contained)
jepa.py, module.py           # required for ckpt deserialization
requirements.txt             # minimal deps
README.md                    # this file
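
For orientation, a StandardScaler-style file typically stores a per-dimension mean and standard deviation, and actions are mapped with (a − mean) / std and back. The sketch below assumes hypothetical JSON keys `mean` and `std` and an example std of 8 mm; the real `action_scaler.json` may use different keys, so inspect the file before relying on this.

```python
import json

import numpy as np

# Hypothetical file contents; key names and values are assumptions.
EXAMPLE_SCALER_JSON = '{"mean": [0.0, 0.0], "std": [0.008, 0.008]}'


def load_scaler(text):
    """Parse scaler statistics into (mean, std) float32 arrays."""
    stats = json.loads(text)
    mean = np.asarray(stats["mean"], dtype=np.float32)
    std = np.asarray(stats["std"], dtype=np.float32)
    return mean, std


def normalize(delta_xy_m, mean, std):
    """Meters -> unit-variance action space used by the model."""
    return (np.asarray(delta_xy_m, dtype=np.float32) - mean) / std


def denormalize(action_norm, mean, std):
    """Model action space -> meters."""
    return np.asarray(action_norm, dtype=np.float32) * std + mean


mean, std = load_scaler(EXAMPLE_SCALER_JSON)
step_m = denormalize([1.0, -2.0], mean, std)   # a 1-sigma and a 2-sigma step
print("delta in meters:", step_m)
print("round trip:", normalize(step_m, mean, std))
```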

Caveats and limitations

  1. Borderline cost surface. CV @ H=5 is 0.180, below the 0.30 "plan-worthy" threshold. Vanilla MPPI may not always converge to a clearly best action. For more reliable deployment, layer a PRISM action prior on top (not included in this release).
  2. Trained on top-down RGB only. Side-view or first-person inputs are out of distribution; expect degraded behavior.
  3. 2-D XY action space. This model does NOT control Z, rotation, or gripper. Your robot stack must lock those (e.g., constant Z hover height).
  4. 10 Hz tick. The action scaler is fit at 10 Hz, so at a faster or slower control rate the same Δxy step corresponds to a different velocity than in training.
  5. Decisive teleop training distribution. The model expects pushes that look like "operator picks a T target and pushes there in one smooth motion." Hesitant or re-correction-heavy teleop is OOD.
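
Caveat 3 leaves Z, rotation, and gripper to the robot stack. One way to lock Z is to build every position target from the current XY plus the model's Δxy and a fixed hover height. The sketch below is an illustration only: `delta_to_target`, the 2 cm per-tick clamp, and the example hover height are assumptions, not values from this release.

```python
import numpy as np


def delta_to_target(current_xyz, delta_xy, z_hover, max_step=0.02):
    """Return the next Cartesian position target with Z pinned to z_hover.

    delta_xy is clamped per axis to +/- max_step meters as a safety limit
    (hypothetical value; tune it for your setup).
    """
    delta = np.clip(np.asarray(delta_xy, dtype=np.float64), -max_step, max_step)
    x, y, _ = current_xyz            # the measured Z is ignored on purpose
    return np.array([x + delta[0], y + delta[1], z_hover])


# Example: the second component exceeds the clamp and is limited to -0.02 m.
target = delta_to_target([0.40, 0.00, 0.30], [0.005, -0.05], z_hover=0.12)
print("next target (x, y, z):", target)
```

Orientation and gripper state would be held constant in the same way by the controller that consumes this target.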

Architecture details

The world model follows the standard LeWM recipe:

RGB image (224²)                Goal image (224²)
        │                              │
        ▼                              ▼
   ViT-tiny encoder              ViT-tiny encoder      ←  shared weights
        │                              │
        ▼                              ▼
   z_t ∈ ℝ^192                 z_g ∈ ℝ^192

Per plan-step (= 5 env-ticks):
   z_t, action_block (5 ticks of Δxy, concatenated → ℝ^10)
        │
        ▼
   action_encoder (Conv1d + MLP) → action_emb ∈ ℝ^192
        │
        ▼
   ARPredictor (6-layer Transformer, AdaLN action conditioning)
        │
        ▼
   z_{t+1} ∈ ℝ^192  →  iterate H = 5 plan-steps  →  z_end

Cost = ||z_end − z_g||₂²  →  MPPI re-weights candidates
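
The rollout-and-reweight step in the diagram can be sketched in NumPy. This is a single-iteration, textbook-style MPPI update and not the code in `pusht_lewm_inference.py`: the linear `toy_predictor` stands in for the learned ARPredictor, and the noise scale, temperature, and tick-major ordering of the 10-dim action block are all assumptions.

```python
import numpy as np

LATENT_DIM, BLOCK, ACT_DIM = 192, 5, 2


def toy_predictor(z, action_block, W):
    """Stand-in dynamics: z_next = z + action_block @ W (NOT the real model)."""
    return z + action_block @ W


def mppi_step(z0, z_goal, W, horizon=5, num_samples=512,
              sigma=1.0, temperature=1.0, rng=None):
    """Sample K action sequences, roll them out, and softmax-weight by cost."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidates in normalized action space: (K, H, BLOCK * ACT_DIM)
    candidates = rng.normal(0.0, sigma, size=(num_samples, horizon, BLOCK * ACT_DIM))
    z = np.repeat(z0[None, :], num_samples, axis=0)
    for h in range(horizon):                      # iterate H plan-steps
        z = toy_predictor(z, candidates[:, h], W)
    costs = np.sum((z - z_goal) ** 2, axis=-1)    # ||z_end - z_g||^2 per candidate
    # Subtract the minimum before exponentiating for numerical stability.
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    plan = np.einsum("k,khd->hd", weights, candidates)
    return plan.reshape(horizon, BLOCK, ACT_DIM), costs, weights


rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(BLOCK * ACT_DIM, LATENT_DIM))
z0 = np.zeros(LATENT_DIM)
z_goal = rng.normal(size=LATENT_DIM)

plan, costs, weights = mppi_step(z0, z_goal, W, rng=rng)
first_block = plan[0]    # shape (5, 2): normalized actions for the next 5 ticks
print("plan shape:", plan.shape, "| best cost:", costs.min())
```

In a receding-horizon loop only the first block would be executed before replanning, which matches the (5, 2) output shape shown in the quick start.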

Provenance

Citation

If you use this model in research, please cite the PRISM paper (in preparation):

@misc{prism-jepa-pusht-fr3-v2,
  title  = {LeWM PushT on Franka FR3 (v2, 304-ep decisive teleop)},
  author = {Wang, Yuhai and Zhou, Rongxuan and collaborators},
  year   = {2026},
  url    = {https://huggingface.co/YuhaiW/lewm-pusht-fr3-v2}
}

License

Apache 2.0.
