X-VLA-PickOrange

An X-VLA policy (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params) fine-tuned from X-VLA-base on the LeIsaac SO-101 PickOrange task.

X-VLA-PickOrange — SO-101 in Isaac Sim

🔗 Project repos

TL;DR

  • Task: "Pick up the orange and put it in the plate". A single-arm SO-101 picks up 3 oranges sequentially and places each one in the plate.
  • Dataset: LightwheelAI/leisaac-pick-orange, 60 teleoperated demonstration episodes (50 train / 10 val split).
  • Architecture: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). chunk_size=32, n_obs_steps=2.
  • Training: batch=8 / lr=1e-4 / 10k steps / weak image-aug (brightness ±5% only) / GRIPPER_SCALE=5 / ~18 min on RTX 4090.
  • Eval (benchmark-aligned: 3 rounds × 120 s sim × 180 s wall_cap, the same conditions as the other leaderboard baselines): 4/9 oranges (44%); ep2 = [T, T, T], 3/3 ⭐.
  • ⚠️ Critical inference setting: n_action_steps=32 (reuse the whole chunk_size). With the default n_action_steps=8, this checkpoint fails catastrophically at 0/18 over 6 rounds (the per-step replans conflict with each other). See the Inference caveat below.

Highlights

  • Under the benchmark setting (3 rounds × 120 s sim × 180 s wall_cap, matching the leaderboard protocol), ep2 placed all 3 oranges (3/3). None of the other baselines (ACT, DP, X-VLA-15k) achieved a single 3/3 episode under the same conditions.
  • Exposes the critical role of n_action_steps: switching from the default 8 to 32 (= chunk_size, full chunk reuse) was the only reliable improvement found in the session, about 3.5× over baseline.
  • Weak image-aug was the only retrain with a positive aggregate effect. Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), only weak image-aug was net positive. lerobot's default ColorJitter + Sharpness + Affine over-regularizes on the 50-demo dataset (13% per-ep, -11.1 points vs baseline); keeping only brightness ±5% (max_num_transforms=1) instead gave +5.6 points over baseline, and the 10k-step checkpoint reached 44% per-ep on the benchmark.

Training recipe

# Single-stage 10k-step fine-tune from lerobot/xvla-base
WEAK_IMAGE_AUG=1 \
BATCH_SIZE=8 \
MAX_STEPS=10000 \
SAVE_FREQ=500 \
OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
bash LeIsaac/scripts/finetune/xvla/train.sh

Inside train.sh, WEAK_IMAGE_AUG=1 expands to:

--dataset.image_transforms.enable=true
--dataset.image_transforms.max_num_transforms=1
--dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}}

That is: at most 1 transform is sampled per batch, and only brightness ±5% is allowed (contrast / saturation / hue / SharpnessJitter / RandomAffine are disabled).
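The three flags above can also be generated programmatically, which avoids hand-editing the JSON payload. The following is a minimal sketch, not the contents of train.sh: the helper names `weak_image_aug_flags` and `as_shell_words` are hypothetical, and only the flag names and values are taken from this card. Because the `tfs` value contains quotes and braces, it is shell-quoted before being placed on a command line.

```python
import json
import shlex
from typing import List


def weak_image_aug_flags(brightness_delta: float = 0.05) -> List[str]:
    """Build the three image-transform flags listed above (hypothetical helper)."""
    lo = round(1.0 - brightness_delta, 4)
    hi = round(1.0 + brightness_delta, 4)
    # Only a brightness ColorJitter entry is configured; no other transform is listed.
    tfs = {
        "brightness": {
            "weight": 1.0,
            "type": "ColorJitter",
            "kwargs": {"brightness": [lo, hi]},
        }
    }
    return [
        "--dataset.image_transforms.enable=true",
        "--dataset.image_transforms.max_num_transforms=1",
        # Compact separators reproduce the whitespace-free JSON shown above.
        "--dataset.image_transforms.tfs=" + json.dumps(tfs, separators=(",", ":")),
    ]


def as_shell_words(flags: List[str]) -> str:
    """Quote each flag so the JSON survives shell word splitting."""
    return " ".join(shlex.quote(flag) for flag in flags)


if __name__ == "__main__":
    print(as_shell_words(weak_image_aug_flags()))
```

Passing a different `brightness_delta` would change only the jitter range; whether train.sh accepts such an override is not stated here.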

For a detailed comparison, see the full retrain experiment aggregate table.

Inference

End-to-end server (compatible with the Isaac Sim ZMQ client)

# Start the X-VLA inference server (ZMQ REQ/REP + msgpack)
N_ACTION_STEPS=32 \
PROMPT="Pick up the orange and put it in the plate" \
CKPT=<this_repo_dir> \
PORT=5558 \
bash server/serve_xvla.sh --detach

# Run the PickOrange eval from the Isaac Sim client
POLICY_PORT=5558 \
POLICY_TIMEOUT_MS=3000 \
ACTION_HORIZON=1 \
EVAL_ROUNDS=3 \
EPISODE_LENGTH=120 \
PROMPT="Pick up the orange and put it in the plate" \
MAX_ROUND_WALL_S=180 \
bash server/eval_pi05.sh

Server implementation. The eval script is shared with π0.5/SmolVLA/ACT/GR00T.

🔴 Critical inference caveat

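| n_action_steps | 6-round oranges | per-ep | Notes |
|---|---|---|---|
| 8 (lerobot default) | 0/18 | 0% | Replans at every step; chunk[0]→chunk[0]→... conflict with each other |
| 16 | 4/18 | 22% | Partial chunk reuse |
| 32 (= chunk_size) | 6/18, incl. one 3/3 perfect episode | 33% | Full chunk reuse; each chunk is self-consistent |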

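X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk has to be fully executed in the env for its planning value to show. Replanning at every step instead misaligns the chunk sequence.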
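To make the role of n_action_steps concrete, here is a minimal sketch of chunked execution with a generic action queue. It is not the lerobot or X-VLA implementation: `run_with_chunk_reuse` and `dummy_plan_chunk` are hypothetical names, and the dummy policy only tags each action with the step at which its chunk was planned.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

CHUNK_SIZE = 32  # chunk length stated in this card

Action = Tuple[int, int]  # (step at which the chunk was planned, index within chunk)


def dummy_plan_chunk(step: int) -> List[Action]:
    """Stand-in for the policy: returns a full chunk tagged with its planning step."""
    return [(step, i) for i in range(CHUNK_SIZE)]


def run_with_chunk_reuse(
    plan_chunk: Callable[[int], List[Action]],
    n_action_steps: int,
    total_steps: int,
) -> Tuple[List[Action], int]:
    """Execute total_steps actions, replanning only when the queue is empty."""
    if not 1 <= n_action_steps <= CHUNK_SIZE:
        raise ValueError("n_action_steps must be in [1, CHUNK_SIZE]")
    queue: Deque[Action] = deque()
    executed: List[Action] = []
    replans = 0
    for step in range(total_steps):
        if not queue:
            chunk = plan_chunk(step)
            # Keep only the first n_action_steps actions; the rest are discarded.
            queue.extend(chunk[:n_action_steps])
            replans += 1
        executed.append(queue.popleft())
    return executed, replans


if __name__ == "__main__":
    for n in (8, 16, 32):
        _, replans = run_with_chunk_reuse(dummy_plan_chunk, n, total_steps=64)
        print(f"n_action_steps={n}: {replans} replans in 64 steps")
```

With n_action_steps equal to CHUNK_SIZE every planned action is executed, while smaller values throw away the tail of each chunk and replan more often.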

Evaluation

Benchmark-aligned (3 rounds × 120 s sim × 180 s wall_cap), same conditions as the leaderboard

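| Episode | Oranges placed | Wall time | Notes |
|---|---|---|---|
| 1 | 1/3 | 180.1s | wall_cap |
| 2 | 3/3 | 180.0s | 3/3 perfect |
| 3 | 0/3 | 180.1s | wall_cap |
| Total | 4/9 (44%) | | 0/3 strict (the env never reported done, even when all 3 oranges were placed) |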

6-round extended eval (60 s sim × 90 s wall_cap)

| Episode | Oranges placed | Wall time |
|---|---|---|
| 1 | 1/3 | 90.0s |
| 2 | 3/3 | 90.0s |
| 3 | 0/3 | 90.0s |
| 4 | 1/3 | 90.1s |
| 5 | 0/3 | 90.0s |
| 6 | 1/3 | 90.1s |
| Total | 6/18 (33%) | |
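The totals in the two evaluation tables follow directly from the per-episode counts. A small sketch of that aggregation is shown here; the `summarize` helper is hypothetical, and the per-episode numbers are copied from the tables above.

```python
from typing import Dict, List

ORANGES_PER_EPISODE = 3

# Oranges placed per episode, as reported in the tables above.
BENCHMARK_3_ROUND: List[int] = [1, 3, 0]
EXTENDED_6_ROUND: List[int] = [1, 3, 0, 1, 0, 1]


def summarize(placed_per_episode: List[int]) -> Dict[str, float]:
    """Aggregate per-episode orange counts into totals and a success rate."""
    placed = sum(placed_per_episode)
    attempts = ORANGES_PER_EPISODE * len(placed_per_episode)
    perfect = sum(1 for p in placed_per_episode if p == ORANGES_PER_EPISODE)
    return {
        "placed": placed,
        "attempts": attempts,
        "rate": placed / attempts,
        "perfect_episodes": perfect,
    }


if __name__ == "__main__":
    for name, data in (("benchmark", BENCHMARK_3_ROUND), ("extended", EXTENDED_6_ROUND)):
        s = summarize(data)
        print(f"{name}: {s['placed']}/{s['attempts']} ({s['rate']:.0%})")
```

Note that `perfect_episodes` counts episodes with all 3 oranges placed, which is not the same as the strict metric that depends on the env reporting done.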

Full retrain experiment aggregate table

| Retrain config (5 ckpts × 6-round = 90 ep) | per-ep aggregate | vs baseline |
|---|---|---|
| 🥇 Weak image-aug (brightness ±5%) | 30.0% | +5.6 |
| L1 loss (OFT-lite, Fine-Tuning VLA 2502.19645) | 27.8% | +3.4 |
| Baseline (no retrain) | 24.4% | |
| L1 + weak aug compound | 15.6% | -8.8 (negative interference) |
| Default image-aug (lerobot default strength) | 13.3% | -11.1 |
| Velocity-reweight β=2.0 (AttenA+ 2605.13548) | ~11% | -13 |

For details, see the parent project's HTML design document vla_improvement_methods_checklist.html (includes 90+ hyperparameter-sweep CSVs).

Negative findings: DO NOT repeat

Strictly falsified across 90+ experiments (≥36 ep cumulative):

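  • ❌ TAE (Temporal Action Ensembling, ALOHA 2304.13705): K∈{2,4,8} × m∈{0.1,0.3} all scored ≤1/9. X-VLA's RF head with 10-step denoising already produces smooth actions.
  • ❌ EMA action smoothing, α∈[0.2, 0.7]: the 5/9 for α=0.3 in the 3-round eval was a single-episode outlier; the 12-round retest gave 2/18, so it is actually harmful.
  • "Grasp" verb in prompt: 0/18, completely dead. Possibly because "grasp" in the OXE datasets is associated with hand pose rather than a robot reach trajectory.
  • "all " prompts: 3/18; they trigger multi-target ambiguity.
  • Short prompts without the "Pick up" preamble: 1/18; the model fails to ground the task.
  • "on/onto the plate" prepositions: ≤2/18, far worse than "in the plate" (container semantics).
  • ❌ Body-desc retrain (Path 2): with Florence2 frozen, a longer prompt is only a token-level perturbation and does not change action conditioning.
  • Offline action-MSE eval: does not predict closed-loop performance (falsified repeatedly). Only actual Isaac Sim rollouts are reliable.
  • 3-round closed-loop eval: variance is ±15-30%. All decisions must use ≥6 rounds (≥18 ep); comparisons must use ≥12 rounds (≥36 ep).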
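For readers unfamiliar with the two smoothing methods listed as falsified, the sketch below shows what they typically compute. It is an illustration under assumptions, not the code used in these experiments: the convention that α weights the new action is assumed, and the exponential weights exp(-m·i) with index 0 as the oldest prediction follow the temporal ensembling described in the ALOHA paper. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def ema_smooth(actions: Sequence[Sequence[float]], alpha: float) -> List[List[float]]:
    """Exponential moving average over an action sequence.

    Assumed convention: smoothed = alpha * new + (1 - alpha) * previous smoothed.
    """
    smoothed: List[List[float]] = []
    for action in actions:
        if not smoothed:
            current = list(action)  # the first action passes through unchanged
        else:
            prev = smoothed[-1]
            current = [alpha * a + (1.0 - alpha) * p for a, p in zip(action, prev)]
        smoothed.append(current)
    return smoothed


def temporal_ensemble(predictions: Sequence[Sequence[float]], m: float) -> List[float]:
    """Weighted average of overlapping predictions for one timestep.

    predictions are ordered oldest first; weight_i = exp(-m * i).
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * pred[d] for w, pred in zip(weights, predictions)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    print(ema_smooth([[0.0], [1.0], [1.0]], alpha=0.5))
    print(temporal_ensemble([[1.0], [3.0]], m=0.1))
```

Both methods blend actions from different planning moments, which is one plausible reason they interact poorly with a policy whose chunks are only self-consistent when executed whole.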

Limitations

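  • Small sample size: the 44% per-ep figure is an estimate from the 3-round benchmark (9 ep), with a wide confidence interval of ±20%. The 6-round extended eval gives 33% (18 ep, CI ±15%).
  • The dataset has only 50 demos: retrains that change the loss or augmentation tend to be too aggressive; expanding to 80-100 demos is expected to break through the current ~44% per-ep ceiling.
  • The place subtask is multimodal: the model occasionally hovers and jitters in mid-air after grasping. DAgger or synthetic relabeling may be needed to fix the covariate shift.
  • chunk_size=32 vs wall clock: 1 chunk = 32 steps × 33 ms ≈ 1 s planning period. This is more flexible than ACT (chunk=100, 3.3 s period) but slower than DP DDIM-32 (200 ms period).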
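The small-sample caveat in the first bullet can be quantified with a standard binomial interval. The sketch below uses the Wilson score interval; it assumes each orange is an independent Bernoulli trial, which is likely optimistic because oranges within one episode are correlated. The function name is hypothetical and the choice of the Wilson method is an assumption, not necessarily how the intervals quoted in this card were derived.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be in [0, trials]")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for placed, attempts in ((4, 9), (6, 18)):
        lo, hi = wilson_interval(placed, attempts)
        print(f"{placed}/{attempts}: [{lo:.2f}, {hi:.2f}]")
```

Running it on the 9-attempt and 18-attempt results shows how much the interval narrows with more rounds, which is the motivation for the ≥6-round and ≥12-round rules.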

Citations

License

Apache-2.0, consistent with lerobot / X-VLA-base.
