---
license: mit
tags:
- robotics
- manipulation
- oat
- libero
- blockwise-decoding
---

# Blockwise-OAT: strict original-OAT baseline (LIBERO-10)

Paired evaluation of **autoregressive (AR)** vs **blockwise parallel tail** action-token generation on a frozen [OAT](https://arxiv.org/abs/2602.04215) policy.

**HF repo:** [hackhackhack66666/Blockwise-OAT](https://huggingface.co/hackhackhack66666/Blockwise-OAT)

**Code branch:** `Blockwise-OAT` on [GadzhiAskhabaliev/OAT-BLT-Dense](https://github.com/GadzhiAskhabaliev/OAT-BLT-Dense)

## Summary

Primary success-rate (SR) reference: the **OAT8 paper** result on LIBERO-10, **56.3%** ([OAT](https://arxiv.org/abs/2602.04215), external benchmark).

| Metric | AR (our eval) | Blockwise (P=4, r=1) |
|--------|---------------|----------------------|
| LIBERO-10 mean SR | **58.73% ± 0.18%** | **52.33% ± 1.04%** |
| **Δ vs OAT paper (56.3%)** | **+2.43 pp** | **-3.97 pp** |
| Paired Δ (BW − AR, same protocol) | n/a | **-6.40 pp** |
| Tail training epochs | n/a | 15 (final CE 3.0607) |

Our frozen AR checkpoint scores above the paper's reported SR on this cluster stack (58.73% vs 56.3%). Blockwise decoding trades SR for faster token generation; the tail was trained for only 15 epochs, and resuming training is planned.

### Inference speed (V100, cuda:0)

**Decoder-only:** 8 action tokens generated after `cond` is computed (`benchmark_blockwise_vs_ar`, warmup=10, 50 repeats):

| Batch | AR | Blockwise | Speedup |
|-------|-----|-----------|---------|
| bs=1 | 22.3 ms | 19.3 ms | **1.16×** |
| bs=8 | 31.4 ms | 26.6 ms | **1.18×** |

**End-to-end `predict_action`:** vision encoder + decoder + detokenize (warmup=20, 100 repeats):

| Batch | AR | Blockwise | Speedup |
|-------|-----|-----------|---------|
| bs=1 | 36.4 ms | 30.1 ms | **1.21×** |
| bs=8 | 37.0 ms | 34.8 ms | **1.06×** |

The decoder speedup is modest (1.16× at bs=1 and 1.18× at bs=8, roughly 13% to 15% lower latency) because the tail module is comparable in size to the AR stack. End to end, the gain is 1.21× at bs=1 and falls to 1.06× at bs=8, where the vision encoder dominates latency.
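As a quick sanity check, the speedups and SR deltas reported above can be recomputed from the raw table values. The snippet below is a minimal sketch that uses only numbers copied from this README; the helper names `speedup` and `latency_reduction_pct` are illustrative and not part of the repository.

```python
# Latencies in milliseconds as (AR, Blockwise), copied from the speed tables.
latency_ms = {
    ("decoder", 1): (22.3, 19.3),
    ("decoder", 8): (31.4, 26.6),
    ("e2e", 1): (36.4, 30.1),
    ("e2e", 8): (37.0, 34.8),
}


def speedup(ar_ms, bw_ms):
    """Ratio of AR latency to Blockwise latency (values above 1 mean Blockwise is faster)."""
    return ar_ms / bw_ms


def latency_reduction_pct(ar_ms, bw_ms):
    """Percentage of AR latency saved by Blockwise decoding."""
    return 100.0 * (1.0 - bw_ms / ar_ms)


for (stage, bs), (ar_ms, bw_ms) in latency_ms.items():
    print(
        f"{stage:8s} bs={bs}: {speedup(ar_ms, bw_ms):.2f}x speedup, "
        f"{latency_reduction_pct(ar_ms, bw_ms):.1f}% lower latency"
    )

# Mean success rates in percent, copied from the summary table.
paper_sr, ar_sr, bw_sr = 56.3, 58.73, 52.33
print(f"AR vs paper:        {ar_sr - paper_sr:+.2f} pp")
print(f"Blockwise vs paper: {bw_sr - paper_sr:+.2f} pp")
print(f"Paired BW - AR:     {bw_sr - ar_sr:+.2f} pp")
```

Note that a 1.16× speedup corresponds to about 13.5% lower latency, not 16%: speedup is a ratio of latencies, while latency reduction is measured relative to the AR time.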

## Baseline artifacts (frozen)

| Component | Source |
|-----------|--------|
| Policy | [Mirageinv/oat: policy_ep-0250_sr-0.596.ckpt](https://huggingface.co/Mirageinv/oat) |
| Tokenizer | [Mirageinv/oat: tokenizer_ep-0950_mse-0.002.ckpt](https://huggingface.co/Mirageinv/oat) |
| Tail decoder | `checkpoints/original_oat_tail_p4_r1.pt` (this repo) |

## Architecture & data flow

OAT encodes observations and generates **8 action tokens** `z₁…z₈`. Blockwise-OAT splits decoding into an AR prefix and a parallel tail:

```
Obs (RGB + proprio) ──► Vision encoder ──► cond [B, T_o, d]
                                │
┌───────────────────────────────┴───────────────────────────────┐
│ AR path (baseline)                                            │
│ BOS ──► AutoregressiveModel.generate (8 steps) ──► z₁…z₈      │
└───────────────────────────────┬───────────────────────────────┘
                                │
┌───────────────────────────────┴───────────────────────────────┐
│ Blockwise path                                                │
│ BOS ──► generate_prefix (P=4 AR steps) ──► z₁…z₄, h_prefix    │
│ (z₁…z₄, h_prefix) ──► ParallelTailDecoder (1 pass) ──► z₅…z₈  │
└───────────────────────────────┬───────────────────────────────┘
                                ▼
cat(z_prefix, z_tail) ──► OATTok.detokenize ──► action chunk
```

**Inputs:** multi-view RGB, robot state, task id (same as OAT).

**Outputs:** `action` / `action_pred` tensors (identical shapes for AR and Blockwise).

**Trainable in this run:** only `ParallelTailDecoder` (~4.5M params, 0.90× AR size).

### Generation schedule

| Mode | AR forward passes | Tail passes |
|------|-------------------|-------------|
| Full AR | 8 | 0 |
| Blockwise P=4 | 4 | 1 |

## Experiment protocol

1. Download the Mirageinv/oat policy and tokenizer.
2. Train `ParallelTailDecoder` on `libero10_N500` with the policy frozen (15 epochs, bs=64, lr=1e-4).
3. Paired sim-eval: `50` episodes/task × `3` seeds (`test_start_seed=1000`).
4. Benchmarks: dataset / training / policy verification + wall-clock speed.

Cluster launcher: `scripts/cluster/run_blockwise_original_oat_baseline.sh` (`PHASE=B NUM_EXP=3`).
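To make the generation schedule concrete, the following toy sketch contrasts the two decoding paths and counts their sequential forward passes. It is an illustration under stated assumptions, not the repository's implementation: `ar_step`, `tail_decode`, and `VOCAB_SIZE` are hypothetical stand-ins, tokens are drawn at random instead of from a model, and the `h_prefix` hidden state passed to the real tail decoder is omitted.

```python
import random

NUM_TOKENS = 8     # action tokens per chunk (z1..z8), as stated in the README
PREFIX_LEN = 4     # P=4 autoregressive prefix steps
VOCAB_SIZE = 1000  # hypothetical codebook size, for illustration only


def ar_step(cond, tokens, rng):
    """Hypothetical stand-in for one AR forward pass that emits the next token."""
    return rng.randrange(VOCAB_SIZE)


def tail_decode(cond, prefix, n_tail, rng):
    """Hypothetical stand-in for the parallel tail: one pass emits n_tail tokens."""
    return [rng.randrange(VOCAB_SIZE) for _ in range(n_tail)]


def decode_ar(cond, rng):
    """Full AR path: one forward pass per token."""
    tokens, passes = [], 0
    for _ in range(NUM_TOKENS):
        tokens.append(ar_step(cond, tokens, rng))
        passes += 1
    return tokens, passes


def decode_blockwise(cond, rng, prefix_len=PREFIX_LEN):
    """Blockwise path: prefix_len AR passes, then a single parallel tail pass."""
    prefix, passes = [], 0
    for _ in range(prefix_len):
        prefix.append(ar_step(cond, prefix, rng))
        passes += 1
    tail = tail_decode(cond, prefix, NUM_TOKENS - prefix_len, rng)
    passes += 1
    return prefix + tail, passes


rng = random.Random(0)
ar_tokens, ar_passes = decode_ar(cond=None, rng=rng)
bw_tokens, bw_passes = decode_blockwise(cond=None, rng=rng)
print(f"AR: {ar_passes} sequential passes, Blockwise: {bw_passes} sequential passes")
```

Both paths return 8 tokens, but the blockwise path needs 5 sequential passes instead of 8. Since the tail decoder is about 0.90× the size of the AR stack, the single tail pass is not cheap, which is in line with the measured decoder speedup staying well below 8/5.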

## Visualizations

| Figure | Description |
|--------|-------------|
| ![AR eval](benchmarks/ar_sim_eval_dashboard.png) | AR per-task SR |
| ![Blockwise eval](benchmarks/blockwise_sim_eval_dashboard.png) | Blockwise per-task SR |
| ![Paired](benchmarks/paired_sr_comparison_dashboard.png) | Side-by-side per-task comparison |
| ![Speed](benchmarks/speed_benchmark_dashboard.png) | Decoder + E2E latency |
| ![Tail train](benchmarks/tail_training_dashboard.png) | Tail CE loss curve |
| ![Verify](benchmarks/verification_summary_dashboard.png) | Verification kit |

## Repository layout

```
checkpoints/original_oat_tail_p4_r1.pt   # trained tail decoder
eval/ar_eval_log.json                    # AR sim metrics
eval/blockwise_eval_log.json             # Blockwise sim metrics
benchmarks/*.json                        # verification + speed raw logs
benchmarks/*_dashboard.png               # plots above
```

## Reproduce inference

```bash
# AR baseline (full 8-step autoregressive decoding)
python scripts/eval_policy_sim.py \
  -c output/baselines/original_oat/hf/policy_ep-0250_sr-0.596.ckpt \
  -o output/eval/blockwise/ar \
  --tokenizer-checkpoint output/baselines/original_oat/hf/tokenizer_ep-0950_mse-0.002.ckpt

# Blockwise decoding (P=4 prefix, r=1 refine iteration, trained tail checkpoint)
python scripts/eval_policy_sim.py \
  -c output/baselines/original_oat/hf/policy_ep-0250_sr-0.596.ckpt \
  -o output/eval/blockwise/bw \
  --use-blockwise --blockwise-prefix-len 4 --blockwise-refine-iters 1 \
  --blockwise-tail-checkpoint checkpoints/original_oat_tail_p4_r1.pt \
  --tokenizer-checkpoint output/baselines/original_oat/hf/tokenizer_ep-0950_mse-0.002.ckpt
```

## Citation

```bibtex
@misc{liu2026oatorderedactiontokenization,
  title={OAT: Ordered Action Tokenization},
  author={Chaoqi Liu and Xiaoshen Han and Jiawei Gao and Yue Zhao and Haonan Chen and Yilun Du},
  year={2026},
  eprint={2602.04215},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}
```

## Phase 2 (next)

Phase 1 strict baseline is **complete** on branch `Blockwise-OAT`.

1. Resume tail training from `original_oat_tail_p4_r1.pt` (target 30+ epochs).
2. Re-run the paired AR vs Blockwise LIBERO-10 evaluation to confirm the results.
3. Re-run speed / verification benchmarks; publish Phase 2 bundle to HF.