
pi0.5 Packed Multi-Arm OpenPI Artifacts

This repo packages the full local artifact set for packed-action-head studies on pi0.5 across TWIN handover and TWIN dual-push, including:

  • all finished checkpoints under openpi/checkpoints/
  • the modified openpi/ training and evaluation code
  • train/eval logs and structured metric tables
  • reproducibility manifests and environment snapshots

Four runs are included:

  1. an initial 2K baseline-vs-parallel comparison
  2. a longer 10K follow-up on the same packed setup
  3. a 5K dual-push 128 screening study on the same packed path
  4. a 2K dual-push 128 four-way step comparison across shared, head_only_parallel, split_independent, and split_communicating

This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:

  • exact single-to-split warm-start checkpoints for split_independent and split_communicating
  • invariant checks for the new split architecture
  • detached real-data smoke and 20-step training runs on lsnu/twin_dual_push_128_train
  • the code changes that introduce the new split-expert action path

Experiment setup

  • Handover train/val: lsnu/twin_handover_256_train, lsnu/twin_handover_256_val
  • Dual-push train/val: lsnu/twin_dual_push_128_train, lsnu/twin_dual_push_128_val
  • Hardware: 4x H100 80GB
  • Precision: bfloat16
  • Semantic packed layout: [L8, 0x8, R8, 0x8]
  • Active action-loss dims: [0:8] and [16:24]
  • Masked padded dims: [8:16] and [24:32]
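
The packed layout and loss masking above can be sketched as follows. This is an illustrative stand-in, not the actual openpi transform code; the constant and function names are assumptions.

```python
import numpy as np

# Sketch of the semantic packed layout [L8, 0x8, R8, 0x8]: 32 action dims,
# with 8 live left-arm dims, 8 zero pads, 8 live right-arm dims, 8 zero pads.
PACKED_DIM = 32
ACTIVE_SLICES = [slice(0, 8), slice(16, 24)]   # live left/right dims
MASKED_SLICES = [slice(8, 16), slice(24, 32)]  # padded dims, excluded from loss

def action_loss_mask() -> np.ndarray:
    """Boolean mask over the packed action vector: True = contributes to loss."""
    mask = np.zeros(PACKED_DIM, dtype=bool)
    for s in ACTIVE_SLICES:
        mask[s] = True
    return mask

def masked_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """MSE computed only over the active (unpadded) action dims."""
    mask = action_loss_mask()
    diff = (pred - target)[..., mask]
    return float(np.mean(diff ** 2))
```

Errors placed in the padded dims ([8:16] and [24:32]) contribute nothing to the loss, which is what the masked-loss numbers below measure.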

Headline results

Teacher-forced masked validation loss:

| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
|---|---|---|---|---|---|
| Packed baseline | 0.035776 | 0.061130 | 0.041595 | 0.027324 | 0.022345 |
| Packed parallel | 0.035680 | 0.059715 | 0.039947 | 0.027340 | 0.022168 |

Sample-based eval on the fixed 10K final validation subset:

| Model | 4-step masked MAE | 10-step masked MAE | Train runtime (h:mm:ss) | Peak VRAM |
|---|---|---|---|---|
| Packed baseline | 0.029935 | 0.030294 | 2:13:40 | 35.23 GB |
| Packed parallel | 0.029277 | 0.030241 | 2:20:51 | 35.27 GB |

The long run still shows a very small parallel edge on teacher-forced validation loss by 10K, while the sample-based eval is essentially a tie.
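The size of that edge can be quantified directly from the 10K final-step numbers above; the snippet below is just that arithmetic, not part of the repo's tooling.

```python
# Relative parallel edge from the 10K teacher-forced table above
# (final-step masked validation loss).
baseline_10k = 0.022345
parallel_10k = 0.022168
rel_edge = (baseline_10k - parallel_10k) / baseline_10k
print(f"parallel edge at 10K: {rel_edge:.2%}")  # ~0.79%
```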

Dual-push 128 screening results:

| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime (h:mm:ss) |
|---|---|---|---|---|---|---|
| Packed baseline | 0.095597 | 0.083194 | 0.055958 | 0.056830 | 0.058973 | 1:05:25 |
| Packed parallel | 0.093704 | 0.082729 | 0.055242 | 0.054630 | 0.056627 | 1:00:33 |

The dual-push screening run shows a small but consistent parallel edge at 1K, 2K, and 5K on both teacher-forced validation loss and fixed-subset sample MAE.

Dual-push 128 four-way 2K step comparison raw results:

Step-0 teacher-forced masked validation loss:

| Model | Step-0 val loss | Step-0 left/right imbalance |
|---|---|---|
| Shared | 1.084735 | 0.505345 |
| Head-only parallel | 1.082985 | 0.501182 |
| Split independent | 1.328262 | 0.448843 |
| Split communicating | 1.783048 | 0.671085 |

Step-2000 teacher-forced masked validation loss:

| Model | Step-2000 val loss | Step-2000 left/right imbalance |
|---|---|---|
| Shared | 0.055329 | 0.069564 |
| Head-only parallel | 0.055297 | 0.069380 |
| Split independent | 0.063537 | 0.092029 |
| Split communicating | 0.059952 | 0.080435 |

Step-2000 sample masked MAE:

| Model | 1-step MAE | 4-step MAE | 16-step MAE |
|---|---|---|---|
| Shared | 0.087330 | 0.078164 | 0.085222 |
| Head-only parallel | 0.086764 | 0.078301 | 0.085272 |
| Split independent | 0.079100 | 0.070436 | 0.075281 |
| Split communicating | 0.078618 | 0.071087 | 0.075570 |

Full raw tables for the 0/100/500/2000 sweep live in:

  • artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv
  • artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv
  • artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv

Warm-start note

The packed parallel warm-start uses the slice/fuse mapping implemented in openpi/scripts/init_parallel_pi05_from_single_pytorch.py. The added step-0 numerical checks, however, show it is not exactly identical end-to-end on a real batch:

  • handover 10K: input_projection_max_abs_diff = 0.00122881, masked_loss_abs_diff = 0.00398052
  • dual-push 5K: input_projection_max_abs_diff = 0.00099802, masked_loss_abs_diff = 0.08580410
  • both checks report warmstart_equivalent = False

So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
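The step-0 checks reported above boil down to max-absolute-difference comparisons against a tolerance. The sketch below illustrates that shape of check; the function name, report keys, and tolerance are assumptions, not the logic of check_parallel_warmstart_equivalence.py.

```python
import numpy as np

def warmstart_report(single_out, parallel_out, single_loss, parallel_loss,
                     atol=1e-6):
    """Compare step-0 activations/losses of the single model vs. the
    warm-started parallel model. Illustrative sketch only: the real script
    runs both models on a real batch and compares intermediate tensors."""
    proj_diff = float(np.max(np.abs(np.asarray(single_out) - np.asarray(parallel_out))))
    loss_diff = float(abs(single_loss - parallel_loss))
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= atol and loss_diff <= atol,
    }
```

Under any reasonable tolerance, diffs on the order of 1e-3 (as measured above) make the check report non-equivalence.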

Split-Expert Bring-Up (2026-03-10)

The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag, action_expert_mode, takes one of four values:

  • shared
  • head_only_parallel
  • split_independent
  • split_communicating
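
A minimal sketch of how such a mode flag might be declared and validated is shown below. The dataclass and field layout are illustrative assumptions; only the four mode strings come from the repo.

```python
from dataclasses import dataclass

ACTION_EXPERT_MODES = (
    "shared",               # one expert serves both arms
    "head_only_parallel",   # shared trunk, parallel action heads
    "split_independent",    # separate left/right experts, no cross-attention
    "split_communicating",  # separate experts with cross-arm attention
)

@dataclass(frozen=True)
class PackedModelConfig:
    """Illustrative config fragment, not the actual openpi config class."""
    action_expert_mode: str = "shared"

    def __post_init__(self):
        if self.action_expert_mode not in ACTION_EXPERT_MODES:
            raise ValueError(
                f"unknown action_expert_mode: {self.action_expert_mode!r}; "
                f"expected one of {ACTION_EXPERT_MODES}"
            )
```

Failing fast on an unknown mode string keeps a typo from silently falling back to the shared path.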

Key bring-up results:

  • the split warm-start copies the original single gemma_expert into exact left/right expert branches for both split modes
  • split_independent passes the branch-local invariants:
    • identical left/right inputs produce identical suffix outputs
    • perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
  • both split modes pass detached real-data training on packed TWIN dual-push:
    • 3-step real-data smoke run with checkpoint save
    • 20-step real-data training run with checkpoint save
  • the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data 20-step run
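
The branch-local invariants above can be illustrated with a toy split model. This is a stand-in for intuition only, not the logic of check_split_expert_invariants.py; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

class ToySplitExpert:
    """Toy stand-in for the split_independent path: one linear branch per
    arm, with no cross-arm mixing anywhere in the forward pass."""
    def __init__(self, w_left, w_right):
        self.w_left, self.w_right = w_left, w_right

    def forward(self, left_in, right_in):
        return left_in @ self.w_left, right_in @ self.w_right

# Warm-start: both branches start as exact copies of one single-expert weight.
w_single = rng.normal(size=(DIM, DIM))
model = ToySplitExpert(w_single.copy(), w_single.copy())

x = rng.normal(size=DIM)
y = rng.normal(size=DIM)

# Invariant 1: identical left/right inputs produce identical branch outputs.
l_out, r_out = model.forward(x, x)
assert np.allclose(l_out, r_out)

# Invariant 2: perturbing right-arm inputs leaves left-arm outputs unchanged.
l_base, _ = model.forward(x, y)
l_pert, _ = model.forward(x, y + 1.0)
assert np.allclose(l_base, l_pert)
```

A split_communicating model would intentionally fail invariant 2, since cross-arm attention lets each arm see the other's inputs.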

New bring-up artifact bundle:

  • artifacts/twin_split_expert_bringup_20260310/
    • split warm-start checkpoints
    • invariant-check outputs
    • reproducibility commands
    • summary README for the split-expert bring-up

Repo layout

  • openpi/
    • modified source and scripts used for training/eval
    • copied norm-stats assets for the packed configs
    • full 2K, 10K, and dual-push 5K checkpoint trees
  • artifacts/twin_handover_packed_parallelization_20260309/
    • initial 2K study bundle
  • artifacts/twin_handover_packed_parallelization_10k_20260309/
    • 10K follow-up bundle with metrics, logs, repro manifests, and environment snapshot
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/
    • dual-push 128 screening bundle with metrics, logs, repro manifests, and environment snapshot
  • artifacts/twin_dual_push_128_stepcmp_2k_20260311/
    • dual-push 128 four-way 2K step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
  • artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/
    • small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
  • artifacts/twin_split_expert_bringup_20260310/
    • split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks

Committed artifact note

For this update, the committed artifact payloads are:

  • artifacts/twin_dual_push_128_stepcmp_2k_20260311/
    • the official finalized 4-model dual-push 2K step-comparison bundle
  • artifacts/twin_split_expert_bringup_20260310/
    • the split-expert bring-up bundle used as the sanity and warm-start reference
  • artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/
    • a small debug-only environment snapshot from the failed/resumed bring-up sequence

The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-_debug step-comparison bundle plus the split bring-up bundle.

  • openpi/run_logs/
    • raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under artifacts/twin_split_expert_bringup_20260310/run_logs/
  • openpi/scripts/upload_stepcmp_bundle_to_hf.py
    • the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses huggingface_hub.HfApi.upload_large_folder(...)
  • artifacts/pi05_base_params/
    • staged base parameter snapshot used during JAX-to-PyTorch conversion

Future commit/upload workflow

When adding new experiment results to this repo:

  • keep the canonical bundle under artifacts/<study_name>/ and only retain the checkpoint steps that are scientifically required under openpi/checkpoints/
  • before claiming the repo is fully committed, audit ignored artifact paths explicitly:
    • git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs
  • if a result is intentionally kept in an ignored path such as openpi/checkpoints/ or openpi/run_logs/, force-add it explicitly with git add --sparse -f ...
  • use openpi/scripts/upload_stepcmp_bundle_to_hf.py for large HF uploads; it uses huggingface_hub.HfApi.upload_large_folder(...) and is the preferred path for checkpoint-heavy updates
  • never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in HF_TOKEN or load it from HF_TOKEN_FILE, and check for literal hf_... strings before committing
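
The "check for literal hf_... strings" step can be automated with a small scan. The pattern, file set, and function name below are illustrative assumptions, not part of the repo's tooling.

```python
import re
from pathlib import Path

# Flag literal Hugging Face tokens (hf_ followed by a long alphanumeric run)
# in text files before committing. Reading HF_TOKEN from the environment is
# fine; a literal token in a file is not.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")

def scan_for_tokens(root, suffixes=(".py", ".sh", ".md", ".log")):
    """Return (path, line_number) pairs where a literal hf_ token appears."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in suffixes or not path.is_file():
            continue
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if TOKEN_RE.search(line):
                hits.append((str(path), i))
    return hits
```

Running the scan over scripts, logs, and READMEs before each commit catches the most common leak path: a token pasted into a debug log.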

Key files

  • Full report: REPORT.md
  • 2K summary: artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json
  • 10K summary: artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json
  • 10K comparison table: artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
  • dual-push 5K summary: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json
  • dual-push 5K teacher-forced table: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv
  • dual-push 5K sample eval table: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv
  • dual-push 5K environment snapshot: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/
  • dual-push 2K step-comparison summary: artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json
  • dual-push 2K step-comparison README: artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md
  • dual-push 2K teacher-forced table: artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv
  • dual-push 2K sample eval table: artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv
  • dual-push 2K training summary: artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv
  • split-expert bring-up summary: artifacts/twin_split_expert_bringup_20260310/README.md
  • split-expert repro commands: artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh
  • split-expert invariant check outputs: artifacts/twin_split_expert_bringup_20260310/sanity_checks/
  • split-expert real-data logs: openpi/run_logs/split_independent_real_smoke3_r2.log, openpi/run_logs/split_communicating_real_smoke3.log, openpi/run_logs/split_independent_real_train20.log, openpi/run_logs/split_communicating_real_train20.log
  • split-expert real-data checkpoints: openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/, openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/
  • 10K repro commands: artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
  • 10K changed-file manifest: artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
  • 10K environment snapshot: artifacts/twin_handover_packed_parallelization_10k_20260309/environment/

Main changed files

Initial 2K + 10K study logic lives primarily in:

  • openpi/src/openpi/transforms.py
  • openpi/src/openpi/training/config.py
  • openpi/src/openpi/training/data_loader.py
  • openpi/src/openpi/models/model.py
  • openpi/src/openpi/models/tokenizer.py
  • openpi/src/openpi/models_pytorch/pi0_pytorch.py
  • openpi/scripts/train_pytorch.py
  • openpi/scripts/eval_twin_val_loss_pytorch.py
  • openpi/scripts/init_parallel_pi05_from_single_pytorch.py
  • openpi/scripts/inspect_twin_packed_batch.py
  • openpi/scripts/check_parallel_warmstart_equivalence.py
  • openpi/scripts/check_split_expert_invariants.py
  • openpi/scripts/run_twin_handover_packed_followup.sh
  • openpi/scripts/run_twin_handover_packed_10k.sh
  • openpi/scripts/run_twin_dual_push_128_packed_5k.sh

The per-file rationale is recorded in:

  • artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt