CALVIN ABC→D VLA Checkpoints (starVLA)
Vision-Language-Action models trained on the CALVIN ABC→D benchmark (train on environments A/B/C, evaluate on held-out environment D) using the starVLA framework.
Each run directory contains `checkpoints/steps_<N>_pytorch_model.pt` and, for checkpoints that have been evaluated, the corresponding `eval_steps<N>_*/calvin_eval/results.json`.
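As a rough sketch of how this layout can be walked, the snippet below pairs each checkpoint with its eval result, if one exists. It assumes `<N>` is a plain integer step count in the file and directory names; the helper name `index_run` is hypothetical and not part of starVLA.

```python
import re
from pathlib import Path

# Assumed layout, taken from the description above:
#   <run_dir>/checkpoints/steps_<N>_pytorch_model.pt
#   <run_dir>/eval_steps<N>_*/calvin_eval/results.json
# Assumption: <N> is written as plain digits (e.g. 25000).
CKPT_RE = re.compile(r"steps_(\d+)_pytorch_model\.pt$")
EVAL_RE = re.compile(r"eval_steps_?(\d+)_")


def index_run(run_dir):
    """Map each checkpoint step to its checkpoint path and, if present, its results.json."""
    run_dir = Path(run_dir)
    index = {}
    # Collect every saved checkpoint and parse its step number.
    for ckpt in (run_dir / "checkpoints").glob("steps_*_pytorch_model.pt"):
        m = CKPT_RE.search(ckpt.name)
        if m:
            index[int(m.group(1))] = {"checkpoint": ckpt, "results": None}
    # Attach eval results to the checkpoint with the same step number.
    for res in run_dir.glob("eval_steps*/calvin_eval/results.json"):
        m = EVAL_RE.match(res.parent.parent.name)
        if m and int(m.group(1)) in index:
            index[int(m.group(1))]["results"] = res
    # Return entries ordered by training step.
    return dict(sorted(index.items()))
```

Checkpoints whose `results` entry is `None` have not been evaluated yet. The schema of `results.json` is not described here, so the sketch stops at locating the file.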
Runs
| Folder | Backbone | Init | Framework | Steps (save interval) | Best avg_seq_len |
|---|---|---|---|---|---|
| qwen3vl_2b_pi_v3_1519 | Qwen3-VL-2B-Instruct | vanilla | PI_v3 (flow-matching, chunk-10, aug, b256) | 200K (25K) | 3.336 @ 25K |
| internvl35_1b_pt0221_pi_v3_1577 | InternVL3.5-1B | embodied-PT 0221 | PI_v3 | 100K→TIMEOUT@75K (12.5K) | 1.076 @ 75K |
| qwen3vl_2b_pt0221_pi_v3_1578 | Qwen3-VL-2B-Instruct | embodied-PT 0221 | PI_v3 | 100K (12.5K) | 3.064 @ 100K |
| internvl35_1b_vanilla_pi_v3_1579 | InternVL3.5-1B | vanilla | PI_v3 | 100K (12.5K) | 2.052 @ 50K |
| internvl35_1b_pt0221_oft_1844 | InternVL3.5-1B | embodied-PT 0221 | OFT (DiT-B, chunk-10, b64) | 80K (10K) | not yet evaluated |
| qwen3vl_2b_pt0221_oft_1845 | Qwen3-VL-2B-Instruct | embodied-PT 0221 | OFT | 80K (10K) | not yet evaluated |
| internvl35_1b_vanilla_oft_1846 | InternVL3.5-1B | vanilla | OFT | 80K (10K) | not yet evaluated |
| internvl35_1b_pt0210_oft_1847 | InternVL3.5-1B | embodied-PT 0210 | OFT | CANCELLED @ 10K (partial) | not yet evaluated |
Key findings (PI_v3 runs)
- Best evaluated model: Qwen3-VL-2B vanilla, PI_v3, step 25K → 3.336 avg_seq_len.
- Embodied pretraining (PT0221) hurts InternVL3.5-1B on CALVIN: the vanilla run beats the PT0221 run at every matched checkpoint (mean Δ ≈ +0.55 avg_seq_len, vanilla minus PT0221).
- Long training overfits: run 1519 (qwen3vl_2b_pi_v3_1519) peaks at 25K (3.336), then decays to 2.84 by 200K.
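The comparisons above reduce to two small operations on per-checkpoint scores: picking the best step of a run, and averaging the score difference over the steps that two runs share. A minimal sketch, assuming each run's scores are held in a dict mapping step to avg_seq_len (the function names are hypothetical, and the numbers are toy values, not results from any run listed here):

```python
def best_checkpoint(scores):
    """Return (step, score) for the checkpoint with the highest avg_seq_len."""
    step = max(scores, key=scores.get)
    return step, scores[step]


def mean_matched_delta(scores_a, scores_b):
    """Mean of (a - b) over the steps that were evaluated in both runs."""
    shared = sorted(set(scores_a) & set(scores_b))
    if not shared:
        raise ValueError("the two runs have no matched checkpoints")
    return sum(scores_a[s] - scores_b[s] for s in shared) / len(shared)


# Toy numbers for illustration only; NOT the results of any run in this repo.
toy_a = {10: 1.0, 20: 2.0, 30: 1.5}
toy_b = {10: 0.5, 20: 1.0, 40: 3.0}

print(best_checkpoint(toy_a))            # (20, 2.0)
print(mean_matched_delta(toy_a, toy_b))  # 0.75
```

Only steps present in both dicts contribute to the delta, so a run that stopped early is compared on the checkpoints it actually reached.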
Eval protocol
CALVIN long-horizon evaluation: 1000 chained sequences of 5 subtasks each. avg_seq_len ∈ [0, 5] is the mean number of consecutive subtasks completed per sequence before the first failure.
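To make the metric concrete, here is a small sketch that computes avg_seq_len from per-sequence completion counts, along with the fraction of sequences that reach each chain length. The function name `calvin_metrics` and the input format are assumptions for illustration; the actual contents of `results.json` may differ.

```python
def calvin_metrics(completed, chain_len=5):
    """Compute avg_seq_len and per-length success rates.

    completed[i] is the number of consecutive subtasks solved in sequence i
    before the first failure (an integer in 0..chain_len).
    """
    if not completed:
        raise ValueError("need at least one evaluated sequence")
    if any(not 0 <= c <= chain_len for c in completed):
        raise ValueError("counts must lie in [0, chain_len]")
    n = len(completed)
    # success_rate[k-1] = fraction of sequences that solved at least k subtasks in a row.
    success_rate = [sum(c >= k for c in completed) / n for k in range(1, chain_len + 1)]
    # avg_seq_len is the plain mean of the per-sequence counts.
    avg_seq_len = sum(completed) / n
    return avg_seq_len, success_rate


# Toy example with four sequences (the real protocol uses 1000).
avg, rates = calvin_metrics([5, 3, 0, 1])
print(avg)    # 2.25
print(rates)  # [0.75, 0.5, 0.5, 0.25, 0.25]
```

The per-length success rates sum to avg_seq_len (here 0.75 + 0.5 + 0.5 + 0.25 + 0.25 = 2.25), which is a handy consistency check when both are reported.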