| --- |
| license: other |
| tags: |
| - computer-use |
| - video |
| - action-decoder |
| - onevision |
| - qwen |
| --- |
| |
| # OneVision Task-Aware Action Decoder β checkpoints |
|
|
| Lightweight task-aware action decoder for computer-use / instructional-video understanding. |
| Built on a frozen **OneVision vision encoder** (`lmms-lab-encoder/onevision-encoder-large-lang`) |
| and **Qwen/Qwen3-4B** as the text encoder/decoder, **LoRA**-tuned (r=8, alpha=16, dropout=0.05, |
| applied to both vision and text). Trained with `train_onevision_lw_decoder.py` in |
| [proteusagi/onevision-compuse](https://github.com/proteusagi/onevision-compuse) under |
| `onevision-lightweight-decoder/`. |
|
|
| Given video visible-token embeddings plus a task prompt, the model predicts, per frame: |
| - discrete **action / key / modifier** classes, |
| - a **click heatmap** (actor query attending over visual tokens), |
| - (new_arch only) an **autoregressive per-frame transcript**. |
| |
| ## Checkpoints (~8.7 GB each β LoRA adapters + task heads) |
| |
| ### `new_arch/best.pt` β "parallel heads" architecture (2026-03-16) β RECOMMENDED |
| - Adds an **autoregressive transcript decoder** (per-frame transcript via the Qwen LM head: |
| visual features + transcript history of frames 0..t-1 predict frame t) on top of the |
| action/key/modifier classification heads and the click-heatmap head. |
| - Modularized `encode`/`decode`, **parallel prediction heads**, and an **efficient-inference** |
| path (git commits: "parallel heads", "efficient inference", "further speedup"). |
| - Training performance: **train action accuracy ~64-66%**, train loss ~0.9 (see `accuracy.png` |
| and `loss.png` in the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics |
| recover afterward. |
| - Choose this by default: most capable (also generates transcripts) and the actively developed arch. |
|
|
| ### `onevision_task_action_decoder/best.pt` β original architecture (2026-03-10) |
| - The original `OneVisionTaskAwareActionDecoder`: action/key/modifier classification heads plus a |
| click-heatmap head over OneVision visual tokens, with frame self-attention and prompt |
| cross-attention. **Predates the transcript decoder** (which was added 2026-03-16), so it does |
| action/key/modifier + click only. |
| - Choose this if you specifically want the lighter, action-only baseline or to reproduce the |
| March-10 results. |
| |
| > Also in this repo: `onevision_task_action_decoder/epoch_*.pt` (earlier epoch snapshots of the |
| > original run) and `onevision_task_action_decoder_8F/best.pt` (an 8-frame-clip variant). |
| |
| ## Which to choose |
| - **`new_arch/best.pt`** β full capability (action + transcript), latest architecture. Default pick. |
| - **`onevision_task_action_decoder/best.pt`** β original action-only model / March-10 baseline. |
| |
| ## Loading |
| ```python |
| import torch |
| ckpt = torch.load("new_arch/best.pt", map_location="cpu") |
| # State for OneVisionTaskAwareActionDecoder: |
| # vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang" |
| # text_encoder_name = "Qwen/Qwen3-4B" |
| # LoRA r=8, alpha=16 on both encoders |
| ``` |
| See `onevision-lightweight-decoder/src/model.py` in proteusagi/onevision-compuse for the module |
| definition and `train_onevision_lw_decoder.py` for the training/eval pipeline. |
|
|