OneVision Task-Aware Action Decoder β checkpoints
Lightweight task-aware action decoder for computer-use / instructional-video understanding.
Built on a frozen OneVision vision encoder (lmms-lab-encoder/onevision-encoder-large-lang)
and Qwen/Qwen3-4B as the text encoder/decoder, LoRA-tuned (r=8, alpha=16, dropout=0.05,
applied to both vision and text). Trained with train_onevision_lw_decoder.py in
proteusagi/onevision-compuse under
onevision-lightweight-decoder/.
Given video visible-token embeddings plus a task prompt, the model predicts, per frame:
- discrete action / key / modifier classes,
- a click heatmap (actor query attending over visual tokens),
- (new_arch only) an autoregressive per-frame transcript.
Checkpoints (~8.7 GB each β LoRA adapters + task heads)
new_arch/best.pt β "parallel heads" architecture (2026-03-16) β RECOMMENDED
- Adds an autoregressive transcript decoder (per-frame transcript via the Qwen LM head: visual features + transcript history of frames 0..t-1 predict frame t) on top of the action/key/modifier classification heads and the click-heatmap head.
- Modularized
encode/decode, parallel prediction heads, and an efficient-inference path (git commits: "parallel heads", "efficient inference", "further speedup"). - Training performance: train action accuracy ~64-66%, train loss ~0.9 (see
accuracy.pngandloss.pngin the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics recover afterward. - Choose this by default: most capable (also generates transcripts) and the actively developed arch.
onevision_task_action_decoder/best.pt β original architecture (2026-03-10)
- The original
OneVisionTaskAwareActionDecoder: action/key/modifier classification heads plus a click-heatmap head over OneVision visual tokens, with frame self-attention and prompt cross-attention. Predates the transcript decoder (which was added 2026-03-16), so it does action/key/modifier + click only. - Choose this if you specifically want the lighter, action-only baseline or to reproduce the March-10 results.
Also in this repo:
onevision_task_action_decoder/epoch_*.pt(earlier epoch snapshots of the original run) andonevision_task_action_decoder_8F/best.pt(an 8-frame-clip variant).
Which to choose
new_arch/best.ptβ full capability (action + transcript), latest architecture. Default pick.onevision_task_action_decoder/best.ptβ original action-only model / March-10 baseline.
Loading
import torch
ckpt = torch.load("new_arch/best.pt", map_location="cpu")
# State for OneVisionTaskAwareActionDecoder:
# vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
# text_encoder_name = "Qwen/Qwen3-4B"
# LoRA r=8, alpha=16 on both encoders
See onevision-lightweight-decoder/src/model.py in proteusagi/onevision-compuse for the module
definition and train_onevision_lw_decoder.py for the training/eval pipeline.