OneVision Task-Aware Action Decoder — checkpoints

Lightweight task-aware action decoder for computer-use / instructional-video understanding. Built on a frozen OneVision vision encoder (lmms-lab-encoder/onevision-encoder-large-lang) and Qwen/Qwen3-4B as the text encoder/decoder, LoRA-tuned (r=8, alpha=16, dropout=0.05, applied to both vision and text). Trained with train_onevision_lw_decoder.py in proteusagi/onevision-compuse under onevision-lightweight-decoder/.

Given video visible-token embeddings plus a task prompt, the model predicts, per frame:

discrete action / key / modifier classes,
a click heatmap (actor query attending over visual tokens),
(new_arch only) an autoregressive per-frame transcript.

Checkpoints (~8.7 GB each — LoRA adapters + task heads)

`new_arch/best.pt` — "parallel heads" architecture (2026-03-16) — RECOMMENDED

Adds an autoregressive transcript decoder (per-frame transcript via the Qwen LM head: visual features + transcript history of frames 0..t-1 predict frame t) on top of the action/key/modifier classification heads and the click-heatmap head.
Modularized encode/decode, parallel prediction heads, and an efficient-inference path (git commits: "parallel heads", "efficient inference", "further speedup").
Training performance: train action accuracy ~64-66%, train loss ~0.9 (see accuracy.png and loss.png in the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics recover afterward.
Choose this by default: most capable (also generates transcripts) and the actively developed arch.

`onevision_task_action_decoder/best.pt` — original architecture (2026-03-10)

The original OneVisionTaskAwareActionDecoder: action/key/modifier classification heads plus a click-heatmap head over OneVision visual tokens, with frame self-attention and prompt cross-attention. Predates the transcript decoder (which was added 2026-03-16), so it does action/key/modifier + click only.
Choose this if you specifically want the lighter, action-only baseline or to reproduce the March-10 results.

Also in this repo: onevision_task_action_decoder/epoch_*.pt (earlier epoch snapshots of the original run) and onevision_task_action_decoder_8F/best.pt (an 8-frame-clip variant).

Which to choose

new_arch/best.pt — full capability (action + transcript), latest architecture. Default pick.
onevision_task_action_decoder/best.pt — original action-only model / March-10 baseline.

Loading

import torch
ckpt = torch.load("new_arch/best.pt", map_location="cpu")
# State for OneVisionTaskAwareActionDecoder:
#   vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
#   text_encoder_name   = "Qwen/Qwen3-4B"
#   LoRA r=8, alpha=16 on both encoders

See onevision-lightweight-decoder/src/model.py in proteusagi/onevision-compuse for the module definition and train_onevision_lw_decoder.py for the training/eval pipeline.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support