Proteus-Computer-Use
/

onevision-task-action-decoder

Model card Files Files and versions

onevision-task-action-decoder / README.md

yushg's picture

Upload README.md with huggingface_hub

de57ee4 verified 3 days ago

|

History Blame Contribute Delete

3.18 kB

	---
	license: other
	tags:
	- computer-use
	- video
	- action-decoder
	- onevision
	- qwen
	---

	# OneVision Task-Aware Action Decoder — checkpoints

	Lightweight task-aware action decoder for computer-use / instructional-video understanding.
	Built on a frozen OneVision vision encoder (`lmms-lab-encoder/onevision-encoder-large-lang`)
	and Qwen/Qwen3-4B as the text encoder/decoder, LoRA-tuned (r=8, alpha=16, dropout=0.05,
	applied to both vision and text). Trained with `train_onevision_lw_decoder.py` in
	[proteusagi/onevision-compuse](https://github.com/proteusagi/onevision-compuse) under
	`onevision-lightweight-decoder/`.

	Given video visible-token embeddings plus a task prompt, the model predicts, per frame:
	- discrete action / key / modifier classes,
	- a click heatmap (actor query attending over visual tokens),
	- (new_arch only) an autoregressive per-frame transcript.

	## Checkpoints (~8.7 GB each — LoRA adapters + task heads)

	### `new_arch/best.pt` — "parallel heads" architecture (2026-03-16) — RECOMMENDED
	- Adds an autoregressive transcript decoder (per-frame transcript via the Qwen LM head:
	visual features + transcript history of frames 0..t-1 predict frame t) on top of the
	action/key/modifier classification heads and the click-heatmap head.
	- Modularized `encode`/`decode`, parallel prediction heads, and an efficient-inference
	path (git commits: "parallel heads", "efficient inference", "further speedup").
	- Training performance: train action accuracy ~64-66%, train loss ~0.9 (see `accuracy.png`
	and `loss.png` in the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics
	recover afterward.
	- Choose this by default: most capable (also generates transcripts) and the actively developed arch.

	### `onevision_task_action_decoder/best.pt` — original architecture (2026-03-10)
	- The original `OneVisionTaskAwareActionDecoder`: action/key/modifier classification heads plus a
	click-heatmap head over OneVision visual tokens, with frame self-attention and prompt
	cross-attention. Predates the transcript decoder (which was added 2026-03-16), so it does
	action/key/modifier + click only.
	- Choose this if you specifically want the lighter, action-only baseline or to reproduce the
	March-10 results.

	> Also in this repo: `onevision_task_action_decoder/epoch_*.pt` (earlier epoch snapshots of the
	> original run) and `onevision_task_action_decoder_8F/best.pt` (an 8-frame-clip variant).

	## Which to choose
	- `new_arch/best.pt` — full capability (action + transcript), latest architecture. Default pick.
	- `onevision_task_action_decoder/best.pt` — original action-only model / March-10 baseline.

	## Loading
	```python
	import torch
	ckpt = torch.load("new_arch/best.pt", map_location="cpu")
	# State for OneVisionTaskAwareActionDecoder:
	# vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
	# text_encoder_name = "Qwen/Qwen3-4B"
	# LoRA r=8, alpha=16 on both encoders
	```
	See `onevision-lightweight-decoder/src/model.py` in proteusagi/onevision-compuse for the module
	definition and `train_onevision_lw_decoder.py` for the training/eval pipeline.