yushg's picture
Upload README.md with huggingface_hub
de57ee4 verified
|
Raw
History Blame Contribute Delete
3.18 kB
---
license: other
tags:
- computer-use
- video
- action-decoder
- onevision
- qwen
---
# OneVision Task-Aware Action Decoder β€” checkpoints
Lightweight task-aware action decoder for computer-use / instructional-video understanding.
Built on a frozen **OneVision vision encoder** (`lmms-lab-encoder/onevision-encoder-large-lang`)
and **Qwen/Qwen3-4B** as the text encoder/decoder, **LoRA**-tuned (r=8, alpha=16, dropout=0.05,
applied to both vision and text). Trained with `train_onevision_lw_decoder.py` in
[proteusagi/onevision-compuse](https://github.com/proteusagi/onevision-compuse) under
`onevision-lightweight-decoder/`.
Given video visible-token embeddings plus a task prompt, the model predicts, per frame:
- discrete **action / key / modifier** classes,
- a **click heatmap** (actor query attending over visual tokens),
- (new_arch only) an **autoregressive per-frame transcript**.
## Checkpoints (~8.7 GB each β€” LoRA adapters + task heads)
### `new_arch/best.pt` β€” "parallel heads" architecture (2026-03-16) β€” RECOMMENDED
- Adds an **autoregressive transcript decoder** (per-frame transcript via the Qwen LM head:
visual features + transcript history of frames 0..t-1 predict frame t) on top of the
action/key/modifier classification heads and the click-heatmap head.
- Modularized `encode`/`decode`, **parallel prediction heads**, and an **efficient-inference**
path (git commits: "parallel heads", "efficient inference", "further speedup").
- Training performance: **train action accuracy ~64-66%**, train loss ~0.9 (see `accuracy.png`
and `loss.png` in the repo). The epoch ~85-117 plateau is a curriculum/data change; metrics
recover afterward.
- Choose this by default: most capable (also generates transcripts) and the actively developed arch.
### `onevision_task_action_decoder/best.pt` β€” original architecture (2026-03-10)
- The original `OneVisionTaskAwareActionDecoder`: action/key/modifier classification heads plus a
click-heatmap head over OneVision visual tokens, with frame self-attention and prompt
cross-attention. **Predates the transcript decoder** (which was added 2026-03-16), so it does
action/key/modifier + click only.
- Choose this if you specifically want the lighter, action-only baseline or to reproduce the
March-10 results.
> Also in this repo: `onevision_task_action_decoder/epoch_*.pt` (earlier epoch snapshots of the
> original run) and `onevision_task_action_decoder_8F/best.pt` (an 8-frame-clip variant).
## Which to choose
- **`new_arch/best.pt`** β€” full capability (action + transcript), latest architecture. Default pick.
- **`onevision_task_action_decoder/best.pt`** β€” original action-only model / March-10 baseline.
## Loading
```python
import torch
ckpt = torch.load("new_arch/best.pt", map_location="cpu")
# State for OneVisionTaskAwareActionDecoder:
# vision_encoder_name = "lmms-lab-encoder/onevision-encoder-large-lang"
# text_encoder_name = "Qwen/Qwen3-4B"
# LoRA r=8, alpha=16 on both encoders
```
See `onevision-lightweight-decoder/src/model.py` in proteusagi/onevision-compuse for the module
definition and `train_onevision_lw_decoder.py` for the training/eval pipeline.