Inverse Dynamics Model for Action-Annotating Screencasts

p-doom/idm is an inverse dynamics model that predicts user input actions from short windows of screen recordings. Given 10 consecutive screenshots, it emits the key presses, mouse clicks, cursor movements and scroll events that are visually implied by the frames.

Please refer to the IDM blog post for details and experiments.

Model Summary

Base model: Qwen/Qwen3-VL-8B-Instruct
Training data: macOS crowd-cast paired screencasts and OS input logs
Training method: LoRA on language and vision modules, merged after training
Input: 10 screenshots sampled at 5 FPS
Output: sparse JSON action list
Eval: 44 manually verified macOS productivity clips
Result: F1 0.86, MouseMove R² 0.66, MouseMove cosine similarity 0.99

Input Format

Provide one chat message with 10 images sampled at 5 FPS. Each image should be preceded by a text label:

Frame F00: <image>
Frame F01: <image>
...
Frame F09: <image>

The frame labels are text anchors in the message, not labels rendered into the image pixels.

Output Format

The model emits only a JSON array:

[
  {"frame": "F02", "type": "MouseMove", "details": "120,45"},
  {"frame": "F03", "type": "MouseClick", "details": "Left"},
  {"frame": "F05", "type": "KeyPress", "details": "Cmd+S"},
  {"frame": "F07", "type": "MouseScroll", "details": "-150"}
]

Action types:

KeyPress: key name with modifiers, e.g. Cmd+S, Return, A
MouseClick: Left, Right, or Middle
MouseMove: normalized dx,dy, where 1000 is a full screen-width or screen-height traversal
MouseScroll: normalized signed scroll magnitude

Frame attribution: if an effect first appears between F_K and F_{K+1}, report the action on F_K, the last pre-action frame.

Related Releases

p-doom/AGI-CAST-0.6k: source AGI-CAST screencasts
p-doom/AGI-CAST-idm-actions: AGI-CAST action annotations generated with this model
p-doom/idm-eval-set: manually verified IDM evaluation clips

Limitations

The model was trained on macOS clips and can confuse OS-specific shortcuts such as Cmd vs Ctrl.
Labels are inferred from pixels, so actions with no visual evidence can be missed or hallucinated.
Fine-grained timing, cursor movement magnitude and scroll magnitude can be noisy.

Downloads last month: 12

Safetensors

Model size

9B params

Tensor type

BF16

Model tree for p-doom/idm

Base model

Qwen/Qwen3-VL-8B-Instruct

Finetuned

(313)

this model