Inverse Dynamics Model for Action-Annotating Screencasts
p-doom/idm is an inverse dynamics model that predicts user input actions from short windows of screen recordings. Given 10 consecutive screenshots, it emits the key presses, mouse clicks, cursor movements and scroll events that are visually implied by the frames.
Please refer to the IDM blog post for details and experiments.
Model Summary
- Base model:
Qwen/Qwen3-VL-8B-Instruct - Training data: macOS crowd-cast paired screencasts and OS input logs
- Training method: LoRA on language and vision modules, merged after training
- Input: 10 screenshots sampled at 5 FPS
- Output: sparse JSON action list
- Eval: 44 manually verified macOS productivity clips
- Result: F1 0.86, MouseMove R² 0.66, MouseMove cosine similarity 0.99
Input Format
Provide one chat message with 10 images sampled at 5 FPS. Each image should be preceded by a text label:
Frame F00: <image>
Frame F01: <image>
...
Frame F09: <image>
The frame labels are text anchors in the message, not labels rendered into the image pixels.
Output Format
The model emits only a JSON array:
[
{"frame": "F02", "type": "MouseMove", "details": "120,45"},
{"frame": "F03", "type": "MouseClick", "details": "Left"},
{"frame": "F05", "type": "KeyPress", "details": "Cmd+S"},
{"frame": "F07", "type": "MouseScroll", "details": "-150"}
]
Action types:
KeyPress: key name with modifiers, e.g.Cmd+S,Return,AMouseClick:Left,Right, orMiddleMouseMove: normalizeddx,dy, where1000is a full screen-width or screen-height traversalMouseScroll: normalized signed scroll magnitude
Frame attribution: if an effect first appears between F_K and F_{K+1}, report the action on F_K, the last pre-action frame.
Related Releases
p-doom/AGI-CAST-0.6k: source AGI-CAST screencastsp-doom/AGI-CAST-idm-actions: AGI-CAST action annotations generated with this modelp-doom/idm-eval-set: manually verified IDM evaluation clips
Limitations
- The model was trained on macOS clips and can confuse OS-specific shortcuts such as
CmdvsCtrl. - Labels are inferred from pixels, so actions with no visual evidence can be missed or hallucinated.
- Fine-grained timing, cursor movement magnitude and scroll magnitude can be noisy.
- Downloads last month
- 12
Model tree for p-doom/idm
Base model
Qwen/Qwen3-VL-8B-Instruct