p(doom)

Homepage Hugging Face Discord GitHub Twitter Follow


Inverse Dynamics Model for Action-Annotating Screencasts

p-doom/idm is an inverse dynamics model that predicts user input actions from short windows of screen recordings. Given 10 consecutive screenshots, it emits the key presses, mouse clicks, cursor movements and scroll events that are visually implied by the frames.

Please refer to the IDM blog post for details and experiments.

Model Summary

  • Base model: Qwen/Qwen3-VL-8B-Instruct
  • Training data: macOS crowd-cast paired screencasts and OS input logs
  • Training method: LoRA on language and vision modules, merged after training
  • Input: 10 screenshots sampled at 5 FPS
  • Output: sparse JSON action list
  • Eval: 44 manually verified macOS productivity clips
  • Result: F1 0.86, MouseMove R² 0.66, MouseMove cosine similarity 0.99

Input Format

Provide one chat message with 10 images sampled at 5 FPS. Each image should be preceded by a text label:

Frame F00: <image>
Frame F01: <image>
...
Frame F09: <image>

The frame labels are text anchors in the message, not labels rendered into the image pixels.

Output Format

The model emits only a JSON array:

[
  {"frame": "F02", "type": "MouseMove", "details": "120,45"},
  {"frame": "F03", "type": "MouseClick", "details": "Left"},
  {"frame": "F05", "type": "KeyPress", "details": "Cmd+S"},
  {"frame": "F07", "type": "MouseScroll", "details": "-150"}
]

Action types:

  • KeyPress: key name with modifiers, e.g. Cmd+S, Return, A
  • MouseClick: Left, Right, or Middle
  • MouseMove: normalized dx,dy, where 1000 is a full screen-width or screen-height traversal
  • MouseScroll: normalized signed scroll magnitude

Frame attribution: if an effect first appears between F_K and F_{K+1}, report the action on F_K, the last pre-action frame.

Related Releases

Limitations

  • The model was trained on macOS clips and can confuse OS-specific shortcuts such as Cmd vs Ctrl.
  • Labels are inferred from pixels, so actions with no visual evidence can be missed or hallucinated.
  • Fine-grained timing, cursor movement magnitude and scroll magnitude can be noisy.
Downloads last month
12
Safetensors
Model size
9B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for p-doom/idm

Finetuned
(313)
this model