Thinking with Visual Primitives (3B Proof-of-Concept)
This repository provides the inference code and LoRA adapter weights for a 3B-parameter proof-of-concept replication of the "Thinking with Visual Primitives" paradigm introduced by DeepSeek-AI.
While standard Multimodal LLMs have largely solved the Perception Gap through high-resolution cropping, they still suffer from the Reference Gap: the inherent inability of natural language to serve as a precise, unambiguous pointer within a continuous visual space. This model elevates spatial markers, specifically bounding boxes, to "minimal units of thought". By interleaving these visual primitives directly into its Chain-of-Thought (CoT), the model can "point" while it "reasons", anchoring abstract linguistic thoughts onto concrete spatial coordinates.
Note: This is an independent, open-weight 3B proof-of-concept designed to demonstrate the architectural viability of visual primitive grounding. The original paper utilizes a proprietary 284B-A13B MoE architecture. Training code is not included in this release.
Training Dataset
This model was trained exclusively on the COCO Object Detection dataset (detection-datasets/coco):
- SFT Phase: 50,000 samples from the COCO train split, filtered using a Visual-Geometric Quality Review to remove Mega Boxes (>90% area) and tiny ambiguous boxes (<1% area).
- GRPO (RL) Phase: 5,000 samples from the COCO validation split, filtered for "Normal-Level" difficulty (2–10 objects per image, target object occupying 5–60% of the image area) to ensure non-trivial RL learning signals.
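The two filters above can be sketched in a few lines. This is only an illustration of the stated thresholds, not the released training code (which is not included in this repository). The `Box` type, the function names, and the choice to count every annotated box toward the per-image object count are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixels; (x, y) is the top-left corner, COCO style."""
    x: float
    y: float
    w: float
    h: float


def area_fraction(box: Box, img_w: int, img_h: int) -> float:
    """Fraction of the image area covered by the box."""
    return (box.w * box.h) / float(img_w * img_h)


def keep_for_sft(box: Box, img_w: int, img_h: int) -> bool:
    """SFT filter: drop tiny boxes (<1% area) and Mega Boxes (>90% area)."""
    return 0.01 <= area_fraction(box, img_w, img_h) <= 0.90


def keep_for_grpo(target: Box, all_boxes: list, img_w: int, img_h: int) -> bool:
    """RL filter: 2-10 objects per image, target covering 5-60% of the image."""
    if not 2 <= len(all_boxes) <= 10:
        return False
    return 0.05 <= area_fraction(target, img_w, img_h) <= 0.60
```

For a 640x480 image, a 320x240 box covers 25% of the area and passes both filters, provided the image has between 2 and 10 annotated objects.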
Architecture
The model is built on a lightweight vision-language pipeline that mirrors the paper's token-efficiency philosophy:
- Vision Encoder: google/siglip-so400m-patch14-384
- Spatial Compressor: A 3x3 Average Pooling layer that compresses adjacent patch tokens to maximize visual token efficiency.
- Projector: A 2-layer MLP (GELU) bridging the vision and text embedding spaces.
- Language Backbone: Qwen/Qwen2.5-3B
- Vocabulary Extension: Added special tokens <ref>, </ref>, <box>, </box>, <point>, </point> to natively support visual primitive generation.
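To give a feel for what the Spatial Compressor does to the token budget, here is a dependency-free sketch of 3x3 average pooling over a square grid of patch embeddings. It is not the model's implementation. Non-overlapping windows (stride equal to the kernel size) are an assumption, and `avg_pool_tokens` is a hypothetical helper name. The 27x27 grid follows from a 384-pixel input cut into 14-pixel patches.

```python
def avg_pool_tokens(tokens, grid, k=3):
    """Average-pool a row-major list of patch vectors over k x k windows.

    tokens: list of grid*grid vectors (lists of floats), one per patch.
    Returns a row-major list of (grid // k) ** 2 pooled vectors.
    Assumes non-overlapping windows (stride == k).
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid*grid tokens")
    if grid % k != 0:
        raise ValueError("grid must be divisible by k")
    dim = len(tokens[0])
    pooled = []
    for by in range(0, grid, k):
        for bx in range(0, grid, k):
            acc = [0.0] * dim
            # Sum the k*k patch vectors inside this window.
            for dy in range(k):
                for dx in range(k):
                    vec = tokens[(by + dy) * grid + (bx + dx)]
                    for i in range(dim):
                        acc[i] += vec[i]
            pooled.append([a / (k * k) for a in acc])
    return pooled


grid = 384 // 14  # 27 patches per side, 729 patch tokens in total
tokens = [[float(i)] for i in range(grid * grid)]  # toy 1-dim embeddings
pooled = avg_pool_tokens(tokens, grid)
print(len(tokens), "->", len(pooled))  # prints: 729 -> 81
```

Under these assumptions each image costs 81 visual tokens instead of 729, a 9x reduction before the projector maps them into the language model's embedding space.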
Output Format:
1. **Intent Analysis**: The user wants me to locate the [object].
2. **Visual Grounding**: Scanning the scene... <ref>object</ref><box>[[x1,y1,x2,y2]]</box>
3. **Conclusion**: Coordinates anchored.
Coordinates are normalized to a discrete 0-999 grid relative to the dimensions of the square-padded image.
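As a rough illustration of this coordinate convention, the sketch below maps a pixel box on the original image onto the 0-999 grid of the padded image and renders it in the output format. It assumes the image is centered on a square canvas whose side is the longer image edge, as in the Quick Start preprocessing. Rounding to the nearest grid cell is an assumption, and `encode_box` and `format_target` are hypothetical helper names.

```python
def encode_box(box, orig_w, orig_h, grid_max=999):
    """Map a pixel box (x1, y1, x2, y2) to the 0-999 grid of the padded image."""
    max_dim = max(orig_w, orig_h)
    pad_x = (max_dim - orig_w) // 2  # horizontal offset added by centering
    pad_y = (max_dim - orig_h) // 2  # vertical offset added by centering

    def to_grid(value, pad):
        # Shift into padded-canvas pixels, scale to the grid, then clamp.
        cell = round((value + pad) / max_dim * grid_max)
        return max(0, min(grid_max, cell))

    x1, y1, x2, y2 = box
    return [to_grid(x1, pad_x), to_grid(y1, pad_y),
            to_grid(x2, pad_x), to_grid(y2, pad_y)]


def format_target(name, grid_box):
    """Render a name and grid box as <ref>name</ref><box>[[x1,y1,x2,y2]]</box>."""
    x1, y1, x2, y2 = grid_box
    return f"<ref>{name}</ref><box>[[{x1},{y1},{x2},{y2}]]</box>"


# A box covering an entire 640x480 image: the 80-pixel bands of vertical
# padding push y1 down to 125 and cap y2 at 874.
print(format_target("person", encode_box((0, 0, 640, 480), 640, 480)))
# prints: <ref>person</ref><box>[[0,125,999,874]]</box>
```

Because the grid is defined on the padded canvas, a landscape image never produces y-coordinates near 0 or 999. Any decoding step has to subtract the padding again.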
Example Inference
Prompt: "Locate the person in this image."
Model Output:
- Intent Analysis: The user wants me to locate the person in the image.
- Visual Grounding: Scanning the scene, I have identified the target entity. <ref>person</ref><box>[[327,162,625,825]]</box>
- Conclusion: The spatial coordinates have been successfully anchored.
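Downstream code usually needs the grounded objects as structured data rather than free text. The following sketch assumes the raw output carries the `<ref>` and `<box>` tags described under Output Format and extracts every (name, box) pair. `parse_primitives` is a hypothetical helper name, and silently skipping malformed boxes is a design choice of this sketch.

```python
import re

# One <ref>name</ref> immediately followed by one <box>[[x1,y1,x2,y2]]</box>.
PRIMITIVE_RE = re.compile(
    r"<ref>(?P<name>.*?)</ref>\s*<box>\[\[(?P<coords>[\d,\s]+)\]\]</box>"
)


def parse_primitives(text):
    """Return a list of (name, [x1, y1, x2, y2]) pairs found in model output."""
    results = []
    for match in PRIMITIVE_RE.finditer(text):
        parts = [p.strip() for p in match.group("coords").split(",")]
        if len(parts) != 4 or not all(p.isdigit() for p in parts):
            continue  # skip malformed boxes instead of raising
        coords = [int(p) for p in parts]
        if all(0 <= c <= 999 for c in coords):  # keep only on-grid values
            results.append((match.group("name").strip(), coords))
    return results


sample = (
    "Visual Grounding: Scanning the scene, I have identified the target entity. "
    "<ref>person</ref><box>[[327,162,625,825]]</box>"
)
print(parse_primitives(sample))  # prints: [('person', [327, 162, 625, 825])]
```

The returned boxes are still on the 0-999 grid of the padded image, so they need the reverse padding step before being drawn on the original image.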
Quick Start (Inference)
Ensure you have the required dependencies installed:
pip install torch transformers peft accelerate Pillow
Command Line Usage
python inference.py --image "path/to/image.jpg" --target "person"
Python API Usage
import torch
import re
from PIL import Image, ImageDraw, ImageFont
from transformers import AutoProcessor, AutoTokenizer
from peft import PeftModel
from model import VisualPrimitiveModel
# 1. Load Tokenizer and Processor
llm_id = "Qwen/Qwen2.5-3B"
vit_id = "google/siglip-so400m-patch14-384"
adapter_id = "MeowML/Visual-Primitives-Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(llm_id)
tokenizer.add_tokens(["<ref>", "</ref>", "<box>", "</box>", "<point>", "</point>"])
processor = AutoProcessor.from_pretrained(vit_id)
# 2. Load Base Model and LoRA Adapter
base_model = VisualPrimitiveModel(llm_path=llm_id, vit_path=vit_id)
base_model.llm.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.to("cuda", dtype=torch.float16).eval()
# 3. Prepare Image and Prompt (with square padding)
image = Image.open("test_image.jpg").convert("RGB")
orig_w, orig_h = image.size
max_dim = max(orig_w, orig_h)
padded_image = Image.new("RGB", (max_dim, max_dim), (255, 255, 255))
pad_x = (max_dim - orig_w) // 2
pad_y = (max_dim - orig_h) // 2
padded_image.paste(image, (pad_x, pad_y))
prompt = "Locate the main subject in this image and output its spatial location using the format: <ref>object_name</ref><box>[[x1,y1,x2,y2]]</box>."
chat_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
pixel_values = processor(images=padded_image, return_tensors="pt")["pixel_values"].to("cuda", dtype=torch.float16)
# 4. Generate
outputs = model.generate(**inputs, pixel_values=pixel_values, max_new_tokens=150, do_sample=False)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
print(generated_text)
# 5. Reverse Padding Math to get original coordinates
matches = re.findall(r'<box>\[\[(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3})\]\]</box>', generated_text)
if matches:
    # Use the last box in the output, then undo the 0-999 scaling and the padding offsets
    x1, y1, x2, y2 = map(int, matches[-1])
    abs_x1 = max(0, min(int(((x1 / 999) * max_dim) - pad_x), orig_w))
    abs_y1 = max(0, min(int(((y1 / 999) * max_dim) - pad_y), orig_h))
    abs_x2 = max(0, min(int(((x2 / 999) * max_dim) - pad_x), orig_w))
    abs_y2 = max(0, min(int(((y2 / 999) * max_dim) - pad_y), orig_h))
    print(f"Original Coordinates: [{abs_x1}, {abs_y1}, {abs_x2}, {abs_y2}]")
Acknowledgements & Citation
This project is an independent architectural replication inspired by the groundbreaking research from DeepSeek-AI:
@article{lu2026thinking,
title={Thinking with Visual Primitives},
author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and others},
journal={arXiv preprint},
year={2026}
}
Special thanks to:
- DeepSeek-AI for open-sourcing the "Thinking with Visual Primitives" methodology and highlighting the Reference Gap.
- Qwen Team for the excellent Qwen2.5-3B base model.
- Google for the SigLIP vision encoder.