---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
  - TESS-Computer/quickdraw-circles
tags:
  - trajectory-prediction
  - diffusion-transformer
  - vision-language
  - robotics
  - drawing
pipeline_tag: image-to-image
---

# Qwen-DiT-Draw

A Vision-Language Model with a Diffusion Transformer (DiT) head for trajectory prediction. Given an image and a text instruction, the model predicts a drawing trajectory.

**Architecture**: Frozen Qwen2.5-VL-3B backbone + trainable DiT action head (36.7M params)

## Model Details

- **Base Model**: Qwen/Qwen2.5-VL-3B-Instruct
- **Training Data**: TESS-Computer/quickdraw-circles (21k circle drawings)
- **Architecture**: GR00T-style chunked prediction with flow matching (see the sketch after this list)
- **Trainable Parameters**: 36.7M (DiT head only; VLM frozen)
- **Chunk Size**: 16 points per chunk
- **Output**: (x, y, state) per point, where state > 0.5 signals the end of the drawing
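
The flow-matching head itself is not reproduced on this card. As a rough mental model only, here is a minimal sketch of how a flow-matching action head turns noise into a trajectory chunk; the Euler loop, step count, and `velocity_net` are illustrative assumptions, not the repo's actual API:

```python
import torch

# Illustrative sketch of flow-matching sampling for one (16, 3) chunk.
# `velocity_net` and `steps` are hypothetical stand-ins; the real head lives in
# https://github.com/HusseinLezzaik/Qwen-DiT-Draw.
def sample_chunk(velocity_net, vlm_features, chunk_size=16, steps=10):
    x = torch.randn(1, chunk_size, 3)         # start from Gaussian noise
    dt = 1.0 / steps
    for step in range(steps):
        t = torch.full((1,), step * dt)       # current flow time in [0, 1)
        v = velocity_net(x, t, vlm_features)  # predicted velocity field
        x = x + v * dt                        # Euler step toward the data
    return x                                  # denoised (x, y, state) chunk
```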

## Usage

```python
import torch
from PIL import Image
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# You need the model code from: https://github.com/HusseinLezzaik/Qwen-DiT-Draw
from src.model import Qwen2_5_VL_Draw, TrajectoryConfig

# Load model
config = TrajectoryConfig(chunk_size=16, dit_hidden_size=512, dit_num_layers=6)
model = Qwen2_5_VL_Draw(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    freeze_backbone=True,
    dtype=torch.bfloat16,
)

# Load trained weights
from huggingface_hub import hf_hub_download
weights_path = hf_hub_download(repo_id="TESS-Computer/qwen-dit-draw", filename="trajectory_head.pt")
model.trajectory_head.load_state_dict(torch.load(weights_path, map_location="cpu", weights_only=True))
model = model.to("cuda").eval()

# Load processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Create input
image = Image.new("RGB", (512, 512), "white")  # White canvas
instruction = "draw a circle"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
        {"type": "text", "text": instruction},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

# Predict trajectory chunk
with torch.no_grad():
    chunk = model.predict_chunk(**inputs)

chunk = chunk[0].float().cpu().numpy()  # (16, 3) - (x, y, state)
print(f"Predicted {len(chunk)} points")
for i, (x, y, state) in enumerate(chunk):
    print(f"  Point {i}: ({x:.3f}, {y:.3f}), stop={state > 0.5}")
```
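
Coordinates are normalized to [0, 1] relative to the canvas, so multiply by the image size (512 here) to recover pixel positions, as done in the multi-chunk loop below.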

## Multi-Chunk Inference (Full Drawing)

For complete drawings, use a visual feedback loop: after each predicted chunk, draw it onto the canvas and feed the updated image back to the model.

```python
from PIL import ImageDraw

canvas = Image.new("RGB", (512, 512), "white")
all_points = []
max_chunks = 10

for chunk_idx in range(max_chunks):
    # Prepare inputs with the current canvas state
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
            {"type": "text", "text": "draw a circle"},
        ],
    }]
    # ... process and predict as in the single-chunk example above ...

    # Draw the chunk on the canvas (use BLACK lines to match training!)
    draw = ImageDraw.Draw(canvas)
    done = False
    for i in range(1, len(chunk)):
        x1, y1 = int(chunk[i - 1][0] * 512), int(chunk[i - 1][1] * 512)
        x2, y2 = int(chunk[i][0] * 512), int(chunk[i][1] * 512)
        draw.line([(x1, y1), (x2, y2)], fill="black", width=2)
        all_points.append((chunk[i][0], chunk[i][1]))

        if chunk[i][2] > 0.5:  # Stop signal ends the whole drawing
            done = True
            break
    if done:
        break
```
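
The elided "process and predict" step is the same preprocessing as the single-chunk example. One way to package it, assuming the `model`, `processor`, and `process_vision_info` objects from above are in scope (the helper name `predict_next_chunk` is ours, not the repo's):

```python
def predict_next_chunk(canvas, instruction="draw a circle"):
    """Run one forward pass on the current canvas; returns a (16, 3) numpy array."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": canvas, "min_pixels": 200704, "max_pixels": 401408},
            {"type": "text", "text": instruction},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
    inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
    with torch.no_grad():
        chunk = model.predict_chunk(**inputs)
    return chunk[0].float().cpu().numpy()
```

With this helper, the loop body above reduces to `chunk = predict_next_chunk(canvas)`.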

## Training

Trained for 2 epochs on a Modal H100 using a flow-matching loss. See the training code in the [GitHub repo](https://github.com/HusseinLezzaik/Qwen-DiT-Draw).
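
The loss itself is not shown on this card. Below is a minimal sketch of a standard conditional flow-matching objective, assuming a linear path between noise and the ground-truth chunk; the exact schedule and conditioning used in the repo may differ:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, target_chunk, vlm_features):
    """Standard flow-matching loss with a linear interpolation path.

    target_chunk: (B, 16, 3) ground-truth trajectory points.
    `velocity_net` and `vlm_features` are illustrative stand-ins for the
    DiT head and the frozen VLM's conditioning features.
    """
    noise = torch.randn_like(target_chunk)
    t = torch.rand(target_chunk.shape[0], 1, 1)   # per-sample flow time in [0, 1]
    x_t = (1 - t) * noise + t * target_chunk      # point along the linear path
    v_target = target_chunk - noise               # constant true velocity of that path
    v_pred = velocity_net(x_t, t.squeeze(), vlm_features)
    return F.mse_loss(v_pred, v_target)
```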

## Citation

```bibtex
@misc{qwen-dit-draw,
  author = {TESS Computer},
  title = {Qwen-DiT-Draw: VLM + DiT for Trajectory Prediction},
  year = {2025},
  url = {https://huggingface.co/TESS-Computer/qwen-dit-draw}
}
```

## Links