FloorplanVLM Training
Fine-tune Qwen2.5-VL-3B to extract wall, door, and window geometry from floor plan images as structured JSON.
Based on FloorplanVLM (arXiv:2602.06507), which trains in two stages:
- Stage 1: SFT on CubiCasa5K (5000 real floor plans)
- Stage 2: GRPO with geometric reward functions (wall IoU, room IoU, JSON validity)
Quick Start
# Install dependencies
pip install torch torchvision transformers trl peft datasets accelerate shapely Pillow lxml numpy tqdm huggingface_hub
# Optional (faster attention on GPU)
pip install flash-attn
# Log in to Hugging Face
huggingface-cli login
# Stage 1: SFT Training
python train_floorplan_vlm.py
# Stage 2: GRPO Training (after SFT completes)
python train_floorplan_grpo.py
What it does
- Downloads CubiCasa5K dataset (~5GB) from Zenodo automatically
- Converts SVG floor plan annotations → structured JSON (walls with coordinates, doors, windows, rooms)
- Trains Qwen2.5-VL-3B with LoRA to predict this JSON from floor plan images
- Pushes the model to HuggingFace Hub
- Auto-detects GPU vs CPU (GPU recommended for full training)
Configuration
Edit the top of each script:
| Setting | Default | Description |
|---|---|---|
| `MAX_SAMPLES` | `None` (all) | Set to 100 for a quick test run |
| `NUM_EPOCHS` | `2` | Training epochs |
| `PUSH_TO_HUB` | `True` | Push model to HF Hub |
| `HUB_MODEL_ID` | `manitocross/floorplan-vlm-sft` | Your model repo |
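
For a quick smoke test, the settings above might be edited along these lines. This is only a sketch: it assumes the settings are plain module-level constants at the top of each script, and the repo ID shown is a placeholder rather than a real repository.

```python
# Hypothetical quick-test values; assumes the script defines these
# as module-level constants near the top of the file.
MAX_SAMPLES = 100        # default is None (use the full dataset)
NUM_EPOCHS = 2           # default
PUSH_TO_HUB = False      # skip the Hub upload while testing
HUB_MODEL_ID = "your-username/floorplan-vlm-sft"  # placeholder repo ID
```

Setting `PUSH_TO_HUB` to `False` keeps a test run from overwriting a model that has already been pushed.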
Hardware Requirements
| Hardware | Memory | Time (full dataset) |
|---|---|---|
| GPU (A100 80GB) | ~20GB VRAM | ~4-6 hours |
| GPU (RTX 3090/4090) | ~20GB VRAM | ~8-12 hours |
| CPU | ~14GB RAM | Days (testing only) |
Output JSON Schema
{
  "walls": [
    {
      "id": "wall_1",
      "start": [120, 80],
      "end": [520, 80],
      "thickness": 15,
      "curvature": 0,
      "openings": [
        {"type": "door", "center": 320, "width": 90},
        {"type": "window", "center": 450, "width": 60}
      ]
    }
  ],
  "rooms": [
    {"label": "bedroom", "walls": ["wall_1", "wall_2", "wall_3", "wall_4"]}
  ]
}
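
A structural check on model output could look roughly like the sketch below. It is not taken from the training scripts: the function name `is_valid_floorplan` and the exact set of required keys are assumptions inferred from the example schema above. It checks structure only and does not verify that rooms reference existing wall IDs.

```python
import json

# Keys every wall object carries in the example schema (assumed to be required).
REQUIRED_WALL_KEYS = {"id", "start", "end", "thickness", "curvature", "openings"}


def is_valid_floorplan(text):
    """Return True if `text` parses as JSON and has the expected structure."""
    try:
        data = json.loads(text)
    except (ValueError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    walls, rooms = data.get("walls"), data.get("rooms")
    if not isinstance(walls, list) or not isinstance(rooms, list):
        return False
    for wall in walls:
        if not isinstance(wall, dict) or not REQUIRED_WALL_KEYS.issubset(wall):
            return False
        # Endpoints are [x, y] pairs.
        for point in (wall["start"], wall["end"]):
            if not (isinstance(point, list) and len(point) == 2):
                return False
        if not isinstance(wall["openings"], list):
            return False
        for opening in wall["openings"]:
            if not isinstance(opening, dict):
                return False
            if opening.get("type") not in {"door", "window"}:
                return False
    for room in rooms:
        if not isinstance(room, dict):
            return False
        if not isinstance(room.get("label"), str):
            return False
        if not isinstance(room.get("walls"), list):
            return False
    return True
```

A check like this is cheap enough to run on every generated sample before any geometric comparison is attempted.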
GRPO Reward Functions
Stage 2 uses geometric rewards from the FloorplanVLM paper:
- R_val (0.1 weight): JSON validity + schema compliance
- R_ext (0.5 weight): External wall boundary IoU (Shapely polygon comparison)
- R_int (0.4 weight): Room IoU, gated by α when external walls are wrong
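
The weights above suggest a weighted sum, which can be sketched as follows. The gating rule here is an assumption, since this README does not spell it out: the sketch scales R_int by α whenever R_ext falls below a threshold. The function name `combined_reward`, the default `alpha`, and `ext_threshold` are all hypothetical. The IoU inputs are expected to be computed elsewhere (for example with Shapely, as noted above).

```python
def combined_reward(r_val, r_ext, r_int, alpha=0.5, ext_threshold=0.5):
    """Weighted sum of the three rewards, each expected in [0, 1].

    Assumption: "gated by alpha" is modeled as scaling r_int by `alpha`
    when r_ext is below `ext_threshold`. Both defaults are illustrative
    and are not values from the paper or the training scripts.
    """
    gate = 1.0 if r_ext >= ext_threshold else alpha
    return 0.1 * r_val + 0.5 * r_ext + 0.4 * gate * r_int
```

Under this reading, a prediction with a poor outer boundary cannot earn full credit for its room layout, which pushes the policy to get the external walls right first.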