dcher95 committed · verified
Commit 667991a · 1 Parent(s): e237035

Upload README.md with huggingface_hub

Files changed (1): README.md (+91 −0)
README.md ADDED
@@ -0,0 +1,91 @@
---
license: apache-2.0
tags:
- controlnet
- stable-diffusion
- satellite-imagery
- osm
- image-to-image
- diffusers
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
library_name: diffusers
---

# VectorSynth

**VectorSynth** is a ControlNet model that generates satellite imagery from OpenStreetMap (OSM) vector data embeddings. It conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on control images rendered from CLIP embeddings of OSM text to synthesize realistic aerial imagery.

## Model Description

VectorSynth uses a two-stage pipeline:

1. **RenderEncoder**: Projects per-pixel 768-dim CLIP text embeddings of OSM text to 3-channel control images
2. **ControlNet**: Conditions Stable Diffusion 2.1 on the rendered control images

This model uses standard CLIP embeddings. For the COSA embedding variant, see [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA).

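In shape terms, the two stages compose as follows (a minimal sketch, not the training code; the 512×512 size and random hint are placeholders, and `render_encoder` and `pipe` are loaded as shown in the Usage section below):

```python
import torch

hint = torch.randn(512, 512, 768)          # per-pixel CLIP embeddings of OSM text, (H, W, 768)
x = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, 512, 512), the RenderEncoder input layout
# control_image = render_encoder(x).sigmoid()  # stage 1 -> (1, 3, 512, 512) control image
# image = pipe("Satellite image", image=control_image).images[0]  # stage 2 -> satellite image
```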

## Usage

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DDIMScheduler
from huggingface_hub import hf_hub_download

device = "cuda"

# Load the ControlNet weights
controlnet = ControlNetModel.from_pretrained("MVRL/VectorSynth", torch_dtype=torch.float16)

# Build the Stable Diffusion 2.1 pipeline around the ControlNet
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(device)

# Load the RenderEncoder; render.py from this repo must be importable,
# since the checkpoint stores the pickled model object
render_path = hf_hub_download("MVRL/VectorSynth", "render_encoder/clip-render_encoder.pth")
checkpoint = torch.load(render_path, map_location=device, weights_only=False)
render_encoder = checkpoint["model"].to(device).eval()

# Your hint tensor should be (H, W, 768): per-pixel CLIP embeddings of OSM text
# hint = torch.load("your_hint.pt").to(device)
# hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)

# Render the 3-channel control image from the hint
# with torch.no_grad():
#     control_image = render_encoder(hint).sigmoid()

# Generate a satellite image conditioned on the control image
# output = pipe(
#     prompt="Satellite image of a city neighborhood",
#     image=control_image,
#     num_inference_steps=40,
#     guidance_scale=7.5,
# ).images[0]
```
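The repo does not ship a hint builder, so here is a minimal sketch of one way to produce a compatible `(H, W, 768)` hint: embed each OSM tag string with a standard CLIP text encoder and broadcast the embeddings over a rasterized label map. The `openai/clip-vit-large-patch14` checkpoint is an assumption (chosen because its text width is 768), and the tag strings and random label map are placeholders for real rasterized OSM geometry.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda"

# Assumption: a CLIP text encoder with 768-dim output, matching the RenderEncoder input
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

# Placeholder OSM tag strings; real hints come from rasterized OSM vector data
tags = ["building=residential", "highway=primary", "natural=water"]
label_map = torch.randint(len(tags), (512, 512), device=device)  # per-pixel tag index

with torch.no_grad():
    tokens = tokenizer(tags, padding=True, return_tensors="pt").to(device)
    embeds = text_encoder(**tokens).pooler_output  # (num_tags, 768)

hint = embeds[label_map]  # (512, 512, 768) per-pixel CLIP embeddings
```

The Usage block above then turns `hint` into a control image via the RenderEncoder.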

## Files

- `config.json` - ControlNet configuration
- `diffusion_pytorch_model.safetensors` - ControlNet weights
- `render_encoder/clip-render_encoder.pth` - RenderEncoder weights
- `render.py` - RenderEncoder class definition

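Since `render.py` is needed to unpickle the RenderEncoder checkpoint, it can be convenient to fetch the whole repository in one call (a sketch using `snapshot_download`):

```python
from huggingface_hub import snapshot_download

# Downloads config.json, the ControlNet weights, render.py, and the
# RenderEncoder checkpoint into the local Hugging Face cache
local_dir = snapshot_download("MVRL/VectorSynth")
print(local_dir)  # e.g. add this directory to sys.path so render.py is importable
```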

## Citation

```bibtex
@misc{cher2025vectorsynth,
  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
  year={2025},
  eprint={2511.07744},
  archivePrefix={arXiv},
  note={arXiv preprint}
}
```

## Related Models

- [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA) - COSA embedding variant
- [GeoSynth](https://huggingface.co/MVRL/GeoSynth) - Text-to-satellite image generation