SVGDreamer: Text Guided Vector Graphics Generation with Diffusion Model

Community Article · Published April 19, 2024

👥 Authors: Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu
🌟 Status: Accepted by CVPR 2024
🔍 Keywords: #SVGDreamer #Text-to-SVG #SVG #Diffusion #CVPR2024

Links:

🔗 arXiv Paper: https://arxiv.org/abs/2312.16476
🌐 Project Page: https://ximinng.github.io/SVGDreamer-project/
📝 Code Repo: https://github.com/ximinng/SVGDreamer

What is SVGDreamer?

📝 TL;DR: Given a text prompt, SVGDreamer can generate editable and versatile high-fidelity vector graphics.

Figure 1: SVGDreamer: Text-to-SVG

Introduction

Scalable Vector Graphics (SVG) is a widely used format for describing 2D graphics and graphical applications. Unlike raster graphics, SVG defines images through mathematical descriptions of shapes, allowing lossless scaling to any size without distortion. This makes SVG an ideal choice for web design, particularly in scenarios that must adapt to various resolutions and devices. However, manually authoring SVGs is costly and challenging for designers.

Recently, with the rapid development of CLIP and generative models, text-to-SVG synthesis has made significant progress in areas such as abstract pixel-style graphics [1,2] and vector hand-drawn sketches [3,4]. Optimizing vector path primitives through a differentiable renderer [5] to automatically synthesize the corresponding vector graphics has become a popular research direction. Compared to human designers, text-to-SVG methods can create vector content rapidly and at scale, contributing to the expansion of vector assets.

However, existing text-to-SVG methods still face two limitations: (1) the generated vector graphics lack editability; (2) it is difficult to produce high-quality and diverse results. To address these limitations, the authors propose SVGDreamer, a novel text-guided vector graphics synthesis method.

Methodology

Figure 2: Overview of SVGDreamer. The method consists of two parts: semantic-driven image vectorization (SIVE) and SVG synthesis through VPSD optimization.

SVGDreamer consists of two parts: Semantic-driven Image Vectorization (SIVE) and Vectorized Particle-based Score Distillation (VPSD). SIVE vectorizes an image guided by the text prompt, while VPSD synthesizes high-quality, diverse, and aesthetically appealing vector graphics by distilling scores from a pre-trained diffusion model, treating candidate SVGs as particles.

SIVE: Semantic-driven Image Vectorization

SIVE synthesizes vector graphics with a decoupled semantic hierarchy based on the text prompt. It consists of two stages: (1) primitive initialization and (2) semantic-aware optimization.

As shown in the upper half of Figure 2, different words in the text prompt correspond to different cross-attention maps, which the authors use to initialize the control points of the vector graphics. Specifically, they normalize each attention map and treat it as a probability distribution, then sample points on the canvas weighted by these probabilities to serve as control points for Bézier curves.
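
Below is a minimal NumPy sketch of this initialization step; the function name, the softmax temperature, and the jitter used to place the remaining control points of each path are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def init_control_points(attn_map, n_paths, pts_per_path=4, temperature=1.0, rng=None):
    """Sample initial Bezier control points from an (H, W) cross-attention map."""
    rng = rng or np.random.default_rng()
    h, w = attn_map.shape
    # Softmax-normalize the attention map into a probability distribution.
    probs = np.exp(attn_map / temperature)
    probs = (probs / probs.sum()).ravel()

    paths = []
    for _ in range(n_paths):
        # Draw one anchor position per path, weighted by attention mass.
        idx = rng.choice(probs.size, p=probs)
        y, x = divmod(idx, w)
        # Place the remaining control points with small jitter around the anchor
        # (an assumption; the paper only specifies probability-weighted sampling).
        ctrl = np.stack([
            np.clip(x + rng.normal(0.0, 2.0, pts_per_path), 0, w - 1),
            np.clip(y + rng.normal(0.0, 2.0, pts_per_path), 0, h - 1),
        ], axis=1)
        paths.append(ctrl)
    return paths  # list of (pts_per_path, 2) arrays in canvas coordinates
```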

Subsequently, the authors convert the attention maps obtained during initialization into reusable masks: values greater than or equal to a threshold are set to 1, marking the target region, while values below the threshold are set to 0. These masks define the SIVE loss function, allowing each object to be optimized precisely and independently:

$$
\mathcal{L}_{\mathrm{SIVE}} = \sum_{i}^{O} \left( \hat{\mathcal{M}}_i \odot I - \hat{\mathcal{M}}_i \odot \mathbf{x} \right)^2
$$

where $I$ is the target raster image, $\mathbf{x}$ is the rendered SVG, $\hat{\mathcal{M}}_i$ is the binarized mask of the $i$-th object, and $O$ is the number of objects.

SIVE ensures that control points remain within their respective semantic object regions, achieving the decomposition of different objects, as depicted in the upper right part of Figure 2.
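
In code, the mask construction and the SIVE loss amount to a thresholded, per-object masked MSE. The sketch below is a minimal PyTorch version; the threshold value and the mean reduction are assumptions.

```python
import torch

def sive_loss(render, target, attn_maps, tau=0.5):
    """Masked reconstruction loss over O semantic objects.

    render, target: (3, H, W) rendered SVG and target raster image.
    attn_maps: (O, H, W) per-object attention maps with values in [0, 1].
    tau: threshold for binarizing attention into reusable masks (assumed value).
    """
    masks = (attn_maps >= tau).float()  # M_hat_i: 1 inside object i, 0 outside
    loss = render.new_zeros(())
    for m in masks:  # sum over the O objects
        loss = loss + ((m * target - m * render) ** 2).mean()
    return loss
```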

Figure 3: Examples of vector assets created by SIVE.

Additionally, we present further examples in Figure 4. These generated SVGs can be decomposed into background and foreground elements, which can then be recombined to create new SVGs.

Figure 4: Examples showcasing the editability of the results generated by our SVGDreamer.

VPSD: Vectorized Particle-based Score Distillation

Previous diffusion-based SVG generation works [2,4] have explored Score Distillation Sampling (SDS) [6] to optimize SVG parameters. However, SDS often leads to over-saturated colors and overly smooth SVG results. Inspired by Variational Score Distillation (VSD), the authors propose the Vectorized Particle-based Score Distillation (VPSD) loss to address these issues.

In contrast to SDS, VPSD models an SVG as a distribution over control points and colors, and optimizes this distribution rather than a single point estimate of the SVG parameters:

$$
\nabla_{\theta} \mathcal{L}_{\mathrm{VPSD}}(\phi, \phi_{\mathrm{est}}, \mathbf{x} = \mathcal{R}(\theta)) \triangleq \mathbb{E}_{t,\epsilon,p,c} \left[ w(t) \left( \epsilon_{\phi}(\mathbf{z}_t; y, t) - \epsilon_{\phi_{\mathrm{est}}}(\mathbf{z}_t; y, p, c, t) \right) \frac{\partial \mathbf{z}}{\partial \theta} \right]
$$
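
In practice, score-distillation gradients like this are usually injected through a surrogate loss whose gradient with respect to the latents equals the bracketed term. Below is a schematic PyTorch sketch under that convention; `unet_pretrained`, `unet_lora`, and `scheduler` are placeholder names for a frozen T2I UNet ($\epsilon_\phi$), its LoRA-adapted copy ($\epsilon_{\phi_\mathrm{est}}$), and a diffusers-style noise scheduler, and the particle/guidance conditioning is omitted for brevity.

```python
import torch

def vpsd_surrogate_loss(latents, t, prompt_emb, unet_pretrained, unet_lora,
                        scheduler, w_t=1.0):
    """Surrogate loss whose gradient w.r.t. `latents` is w(t) * (eps_phi - eps_phi_est)."""
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)  # z_t
    with torch.no_grad():  # both networks act only as score estimators here
        eps_pre = unet_pretrained(noisy, t, encoder_hidden_states=prompt_emb).sample
        eps_est = unet_lora(noisy, t, encoder_hidden_states=prompt_emb).sample
    grad = w_t * (eps_pre - eps_est)
    # d/d(latents) of (grad.detach() * latents).sum() is exactly `grad`, so
    # backprop carries it through the VAE encoder and the rasterizer to theta.
    return (grad.detach() * latents).sum()
```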

Because directly optimizing a second full diffusion model $\epsilon_{\phi_{\mathrm{est}}}$ would be computationally expensive, LoRA is introduced to reduce the number of parameters being optimized:

$$
\mathcal{L}_{\mathrm{lora}} = \mathbb{E}_{t,\epsilon,p,c} \left\| \epsilon_{\phi_{\mathrm{est}}}(\mathbf{z}_t; y, p, c, t) - \epsilon \right\|_{2}^{2}
$$
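
This is the standard epsilon-prediction denoising loss, trained so the LoRA network keeps tracking the distribution of the current SVG particles. A minimal sketch, reusing the placeholder names from the previous snippet:

```python
import torch
import torch.nn.functional as F

def lora_denoise_loss(latents, t, prompt_emb, unet_lora, scheduler):
    """Denoising loss that updates only the LoRA parameters."""
    noise = torch.randn_like(latents)
    # Detach so this term trains the LoRA weights, not the SVG parameters.
    noisy = scheduler.add_noise(latents.detach(), noise, t)
    eps_est = unet_lora(noisy, t, encoder_hidden_states=prompt_emb).sample
    return F.mse_loss(eps_est, noise)
```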

Finally, to improve the aesthetic quality of the synthesized vector graphics, the authors introduce Reward Feedback Learning (ReFL), which feeds samples drawn from the LoRA model into a pre-trained reward model and uses the resulting scores to further optimize the LoRA parameters:

$$
\mathcal{L}_{\mathrm{reward}} = \lambda \, \mathbb{E}_{y} \left[ \psi\left( r\left( y, g_{\phi_{\mathrm{est}}}(y) \right) \right) \right]
$$
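
A schematic version of this term is below. The `reward_model.score` call and the reward-to-loss mapping $\psi$ are hypothetical stand-ins for whichever differentiable scoring API and mapping are used in practice.

```python
import torch

def reward_feedback_loss(prompt, images, reward_model, lam=1.0):
    """ReFL-style term: penalize LoRA samples that the reward model scores poorly.

    images: decoded samples g_phi_est(y) from the LoRA model (gradients must
    flow back into the LoRA parameters, so do not detach them here).
    """
    rewards = reward_model.score(prompt, images)   # r(y, g_phi_est(y)); hypothetical API
    return lam * torch.relu(1.0 - rewards).mean()  # psi: one plausible reward-to-loss map
```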

The final VPSD objective is then defined as a weighted sum of the three terms:

$$
\min_{\theta} \; \mathcal{L}_{\mathrm{VPSD}} + \mathcal{L}_{\mathrm{lora}} + \lambda_{\mathrm{r}} \mathcal{L}_{\mathrm{reward}}
$$

The SVG path parameters are updated through backpropagation, and the optimization loop is iterated until convergence to produce the final result; a schematic sketch of this loop follows.
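
Putting the pieces together, the loop might look as follows. All names here (`render_svg`, `one_step_sample`, the optimizers) are illustrative placeholders; the authors' actual implementation, built on a differentiable rasterizer (diffvg) and a pre-trained latent diffusion model, is in the code repo linked above.

```python
for step in range(num_steps):
    # 1) Rasterize each SVG particle differentiably and encode it to latents.
    imgs = torch.stack([render_svg(p) for p in particles])     # diffvg-style rendering
    latents = vae.encode(imgs).latent_dist.sample() * 0.18215  # SD latent scaling
    t = torch.randint(t_min, t_max, (1,), device=latents.device)

    # 2) VPSD step: update the path control points and colors (theta).
    vpsd_surrogate_loss(latents, t, prompt_emb, unet_pretrained,
                        unet_lora, scheduler).backward()
    svg_optimizer.step(); svg_optimizer.zero_grad()

    # 3) LoRA step: denoising loss plus reward feedback on LoRA samples.
    refl_imgs = vae.decode(one_step_sample(unet_lora, prompt_emb)).sample  # g_phi_est(y), schematic
    loss_lora = lora_denoise_loss(latents.detach(), t, prompt_emb, unet_lora, scheduler) \
              + reward_feedback_loss(prompt, refl_imgs, reward_model)
    loss_lora.backward()
    lora_optimizer.step(); lora_optimizer.zero_grad()
```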

Qualitative Results

The figure below shows SVG results generated by SVGDreamer in six styles: iconography, pixel art, ink and wash, low-poly, sketch, and painting. Different colored suffixes indicate different style types; the style need not be specified in the prompt, as it is controlled through the choice of vector primitives.

Figure 5: Given a text prompt, SVGDreamer can generate a variety of vector graphics. SVGDreamer is a versatile tool that can work with various vector styles without being limited to a specific prompt suffix. We utilize various colored suffixes to indicate different styles. The style is governed by vector primitives.

Application

In addition, the authors demonstrate an application of SVGDreamer: creating vector posters. By converting text into vector glyphs and combining them with generated vector content, aesthetically pleasing posters can be produced. Moreover, unlike posters generated as raster images, every part of a vector poster remains editable.

Figure 6: Comparison of synthetic posters generated by different methods. The input text prompts and glyphs to be added to the posters are displayed on the left side.

Conclusion

In this work, we introduced SVGDreamer, an innovative model for text-guided vector graphics synthesis. SVGDreamer incorporates two key technical designs: semantic-driven image vectorization (SIVE) and vectorized particle-based score distillation (VPSD), which empower the model to generate vector graphics with high editability, superior visual quality, and notable diversity. We expect SVGDreamer to significantly advance the application of text-to-SVG models in the design field.

Limitations

The editability of our method is currently constrained by the T2I model we employ, as the proposed SIVE process relies on the attention map generated by the T2I model. However, we anticipate that future advancements in T2I diffusion models will enhance the decomposition capabilities of our model, thus expanding its editability even further.

Furthermore, it would be valuable to study how to automatically determine the number of control points assigned to each object in SIVE. We believe this will further advance semantically guided image vectorization.

References

  1. Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  2. Ajay Jain, Amber Xie, and Pieter Abbeel. VectorFusion: Text-to-SVG by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  3. Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. CLIPasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022.
  4. Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Dong Xu, and Qian Yu. DiffSketcher: Text guided vector sketch synthesis through latent diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  5. Tzu-Mao Li, Michal Lukac, Michael Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020.
  6. Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023.