Single Trajectory Distillation for Accelerating Image and Video Style Transfer

Authors: Sijie Xu¹, Runqi Wang¹,², Wei Zhu¹, Dejia Song¹, Nemo Chen¹, Xu Tang¹, Yao Hu¹
Affiliations: ¹Xiaohongshu, ²ShanghaiTech University

🖼️ Visual Results

Method Overview

Qualitative Comparison

Visual comparison with LCM, TCD, PCM, and other baselines at NFE=8 (CFG=6)

Metric Analysis

Metrics at classifier-free guidance (CFG) scales from 2 to 8. Our method (red line) achieves the best balance between style and content.

🚀 Quick Start

Inference Demo (Image-to-Image)

# !pip install opencv-python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, TCDScheduler
from PIL import Image

device = "cuda"
std_lora_path = "weights/std/std_sdxl_i2i_eta0.75.safetensors"

# Initialize pipeline
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "weights/dreamshaper_XL_v21", 
    torch_dtype=torch.float16, 
    variant="fp16"
).to(device)

# Load STD components
pipe.scheduler = TCDScheduler.from_config(
    pipe.scheduler.config, 
    timestep_spacing='leading', 
    steps_offset=1
)
pipe.load_lora_weights(std_lora_path, adapter_name="std")
pipe.fuse_lora()

# Prepare inputs
prompt = "Stick figure abstract nostalgic style."
n_prompt = "worst face, NSFW, nudity, nipples, (worst quality, low quality:1.4), blurred, low resolution, pixelated, dull colors, overly simplistic, harsh lighting, lack of detail, poorly composed, dark and gloomy atmosphere, (malformed hands:1.4), (poorly drawn hands:1.4), (mutated fingers:1.4), (extra limbs:1.35), (poorly drawn face:1.4), missing legs, (extra legs:1.4), missing arms, extra arm, ugly, fat, (close shot:1.1), explicit content, sexual content, pornography, adult content, inappropriate, indecent, obscene, vulgar, suggestive, erotic, lewd, provocative, mature content"
src_img = Image.open("doc/imgs/src_img.jpg").resize((960, 1280))
style_img = Image.open("doc/imgs/style_img.png")

# Run inference
image = pipe(
    prompt=prompt, 
    negative_prompt=n_prompt,
    num_inference_steps=11,  # ceil(8 / 0.75) = 11 scheduled steps; with strength=0.75, int(11 * 0.75) = 8 are run
    guidance_scale=6,
    strength=0.75,
    image=src_img,
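    # NOTE: diffusers only accepts ip_adapter_image after an IP-Adapter has been
    # loaded with pipe.load_ip_adapter(...); the checkpoint to use is not given here.
    ip_adapter_image=style_img,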
).images[0]

image.save("std_output.png")
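
The `num_inference_steps=11` value above can be derived rather than hard-coded. The sketch below assumes that diffusers' img2img pipelines run roughly `int(num_inference_steps * strength)` denoising steps, which is how the example reaches 8 function evaluations. The helper names `effective_steps` and `scheduled_steps` are hypothetical and are not part of this repository or of diffusers.

```python
import math


def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps actually run by an img2img call (assumed diffusers behavior)."""
    return min(int(num_inference_steps * strength), num_inference_steps)


def scheduled_steps(target_nfe: int, strength: float) -> int:
    """Smallest num_inference_steps that yields at least target_nfe effective steps."""
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    steps = math.ceil(target_nfe / strength)
    # Guard against floating-point rounding in the division above.
    while effective_steps(steps, strength) < target_nfe:
        steps += 1
    return steps


if __name__ == "__main__":
    # Scheduled step counts that give 8 effective steps at several strengths.
    for strength in (0.65, 0.75, 0.85, 0.95):
        n = scheduled_steps(8, strength)
        print(strength, n, effective_steps(n, strength))
```

With `strength=0.75` and a target of 8 steps this gives 11, matching the call above; other strengths need a different `num_inference_steps` to keep the same effective step count.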

📦 Model Zoo

We provide pretrained models for both image-to-image and video-to-video tasks with different η values. All models are hosted on Hugging Face.

Image-to-Image Models

η Value	Model Link
0.65	std_sdxl_i2i_eta0.65.safetensors
0.75	std_sdxl_i2i_eta0.75.safetensors
0.85	std_sdxl_i2i_eta0.85.safetensors
0.95	std_sdxl_i2i_eta0.95.safetensors

Video-to-Video Models

η Value	Model Link
0.65	std_sdxl_v2v_eta0.65.safetensors
0.75	std_sdxl_v2v_eta0.75.safetensors
0.85	std_sdxl_v2v_eta0.85.safetensors
0.95	std_sdxl_v2v_eta0.95.safetensors
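
The file names in both tables follow one pattern, `std_sdxl_{task}_eta{η}.safetensors`. The sketch below builds the local path for a given task and η value. The helper name `std_weight_path` and the default `weights/std` directory (borrowed from the Quick Start snippet) are illustrative assumptions, not part of the released code.

```python
from pathlib import PurePosixPath

# Values listed in the Model Zoo tables.
AVAILABLE_ETAS = (0.65, 0.75, 0.85, 0.95)
TASKS = ("i2i", "v2v")  # image-to-image, video-to-video


def std_weight_path(task: str, eta: float, root: str = "weights/std") -> str:
    """Return the expected local path of an STD LoRA file (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    eta = round(eta, 2)
    if eta not in AVAILABLE_ETAS:
        raise ValueError(f"eta must be one of {AVAILABLE_ETAS}, got {eta}")
    return str(PurePosixPath(root) / f"std_sdxl_{task}_eta{eta:.2f}.safetensors")


if __name__ == "__main__":
    print(std_weight_path("i2i", 0.75))
    print(std_weight_path("v2v", 0.95))
```

The returned string can be passed to `pipe.load_lora_weights(...)` in the same way as `std_lora_path` in the Quick Start example.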

📚 Citation

@article{xu2024single,
  title={Single Trajectory Distillation for Accelerating Image and Video Style Transfer},
  author={Xu, Sijie and Wang, Runqi and Zhu, Wei and Song, Dejia and Chen, Nemo and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2412.18945},
  year={2024}
}
