Single Trajectory Distillation for Accelerating Image and Video Style Transfer
Authors: Sijie Xu¹, Runqi Wang¹,², Wei Zhu¹, Dejia Song¹, Nemo Chen¹, Xu Tang¹, Yao Hu¹
Affiliations: ¹Xiaohongshu, ²ShanghaiTech University
Visual Results
Method Overview
Qualitative Comparison
Visual comparison with LCM, TCD, PCM, and other baselines at NFE=8 (CFG=6)
Metric Analysis

Performance under different CFG values (2-8). Our method (red line) achieves optimal style-content balance.
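A sweep over CFG values like the one plotted above can be sketched as a small loop. This is only an illustration, not the authors' evaluation script: `sweep_guidance` and the `generate` callable are hypothetical names, and the metrics behind the plot are not reproduced here.

```python
from typing import Any, Callable, Dict, Iterable


def sweep_guidance(
    generate: Callable[..., Any],
    scales: Iterable[float] = range(2, 9),  # CFG values 2 through 8
    **kwargs: Any,
) -> Dict[float, Any]:
    """Call `generate` once per CFG value and collect the outputs by scale."""
    return {scale: generate(guidance_scale=scale, **kwargs) for scale in scales}


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a GPU or model weights.
    fake = lambda guidance_scale, **kw: f"image@cfg={guidance_scale}"
    print(sweep_guidance(fake, prompt="Stick figure abstract nostalgic style."))
```

With a real pipeline, `generate` could be a wrapper such as `lambda **kw: pipe(**kw).images[0]`, with the remaining call arguments passed through `**kwargs`.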
Quick Start
Inference Demo (Image-to-Image)
# !pip install diffusers transformers accelerate peft opencv-python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, TCDScheduler
from PIL import Image
device = "cuda"
std_lora_path = "weights/std/std_sdxl_i2i_eta0.75.safetensors"
# Initialize pipeline
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "weights/dreamshaper_XL_v21",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
# Load STD components
pipe.scheduler = TCDScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="leading",
    steps_offset=1,
)
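pipe.load_lora_weights(std_lora_path, adapter_name="std")
pipe.fuse_lora()
# The ip_adapter_image argument used below requires an IP-Adapter to be loaded first.
# Assumption (not stated in the original): the public SDXL IP-Adapter checkpoint.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")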
# Prepare inputs
prompt = "Stick figure abstract nostalgic style."
n_prompt = "worst face, NSFW, nudity, nipples, (worst quality, low quality:1.4), blurred, low resolution, pixelated, dull colors, overly simplistic, harsh lighting, lack of detail, poorly composed, dark and gloomy atmosphere, (malformed hands:1.4), (poorly drawn hands:1.4), (mutated fingers:1.4), (extra limbs:1.35), (poorly drawn face:1.4), missing legs, (extra legs:1.4), missing arms, extra arm, ugly, fat, (close shot:1.1), explicit content, sexual content, pornography, adult content, inappropriate, indecent, obscene, vulgar, suggestive, erotic, lewd, provocative, mature content"
src_img = Image.open("doc/imgs/src_img.jpg").resize((960, 1280))
style_img = Image.open("doc/imgs/style_img.png")
# Run inference
image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    num_inference_steps=11,  # int(11 * 0.75) = 8 denoising steps actually run (NFE=8)
    guidance_scale=6,
    strength=0.75,
    image=src_img,
    ip_adapter_image=style_img,
).images[0]
image.save("std_output.png")
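The `num_inference_steps=11` setting above follows from how diffusers image-to-image pipelines shorten the schedule: with `strength < 1`, roughly `int(num_inference_steps * strength)` steps are actually executed. The sketch below works through that arithmetic for a target of 8 function evaluations. It assumes this truncation rule and that `strength` is set to match the checkpoint's η, as in the demo; `effective_steps` and `steps_for_nfe` are hypothetical helper names, not part of STD or diffusers.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps an img2img call runs under the assumed truncation rule."""
    return min(int(num_inference_steps * strength), num_inference_steps)


def steps_for_nfe(target_nfe: int, strength: float) -> int:
    """Smallest num_inference_steps whose effective step count reaches target_nfe."""
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    n = target_nfe
    # Increase the nominal step count until enough steps survive truncation.
    while effective_steps(n, strength) < target_nfe:
        n += 1
    return n


if __name__ == "__main__":
    for strength in (0.65, 0.75, 0.85, 0.95):
        n = steps_for_nfe(8, strength)
        print(f"strength={strength}: num_inference_steps={n}, "
              f"effective={effective_steps(n, strength)}")
```

For `strength=0.75` this gives 11 nominal steps and 8 effective steps, matching the demo call.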
Model Zoo
We provide pretrained models for both image-to-image and video-to-video tasks with different η values. All models are hosted on Hugging Face in the SecondComming/Single-Trajectory-Distillation repository.
Image-to-Image Models
η Value | Model Link |
---|---|
0.65 | std_sdxl_i2i_eta0.65.safetensors |
0.75 | std_sdxl_i2i_eta0.75.safetensors |
0.85 | std_sdxl_i2i_eta0.85.safetensors |
0.95 | std_sdxl_i2i_eta0.95.safetensors |
Video-to-Video Models
η Value | Model Link |
---|---|
0.65 | std_sdxl_v2v_eta0.65.safetensors |
0.75 | std_sdxl_v2v_eta0.75.safetensors |
0.85 | std_sdxl_v2v_eta0.85.safetensors |
0.95 | std_sdxl_v2v_eta0.95.safetensors |
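The checkpoint names in the tables follow a regular pattern, so selecting and fetching one can be sketched as below. This is a hedged example: `lora_filename` and `download_lora` are hypothetical helpers, the repository id is the one shown on the model page, and it is an assumption that the files sit at the repository root (adjust `filename` if they live in a subfolder). The download uses `hf_hub_download` from `huggingface_hub`.

```python
TASKS = ("i2i", "v2v")
ETAS = ("0.65", "0.75", "0.85", "0.95")


def lora_filename(task: str, eta: float) -> str:
    """Build a checkpoint name matching the Model Zoo tables."""
    eta_str = f"{eta:.2f}"
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}")
    if eta_str not in ETAS:
        raise ValueError(f"eta must be one of {ETAS}")
    return f"std_sdxl_{task}_eta{eta_str}.safetensors"


def download_lora(task: str, eta: float,
                  repo_id: str = "SecondComming/Single-Trajectory-Distillation") -> str:
    """Download the checkpoint and return its local path (file location assumed)."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network access
    return hf_hub_download(repo_id=repo_id, filename=lora_filename(task, eta))


if __name__ == "__main__":
    print(lora_filename("i2i", 0.75))
```

The returned path can then be passed to `pipe.load_lora_weights(...)` in place of a local `weights/std/...` path.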
Citation
@article{xu2024single,
  title={Single Trajectory Distillation for Accelerating Image and Video Style Transfer},
  author={Xu, Sijie and Wang, Runqi and Zhu, Wei and Song, Dejia and Chen, Nemo and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2412.18945},
  year={2024}
}