LTX-2.3 Sync-LoRA (3d1t, rank 256)
An In-Context LoRA (IC-LoRA) for LTX-2.3 (22B) that performs first-frame-driven video editing.
Given:
- a reference video (the motion / identity to preserve), and
- an edited first frame (a single image showing the desired edit applied to frame 0),
the model generates the full edited video: the edit from the first frame is propagated and kept in sync with the reference video's motion across all frames.
This is the rank-256 3d1t variant of the Sync-LoRA.
The prompt: use the token 3d1t
This LoRA was trained with a constant caption, the literal token 3d1t, for every sample.
The edit is driven entirely by the first-frame image and the reference video, not by a text
description. So at inference:
Always set the text prompt to exactly 3d1t.
Do not write a descriptive prompt (e.g. "a person with red hair"); describe the edit by supplying
the edited first frame instead. 3d1t is a "special edit token" telling the model to edit via the
first frame + reference.
How it works (conditioning)
| Input | How it's wired |
|---|---|
| Edited first frame (image) | image conditioning at latent index 0 (VideoConditionByLatentIndex, strength 1.0); replaces frame 0 |
| Reference video | IC-LoRA reference conditioning (VideoConditionByReferenceLatent, strength 1.0) |
| Text prompt | the constant token 3d1t |
Video-only (no audio).
Training details
- Base model: LTX-2.3 (22B), dev checkpoint
- Type: IC-LoRA, rank = alpha = 256
- Resolution / length (training): 512×512, 81 frames, 25 fps
- Caption: constant token 3d1t (text conditioning effectively removed)
- File: ltx-2.3-sync-lora-3d1t-r256.safetensors (ComfyUI-style keys, diffusion_model. prefix), step 5000
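For readers unfamiliar with the rank / alpha notation, the following is a sketch under the standard LoRA parameterization. This is an assumption: the card does not spell out how the loader applies alpha, and the symbols $W$, $A$, $B$, and $s$ are introduced here for illustration only.

```latex
W' = W + s \cdot \frac{\alpha}{r} \, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
```

With $r = \alpha = 256$ the factor $\alpha / r$ equals 1, so under this assumption the only remaining multiplier is the strength $s$ given when the LoRA is loaded.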
Inference
Use the LTX-2 ltx-pipelines IC-LoRA, two-stage distilled
pipeline (stage 1 at half resolution → ×2 spatial upscale → stage 2 refine).
Important for LTX-2.3: distillation is shipped as a LoRA, so stack the LTX-2.3
distilled-lora-384 together with this Sync-LoRA on both stages (8-step stage 1 + 3-step stage 2).
You will need (from the LTX-2.3 release):
- ltx-2.3-22b-dev.safetensors (base)
- ltx-2.3-22b-distilled-lora-384-1.1.safetensors (distillation LoRA)
- ltx-2.3-spatial-upscaler-x2-1.1.safetensors (stage-2 upscaler)
- the Gemma text encoder
CLI (sketch)
    python -m ltx_pipelines.ic_lora \
        --distilled-checkpoint-path ltx-2.3-22b-dev.safetensors \
        --spatial-upsampler-path ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
        --gemma-root path/to/gemma \
        --lora ltx-2.3-sync-lora-3d1t-r256.safetensors 1.0 \
        --lora ltx-2.3-22b-distilled-lora-384-1.1.safetensors 1.0 \
        --prompt "3d1t" \
        --video-conditioning reference.mp4 1.0 \
        --images edited_first_frame.png 0 1.0 \
        --height 1024 --width 1024 --num-frames 81 --frame-rate 25 --seed 42 \
        --output-path out.mp4
Notes:
- --prompt "3d1t" (the token): required.
- --images <png> 0 1.0 puts the edited frame at index 0; --video-conditioning <mp4> 1.0 is the reference.
- Stage 1 runs at half the requested resolution, so --height/--width 1024 → stage-1 512 (the training resolution). Resolution must be divisible by 64; frames must satisfy frames % 8 == 1.
- To match an input clip's duration, set --num-frames/--frame-rate accordingly (e.g. a 5.1 s, 30 fps clip → --num-frames 153 --frame-rate 30). Non-square aspect ratios (e.g. portrait 768×1024) work and avoid cropping a portrait input.
- On LTX-2.3, stack the distilled-lora-384 on both stages (the stock pipeline leaves stage 2 LoRA-free, which expects an already-fused distilled checkpoint).
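The size and length constraints in the notes above can be checked before launching a run. The following is a minimal standard-library sketch, not part of ltx-pipelines: the function names are hypothetical, and two policies are assumptions made here for illustration (rounding the frame count down to the nearest valid value, and fixing the long side while snapping the short side to a multiple of 64).

```python
def stage1_size(height: int, width: int) -> tuple[int, int]:
    """Return the (height, width) that stage 1 runs at: half the requested size."""
    for name, value in (("height", height), ("width", width)):
        if value <= 0 or value % 64 != 0:
            raise ValueError(f"{name}={value} must be a positive multiple of 64")
    return height // 2, width // 2


def frames_for_clip(duration_s: float, fps: float) -> int:
    """Largest frame count n, not exceeding the clip length, with n % 8 == 1."""
    total = round(duration_s * fps)  # frames available in the source clip
    if total < 1:
        raise ValueError("clip is shorter than one frame")
    return 8 * ((total - 1) // 8) + 1  # assumption: round down, never pad


def fit_resolution(src_width: int, src_height: int, long_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) that keeps the source aspect ratio approximately.

    The long side is set to long_side and the short side is snapped to the
    nearest multiple of 64 (an illustrative policy, not taken from the card).
    """
    if long_side <= 0 or long_side % 64 != 0:
        raise ValueError("long_side must be a positive multiple of 64")
    short, long_ = sorted((src_width, src_height))
    snapped = max(64, round(short / long_ * long_side / 64) * 64)
    if src_width <= src_height:  # portrait or square input
        return snapped, long_side
    return long_side, snapped  # landscape input
```

For example, `frames_for_clip(5.1, 30)` reproduces the 153-frame value from the notes, and a 3:4 portrait source maps to the 768×1024 size mentioned there.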
Python (building blocks)
    from ltx_core.loader import LTXV_LORA_COMFY_RENAMING_MAP, LoraPathStrengthAndSDOps
    from ltx_pipelines.ic_lora import ICLoraPipeline

    # Sync-LoRA (ComfyUI-style keys, hence the renaming map), strength 1.0
    sync = LoraPathStrengthAndSDOps("ltx-2.3-sync-lora-3d1t-r256.safetensors", 1.0, LTXV_LORA_COMFY_RENAMING_MAP)
    # LTX-2.3 distillation LoRA, stacked with the Sync-LoRA at strength 1.0
    distilled = LoraPathStrengthAndSDOps("ltx-2.3-22b-distilled-lora-384-1.1.safetensors", 1.0, LTXV_LORA_COMFY_RENAMING_MAP)

    pipe = ICLoraPipeline(
        distilled_checkpoint_path="ltx-2.3-22b-dev.safetensors",  # dev base; distillation comes from the LoRA
        spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
        gemma_root="path/to/gemma",
        loras=[sync, distilled],
    )

    video, _ = pipe(
        prompt="3d1t",  # the token
        seed=42, height=1024, width=1024, num_frames=81, frame_rate=25,
        images=[("edited_first_frame.png", 0, 1.0)],  # edit at frame 0
        video_conditioning=[("reference.mp4", 1.0)],  # reference video
    )
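When running several clips, it can help to build the call arguments in one place so the prompt stays fixed at the training token. The sketch below is an illustration only: `sync_call_kwargs` and `SYNC_PROMPT` are hypothetical names, and the keyword names simply mirror the `pipe(...)` call shown above rather than a documented signature.

```python
SYNC_PROMPT = "3d1t"  # constant caption used for every training sample


def sync_call_kwargs(reference_video, edited_first_frame, *, height=1024, width=1024,
                     num_frames=81, frame_rate=25, seed=42):
    """Assemble keyword arguments for one edit; the prompt is never user-supplied."""
    return {
        "prompt": SYNC_PROMPT,
        "seed": seed,
        "height": height,
        "width": width,
        "num_frames": num_frames,
        "frame_rate": frame_rate,
        # edited image pinned to frame index 0 at strength 1.0
        "images": [(edited_first_frame, 0, 1.0)],
        # reference video at strength 1.0
        "video_conditioning": [(reference_video, 1.0)],
    }
```

Assuming a pipeline object built as in the example above, a call would then look like `video, _ = pipe(**sync_call_kwargs("reference.mp4", "edited_first_frame.png"))`.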
Limitations
- Trained at 512×512 / 81 frames; other resolutions and lengths work but are out of the training distribution and may degrade.
- The text branch is intentionally inert: only 3d1t was ever seen during training.
Model tree for SagiPolaczek/LTX-2.3-Sync-LoRA
Base model
Lightricks/LTX-2