LTX-2.3 Sync-LoRA (3d1t, rank 256)

An In-Context LoRA (IC-LoRA) for LTX-2.3 (22B) that performs first-frame-driven video editing.

Given:

a reference video (the motion / identity to preserve), and
an edited first frame (a single image showing the desired edit applied to frame 0),

the model generates the full edited video — the edit from the first frame is propagated and kept in sync with the reference video's motion across all frames.

This is the rank-256 3d1t variant of the Sync-LoRA.

⚡ The prompt: use the token `3d1t`

This LoRA was trained with a constant caption, the literal token `3d1t`, for every sample. The edit is driven entirely by the first-frame image and the reference video, not by a text description. So at inference:

Always set the text prompt to exactly `3d1t`.

Do not write a descriptive prompt (e.g. "a person with red hair"); describe the edit by supplying the edited first frame instead. `3d1t` acts as a special edit token that tells the model to edit via the first frame and the reference video.

How it works (conditioning)

| Input | How it's wired |
|---|---|
| Edited first frame (image) | image conditioning at latent index 0 (`VideoConditionByLatentIndex`, strength 1.0); replaces frame 0 |
| Reference video | IC-LoRA reference conditioning (`VideoConditionByReferenceLatent`, strength 1.0) |
| Text prompt | the constant token `3d1t` |

Video-only (no audio).

Training details

Base model: LTX-2.3 (22B), dev checkpoint
Type: IC-LoRA, rank = alpha = 256
Resolution / length (training): 512×512, 81 frames, 25 fps
Caption: constant token `3d1t` (text conditioning effectively removed)
File: `ltx-2.3-sync-lora-3d1t-r256.safetensors` (ComfyUI-style keys with the `diffusion_model.` prefix), step 5000
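
To confirm that a downloaded file uses the ComfyUI-style key layout noted above, the tensor names can be listed without loading the weights. This is a minimal sketch, assuming the `safetensors` package is installed. The helper names `split_by_prefix` and `inspect_lora` are hypothetical, and the example key names are made up for illustration rather than read from the actual file.

```python
from collections.abc import Iterable

PREFIX = "diffusion_model."


def split_by_prefix(keys: Iterable[str], prefix: str = PREFIX) -> tuple[list[str], list[str]]:
    """Split tensor names into those that start with `prefix` and those that do not."""
    with_prefix, without = [], []
    for key in keys:
        (with_prefix if key.startswith(prefix) else without).append(key)
    return with_prefix, without


def inspect_lora(path: str) -> None:
    """Print how many tensor names in a .safetensors file carry the prefix."""
    # Imported lazily so split_by_prefix works without safetensors installed.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        with_prefix, without = split_by_prefix(f.keys())
    print(f"{len(with_prefix)} keys with prefix, {len(without)} without")


# Illustrative key names only (hypothetical, not taken from the real file).
example = ["diffusion_model.block.0.lora_A.weight", "other.block.0.lora_A.weight"]
matched, unmatched = split_by_prefix(example)
```

Calling `inspect_lora("ltx-2.3-sync-lora-3d1t-r256.safetensors")` on the real file should report every key under the prefix if the layout is as described.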

Inference

Use the two-stage distilled IC-LoRA pipeline from the LTX-2 ltx-pipelines package (stage 1 at half resolution → ×2 spatial upscale → stage 2 refine).

Important for LTX-2.3: distillation is shipped as a LoRA, so stack the LTX-2.3 distilled-lora-384 together with this Sync-LoRA on both stages (8-step stage 1 + 3-step stage 2).

You will need (from the LTX-2.3 release):

ltx-2.3-22b-dev.safetensors (base)
ltx-2.3-22b-distilled-lora-384-1.1.safetensors (distillation LoRA)
ltx-2.3-spatial-upscaler-x2-1.1.safetensors (×2 spatial upscaler, applied between stage 1 and stage 2)
the Gemma text encoder

CLI (sketch)

python -m ltx_pipelines.ic_lora \
  --distilled-checkpoint-path  ltx-2.3-22b-dev.safetensors \
  --spatial-upsampler-path     ltx-2.3-spatial-upscaler-x2-1.1.safetensors \
  --gemma-root                 path/to/gemma \
  --lora ltx-2.3-sync-lora-3d1t-r256.safetensors 1.0 \
  --lora ltx-2.3-22b-distilled-lora-384-1.1.safetensors 1.0 \
  --prompt "3d1t" \
  --video-conditioning reference.mp4 1.0 \
  --images edited_first_frame.png 0 1.0 \
  --height 1024 --width 1024 --num-frames 81 --frame-rate 25 --seed 42 \
  --output-path out.mp4

Notes:

--prompt "3d1t" (the token) — required.
--images <png> 0 1.0 puts the edited frame at index 0; --video-conditioning <mp4> 1.0 is the reference.
Stage 1 runs at half the requested resolution, so --height/--width 1024 → stage-1 512 (the training resolution). Resolution must be divisible by 64; frames must satisfy frames % 8 == 1.
To match an input clip's duration, set --num-frames/--frame-rate accordingly (e.g. a 5.1 s, 30 fps clip → --num-frames 153 --frame-rate 30). Non-square aspect ratios (e.g. portrait 768×1024) work and avoid cropping a portrait input.
On LTX-2.3, stack the distilled-lora-384 on both stages. The stock pipeline leaves stage 2 LoRA-free because it expects an already-fused distilled checkpoint.
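
The resolution and frame-count rules in the notes above can be checked before launching a run. The following is a minimal sketch in plain Python and is not part of ltx-pipelines; the helper names `frames_for_clip`, `stage1_size`, and `check_args` are hypothetical. It assumes stage 1 runs at exactly half the requested height and width, as described above, and picks the frame count closest to the clip length that satisfies frames % 8 == 1.

```python
def frames_for_clip(duration_s: float, fps: float) -> int:
    """Return the frame count closest to duration_s * fps with n % 8 == 1."""
    raw = duration_s * fps
    # Valid counts have the form 8*k + 1, so round k to the nearest integer.
    k = max(0, round((raw - 1) / 8))
    return 8 * k + 1


def stage1_size(height: int, width: int) -> tuple[int, int]:
    """Stage 1 runs at half the requested resolution."""
    return height // 2, width // 2


def check_args(height: int, width: int, num_frames: int) -> list[str]:
    """Return a list of constraint violations (empty when the arguments are valid)."""
    problems = []
    if height % 64 != 0 or width % 64 != 0:
        problems.append(f"{width}x{height}: height and width must be divisible by 64")
    if num_frames % 8 != 1:
        problems.append(f"{num_frames} frames: num_frames % 8 must equal 1")
    return problems


print(frames_for_clip(5.1, 30))   # 153, matching the 5.1 s / 30 fps example
print(stage1_size(1024, 1024))    # (512, 512), the training resolution
print(check_args(768, 1024, 81))  # [] (portrait 768x1024 with 81 frames is valid)
```

Rounding to the nearest valid count means the output clip can be a few frames shorter or longer than the input; trim or pad the reference video accordingly if an exact match matters.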

Python (building blocks)

from ltx_core.loader import LTXV_LORA_COMFY_RENAMING_MAP, LoraPathStrengthAndSDOps
from ltx_pipelines.ic_lora import ICLoraPipeline

sync = LoraPathStrengthAndSDOps("ltx-2.3-sync-lora-3d1t-r256.safetensors", 1.0, LTXV_LORA_COMFY_RENAMING_MAP)
distilled = LoraPathStrengthAndSDOps("ltx-2.3-22b-distilled-lora-384-1.1.safetensors", 1.0, LTXV_LORA_COMFY_RENAMING_MAP)

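# The dev checkpoint is passed in the "distilled" slot; on LTX-2.3 the
# distillation itself comes from the distilled LoRA stacked in `loras`.
pipe = ICLoraPipeline(
    distilled_checkpoint_path="ltx-2.3-22b-dev.safetensors",
    spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
    gemma_root="path/to/gemma",
    loras=[sync, distilled],
)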
video, _ = pipe(
    prompt="3d1t",                                   # the token
    seed=42, height=1024, width=1024, num_frames=81, frame_rate=25,
    images=[("edited_first_frame.png", 0, 1.0)],     # edit at frame 0
    video_conditioning=[("reference.mp4", 1.0)],     # reference video
)

Limitations

Trained at 512×512 / 81 frames; other resolutions and lengths work but are out of the training distribution and may degrade.
The text branch is intentionally inert — only 3d1t was ever seen during training.
