LTX-2.3 22B IC-LoRA Deblur (v2)

This is a Deblur IC-LoRA trained on top of the LTX-2.3-22B foundation model. It restores sharpness to out-of-focus (defocused) video by conditioning on the blurry clip and regenerating it in sharp focus, preserving the original subject, framing, and scene geometry.

Model Files

ltx-2.3-22b-ic-lora-deblur-0.9.safetensors

The shipped checkpoint is step 1000. This run was planned for 1500 steps and stopped early at 1000; the quality sweet spot is in the ~800–1000 range, and step 1000 is the recommended default. Earlier checkpoints (steps 100–900) are available from the training run if you want to trade restoration strength for a gentler effect.

Model Details

Base Model: LTX-2.3-22B Video
Training Type: IC-LoRA (video-to-video, paired reference→target)
Control Type: Defocus/out-of-focus blur — the model conditions on a blurry reference video and outputs the sharp version
Reference Downscale Factor: 1 (the reference is processed at the same resolution as the output)
Pipeline details: No special pre/post color transform. Reference (blurry) and target (sharp) share identical content; only focus/sharpness differ.

Intended Use & Out-of-Scope

Intended use: Recovering sharpness from genuinely out-of-focus or softly defocused footage — landscape and portrait, mixed real-world content (people, wildlife, nature, cities, food, night). Designed to be driven by the production IC-LoRA video-to-video inference pipeline at native 1080p.

Out of scope: Motion-blur removal (the dataset contains no temporal/motion blur), heavy compression-artifact repair, denoising, or super-resolution of already-sharp footage. Extreme blur where the underlying content is essentially destroyed will be hallucinated rather than faithfully reconstructed.

Control Signal Requirements

Control signal type: Spatial defocus blur (the degradation the model inverts).
Expected input: A single video clip — the blurry footage — supplied as the IC-LoRA reference.
Preprocessing: None. Feed the blurry video directly; no extractor, mask, or normalization is required.
Alignment: The reference drives content directly. Results are best when the reference is run through the standard IC-LoRA pipeline at the trained bucket (960×544, 121 frames at 24 fps); the production pipeline handles resolution and length bucketing.
Mask support: Not supported — the effect is applied to the whole frame.

How It Works

The IC-LoRA conditions on the reference (blurry) video's latents together with a dual-panel "DEBLUR" prompt that describes the scene and asks for the same scene in sharp focus. Because the reference stays attached for the entire denoise (stage-1-only inference, see Usage), subject identity, framing, and background geometry are preserved while focus and sharpness are restored. The trained convention is a two-part caption:

Reference shows <scene description>, heavily out of focus with soft defocused blur and no fine detail. Edited shows the same scene in sharp focus with crisp detail and clean edges. DEBLUR <scene description> Subject identity, framing, and background geometry are identical to the reference; only focus and sharpness differ between reference and edited.
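The caption convention above can be assembled programmatically. The following is a minimal sketch, not the trainer's own code: the helper name `build_deblur_prompt` and the example scene are hypothetical, and it assumes the scene description is a short phrase without its own closing period (the sketch adds one after the second occurrence).

```python
def build_deblur_prompt(scene: str) -> str:
    """Assemble the two-part DEBLUR caption for one scene description.

    Hypothetical helper: the wording follows the trained convention, but the
    handling of trailing punctuation is an assumption.
    """
    scene = scene.strip().rstrip(".")
    return (
        f"Reference shows {scene}, heavily out of focus with soft defocused "
        "blur and no fine detail. Edited shows the same scene in sharp focus "
        "with crisp detail and clean edges. "
        f"DEBLUR {scene}. "
        "Subject identity, framing, and background geometry are identical to "
        "the reference; only focus and sharpness differ between reference "
        "and edited."
    )


# Example scene description (made up for illustration).
prompt = build_deblur_prompt("a red fox walking through snow at dusk")
```

Keeping the scene description identical in both slots mirrors the template, where the same `<scene description>` appears once in the reference/edited pair and once after the `DEBLUR` keyword.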

Usage

🔌 ComfyUI

Copy ltx-2.3-22b-ic-lora-deblur-0.9.safetensors into models/loras.
Load the LTX-2.3-22B base model and add the LoRA.
Use an IC-LoRA (video-to-video) workflow from the LTX-2 ComfyUI repository, which wires the reference/guide nodes correctly. Connect the blurry clip as the reference/control video.
Start at LoRA strength 1.0 and lower toward 0.8 if the output over-sharpens (haloing/ringing).

Production pipeline (recommended)

Evaluate and ship with the IC-LoRA video-to-video pipeline (python -m ltx_pipelines.ic_lora) using the identity-safe, stage-1-only native hi-res recipe. It renders on a 2× canvas and decodes the half-canvas as the final 1920×1088, keeping the reference attached for the whole denoise so both identity and sharpness hold at full resolution. The dev-trained LoRA loads cleanly onto the ltx-2.3-22b-distilled-1.1 inference base. Avoid the trainer's basic scripts/inference.py for production output, and avoid the two-stage path for identity-critical clips.

Recommended Settings

LoRA strength / weight: 1.0 (sweep 0.5–1.0 if it over-modifies — oversaturation, baked-in artifacts, or haloing).
Resolution & frames: Trained at 960×544 (landscape and portrait), 121 frames @ 24 fps; generates well at native 1920×1088 via the stage-1-only pipeline.
Prompting: Follow the trained DEBLUR dual-panel convention above. The reference video does most of the work; the prompt mainly anchors the scene and the "sharp focus, crisp detail, clean edges" intent.
Suggested negative prompt: worst quality, blurry, out of focus, defocused, soft, hazy, smeared, low detail, jittery, distorted, oversharpened, haloing, ringing (used during training validation; note the production distilled pipeline does not take a negative prompt).

References

Code: GitHub Repository
ComfyUI: ComfyUI-LTXVideo
IC-LoRA docs: IC-LoRA usage guide

Tips & Troubleshooting

Over-sharpening / ringing or halos: lower --lora-strength toward 0.8.
Effect looks weak at 1080p: lower the native generation resolution (e.g. --width 1536 --height 896) closer to the training bucket.
Identity drift at high res: use the stage-1-only default rather than the two-stage path — stage 2 has no reference anchor and drifts on identity-critical content.
Motion blur not removed: expected — the model was trained only on spatial defocus, not temporal/motion blur.

Dataset

The model was trained on a proprietary dataset of 500 (blurry → sharp) video pairs built specifically for in-context deblur training (details below).

Dataset construction (v2)

Motivation. The v1 deblur dataset applied a single degradation recipe to every clip (boxblur + a light gblur). That cheap disc-defocus look was too synthetic — the LoRA learned to invert that specific filter rather than real optical blur and generalized poorly to genuine out-of-focus footage. v2 spans three blur families at varied strengths so the model sees the full "blurry → sharp" distribution it will be asked to invert.

Pairs are built for IC-LoRA training as:

target (videos/): the sharp original clip
reference (references/): the same clip degraded with one blur style + strength

Source footage. 5-second clips at native resolution — a deliberate mix of 4K and 1080p, landscape and portrait (kept native; the trainer's resolution bucketing handles downscaling). 395 clips reused from an existing stock pool plus 150 new Pexels clips across 8 themes (city, nature, ocean, people, food, wildlife, portraits, night), deduplicated, trimmed to exactly 5 s (libx264 -crf 18, audio stripped, yuv420p). Combined into a 545-clip pool; the build draws 500.
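As a rough illustration of the trim step, the sketch below builds an ffmpeg argument list with the encoder settings named above (libx264, CRF 18, audio stripped, yuv420p, 5 s duration). The function name and file paths are hypothetical, and the original build may have used additional options (for example a seek offset) that are not documented here.

```python
def trim_command(src: str, dst: str, seconds: float = 5.0) -> list[str]:
    """Return an ffmpeg argv that trims `src` to `seconds` and re-encodes it.

    Hypothetical helper; only the encoder settings come from the description.
    """
    return [
        "ffmpeg", "-y",            # overwrite the output if it already exists
        "-i", src,
        "-t", f"{seconds:g}",      # keep only the first `seconds` of the clip
        "-an",                     # strip audio
        "-c:v", "libx264", "-crf", "18",
        "-pix_fmt", "yuv420p",
        dst,
    ]


# Placeholder paths for illustration only.
cmd = trim_command("pool/clip_0001.mp4", "videos/clip_0001.mp4")
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.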

Composition (500 clips).

Style	Count	Degradation
`box`	150	`boxblur=lr=L,gblur=sigma=1` — flat disc defocus (the v1 look, retained for coverage)
`gauss`	150	`gblur=sigma=S` — plain gaussian blur
`disk`	200	Physically realistic lens defocus (largest share — highest fidelity)

Within each style, clips are spread across four strength tiers (light / medium / heavy / extreme) by round-robin assignment, so the tier counts within a style differ by at most one.

Resolution-scaled strength. A fixed pixel radius blurs a 4K frame far less (perceptually) than a 1080p one. Every strength is anchored at a 1080p long edge (1920 px) and scaled per clip by long_edge / 1920, so a light 4K clip gets ~2× the pixel radius of a light 1080p clip and the two look perceptually equivalent.

Tier	`box` boxblur `lr`	`gauss` `sigma`	`disk` radius (px)
light	6	3	8
medium	12	6	14
heavy	20	10	20
extreme	30	16	28
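The resolution scaling can be sketched as follows, using the tier values from the table and the long_edge / 1920 rule. The function names are illustrative, and two details are assumptions rather than documented behavior: rounding the box radius to an integer, and leaving the trailing `gblur=sigma=1` in the `box` recipe unscaled.

```python
# Base strengths at a 1920 px long edge, copied from the tier table.
BASE_STRENGTH = {
    "box":   {"light": 6, "medium": 12, "heavy": 20, "extreme": 30},
    "gauss": {"light": 3, "medium": 6,  "heavy": 10, "extreme": 16},
    "disk":  {"light": 8, "medium": 14, "heavy": 20, "extreme": 28},
}
REFERENCE_LONG_EDGE = 1920


def scaled_strength(style: str, tier: str, width: int, height: int) -> float:
    """Scale the base strength by the clip's long edge relative to 1920 px."""
    scale = max(width, height) / REFERENCE_LONG_EDGE
    return BASE_STRENGTH[style][tier] * scale


def ffmpeg_filter(style: str, tier: str, width: int, height: int) -> str:
    """Build the ffmpeg filter string for the two filter-based styles."""
    strength = scaled_strength(style, tier, width, height)
    if style == "box":
        # Assumption: the radius is rounded and the light gblur stays fixed.
        return f"boxblur=lr={round(strength)},gblur=sigma=1"
    if style == "gauss":
        return f"gblur=sigma={strength:g}"
    raise ValueError("the disk style is applied outside ffmpeg filters")
```

For example, `scaled_strength("box", "light", 3840, 2160)` gives 12.0, twice the 1080p value, and a portrait 2160×3840 clip is scaled the same way because only the long edge matters.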

Realistic disk defocus. Rather than an ffmpeg filter, each frame is convolved with a uniform circular kernel (the optical circle of confusion): cropdetect excludes letterbox/pillarbox bars so they don't smear in; convolution is done in linear light (sRGB → linear → convolve → sRGB) to avoid muddy gamma-space blur; BORDER_REPLICATE avoids dark edge halos; and it is purely spatial (no temporal blur, so no motion ghost-trails). Frames are streamed raw (bgr24) out of ffmpeg, processed with NumPy/OpenCV, and piped back into libx264, preserving source resolution, frame rate, and frame count.
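A minimal per-frame sketch of this idea is shown below, assuming NumPy only. It is not the original build script: the function names are illustrative, edge padding via `np.pad(mode="edge")` stands in for OpenCV's BORDER_REPLICATE, the standard sRGB transfer functions are used for the linear-light round trip, and the cropdetect step and ffmpeg piping are omitted.

```python
import numpy as np


def disk_kernel(radius: int) -> np.ndarray:
    """Uniform circular kernel (circle of confusion), normalized to sum to 1."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = (x * x + y * y <= radius * radius).astype(np.float64)
    return kernel / kernel.sum()


def srgb_to_linear(v: np.ndarray) -> np.ndarray:
    return np.where(v <= 0.04045, v / 12.92, ((v + 0.055) / 1.055) ** 2.4)


def linear_to_srgb(v: np.ndarray) -> np.ndarray:
    return np.where(v <= 0.0031308, v * 12.92, 1.055 * v ** (1 / 2.4) - 0.055)


def disk_defocus(frame_u8: np.ndarray, radius: int) -> np.ndarray:
    """Blur one HxWx3 uint8 frame with a disk kernel in linear light."""
    kernel = disk_kernel(radius)
    linear = srgb_to_linear(frame_u8.astype(np.float64) / 255.0)
    # Replicate edge pixels so the borders do not darken.
    padded = np.pad(linear, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = frame_u8.shape[:2]
    out = np.zeros_like(linear)
    # Accumulate shifted copies weighted by the kernel (the kernel is symmetric,
    # so correlation and convolution coincide).
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            weight = kernel[dy, dx]
            if weight:
                out += weight * padded[dy:dy + h, dx:dx + w]
    srgb = linear_to_srgb(np.clip(out, 0.0, 1.0))
    return np.round(srgb * 255.0).astype(np.uint8)


# Tiny synthetic frame: black left half, white right half.
frame = np.zeros((16, 16, 3), dtype=np.uint8)
frame[:, 8:] = 255
blurred = disk_defocus(frame, 3)
```

The explicit loop keeps the sketch dependency-free; for full-resolution frames a library convolution such as OpenCV's `filter2D` would be far faster.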

Reproducibility & parity. A seeded RNG (seed 42) shuffles the pool, partitions it into per-style counts (150/150/200), and assigns the four strength tiers round-robin within each style, recording every assignment to recipes.json (resumable). All references encode with libx264 -crf 18 -preset slow -pix_fmt yuv420p, audio stripped, preserving source resolution/frame-count/pixel-format. A verification pass confirmed 500/500 pairs valid, 0 mismatches. Captions are generated on the training machine (visual-only) and merged into dataset.json before training. After preprocessing, 1 clip was filtered for insufficient frames, leaving 499 valid training pairs.
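The seeded assignment can be sketched with the standard library alone. This is an illustration under assumptions, not the original script: the clip names are placeholders, the order in which styles take their slice of the shuffled pool is assumed, and the original RNG may differ from Python's `random.Random`.

```python
import random

STYLE_COUNTS = {"box": 150, "gauss": 150, "disk": 200}
TIERS = ["light", "medium", "heavy", "extreme"]


def assign_recipes(pool, seed=42):
    """Shuffle the pool, partition it by style, and assign tiers round-robin."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)

    recipes = {}
    start = 0
    for style, count in STYLE_COUNTS.items():
        for i, clip in enumerate(shuffled[start:start + count]):
            recipes[clip] = {"style": style, "tier": TIERS[i % len(TIERS)]}
        start += count
    return recipes


# Placeholder names for a 545-clip pool; the build draws the first 500.
pool = [f"clip_{i:04d}.mp4" for i in range(545)]
recipes = assign_recipes(pool)
```

Because the shuffle is seeded, rerunning the function yields the same mapping, which is what makes a persisted record such as recipes.json resumable. Note that 150 clips do not divide evenly by four, so the round-robin gives 38/38/37/37 per tier for `box` and `gauss`, and 50 each for `disk`.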

Training

Technique: IC-LoRA (rank 128, alpha 128, dropout 0.05) on the DiT transformer, targeting self-attention (attn1: to_q, to_k, to_v, to_out.0) and the feed-forward layers (ff.net.0.proj, ff.net.2); cross-attention (attn2) is intentionally not targeted.
Hyperparameters: bf16 mixed precision, AdamW, learning rate 1.5e-4, cosine schedule, max_grad_norm 1.0, gradient checkpointing on, first_frame_conditioning_p 0.15, shifted-logit-normal flow-matching timestep sampling.
Resolution / data: preprocessed at 960×544 (landscape + portrait, 121/97/89-frame buckets), 499 valid (blurry→sharp) pairs.
Steps: planned 1500, stopped at step 1000 (recommended checkpoint); checkpoints saved every 100 steps.
Infrastructure: LTX-2 Community Trainer, DDP across 8× NVIDIA H100.
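To make the target selection concrete, the sketch below shows one way a suffix match could pick the self-attention and FFN projections while skipping cross-attention. It is an assumption-laden illustration: the `transformer_blocks.0.` prefix and the example module names are hypothetical, the expanded names `to_k`, `to_v`, and `to_out.0` are inferred from the abbreviated list above, and the trainer's actual matching logic may differ.

```python
# Suffixes for the targeted projections (self-attention and feed-forward).
TARGET_SUFFIXES = (
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "ff.net.0.proj", "ff.net.2",
)

# In the common LoRA formulation the update is scaled by alpha / rank.
LORA_RANK = 128
LORA_ALPHA = 128
LORA_SCALE = LORA_ALPHA / LORA_RANK


def is_lora_target(module_name: str) -> bool:
    """True if the module name ends with one of the targeted suffixes."""
    return module_name.endswith(TARGET_SUFFIXES)


# Hypothetical module names for a single transformer block.
example_modules = [
    "transformer_blocks.0.attn1.to_q",
    "transformer_blocks.0.attn1.to_out.0",
    "transformer_blocks.0.attn2.to_q",   # cross-attention: not targeted
    "transformer_blocks.0.ff.net.0.proj",
    "transformer_blocks.0.ff.net.2",
    "transformer_blocks.0.norm1",
]
selected = [name for name in example_modules if is_lora_target(name)]
```

With rank and alpha both at 128, the scale factor is 1.0, so the learned update is applied unscaled before the user-facing LoRA strength is taken into account.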