LTX-2.3 22B "Helium": an audio-only IC-LoRA (experimental, proof of concept)
Last updated: 2026-05-31
An IC-LoRA for LTX-2.3-22B (distilled) where the in-context reference is only audio: no image, no video, no init frame. The model still generates audio and video jointly, but the only thing you hand it as a reference is an audio file. I trained it on a narrow, measurable task (a voiced tone is meant to steer the pitch of the generated speech, the "helium" idea), and in practice the audio reference visibly steers the whole output: a voice or a music track shifts the speaker's identity, the setting, even the era. That broader transfer is the interesting part. The pitch task was just a clean probe for whether an audio-only reference could do anything at all.
Status: evaluated a few different ways, and it clearly produces interesting effects: the audio reference steers speaker identity, scene, and era. It has not been through a rigorous, controlled test, and the specific pitch-tracking it was trained for is not cleanly confirmed. Treat the examples below as illustrative single takes, not benchmarks.
Model Files
- lora_weights_step_02000.safetensors: the final checkpoint (2000 steps).
- lora_weights_step_01500.safetensors: an earlier checkpoint, about 8% apart by weight norm, included to compare. Both are clean (960 finite bf16 tensors each).
- ltx2_audio_reference.yaml: the training config.
- data_recipe.md: how the dataset was built.
- Training fork: fblissjr/LTX-2 @ audio-guidance-iclora-vtv (the audio_reference strategy).
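The claims on this card about the checkpoint files (tensor count, bf16 dtype, lora_A/lora_B naming, no alpha keys) can be spot-checked without loading any weights, because a safetensors file begins with an 8-byte little-endian header length followed by a JSON header. Below is a minimal sketch using only the Python standard library. The function names are mine and not part of any loader, and it reads metadata only, so it does not verify that the values are finite.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize_lora(path):
    """Summarize tensor names and dtypes (hypothetical helper, metadata only)."""
    header = read_safetensors_header(path)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    names = list(header)
    return {
        "tensors": len(names),
        "dtypes": sorted({header[n]["dtype"] for n in names}),
        "lora_A": sum(".lora_A." in n for n in names),
        "lora_B": sum(".lora_B." in n for n in names),
        "alpha_keys": sum(n.endswith(".alpha") for n in names),
        "prefixes": sorted({n.split(".", 1)[0] for n in names}),
    }
```

Calling summarize_lora("lora_weights_step_02000.safetensors") should, if the card is accurate, report 960 tensors, a single BF16 dtype, equal lora_A and lora_B counts, zero alpha keys, and diffusion_model as the only prefix.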
Model Details
- Base Model: LTX-2.3-22B distilled (ltx-2.3-22b-distilled-1.1), a joint audio-video model (which matters for the caveats).
- Control Type: Audio. One in-context audio reference, no image and no video. The model generates both modalities; only the reference is audio-only.
- What's adapted: the audio stack only (audio_attn1/audio_attn2, audio_ff), nothing on the video or cross-modal side. That audio-only cut is the bet of the experiment.
- Reference: appended clean at negative RoPE positions, out-of-timeline context (the ID-LoRA convention).
- Reference Downscale Factor: not applicable to audio (defaults to 1; the loader's "couldn't find reference_downscale_factor" warning is harmless).
- Format: plain LoRA safetensors, diffusers/PEFT convention (lora_A/lora_B under diffusion_model., no alpha keys), bf16. ComfyUI's LTX IC-LoRA loader reads it directly, no conversion.
Full hyperparameters are in Training Details below.
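To make "appended clean at negative RoPE positions" concrete, here is a toy sketch of one plausible reading. Target tokens sit on the timeline at positions 0..T-1, reference tokens are appended after them with positions -R..-1, and the reference is left un-noised while the target is noised. The function names, the exact index layout, and reading "clean" as a noise level of zero are my assumptions; the training fork may lay things out differently.

```python
def build_positions(num_target_tokens, num_reference_tokens):
    """Position index per token: the timeline first, then the out-of-timeline reference."""
    target = list(range(num_target_tokens))            # 0 .. T-1
    reference = list(range(-num_reference_tokens, 0))  # -R .. -1, never on the timeline
    return target + reference                          # reference is appended


def build_noise_levels(num_target_tokens, num_reference_tokens, sigma):
    """Noise level per token: target tokens get sigma, reference tokens stay clean."""
    return [sigma] * num_target_tokens + [0.0] * num_reference_tokens


positions = build_positions(4, 2)
noise = build_noise_levels(4, 2, sigma=0.8)
```

Because no reference position is ever non-negative, a reference token cannot collide with a timeline position, which is one way to read "out-of-timeline context".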
Using in ComfyUI
- Copy the LoRA into models/loras.
- Install the custom nodes from ComfyUI-AudioLoopHelper. The audio IC-LoRA loader and the Add Audio IC-LoRA Guide node are not in stock ComfyUI-LTXVideo, so you need this repo.
- Use the example workflow audio-ic-lora_single-pass.json: load the LoRA with the audio IC-LoRA loader, feed your audio through the Add Audio IC-LoRA Guide node with a neutral caption, and generate audio and video together.
How it works and the eval notes are in docs/audio_iclora.
Usage recommendations
- Strength: start at ~0.5; the working band is 0.3 to 0.5. Higher values increase instability, and toward 1.0 the video garbles. The usable ceiling depends on the reference and the generation, so if the output breaks up, back off.
- Caption: keep it neutral and attribute-free, so the audio reference (not the caption) drives the attribute.
- Reference: the reference is audio only. Any audio works; the effect lands as a global attribute (voice, genre, era, scene).
Examples
Single takes: same model, a neutral caption, and the only reference fed in is an audio file. Each clip carries its generated audio, so the effect is audible (use the player controls).
Restaurant audio produces a restaurant scene. Prompt: A person says calmly, "Bandoco is good".
1960s broadcast audio gives the video a vintage look. Prompt: A person says calmly, "Bandoco is good".
Hip-hop audio shifts the speaker and setting. Prompt: A person says, "Hello, how are you doing today?".
A high-tempo dance track speeds up the delivery. Prompt: A person says, "I made an audio IC Lora".
A voiced tone drops the generated speech to a whisper (the trained "helium" task). Prompt: A person says, "I made an audio IC Lora".
Training Details
| Parameter | Value |
|---|---|
| Base model | LTX-2.3-22B distilled (ltx-2.3-22b-distilled-1.1) |
| Training framework | ltx-trainer (Lightricks), audio-only fork |
| Training strategy | IC-LoRA (audio_reference) |
| Checkpoints | step 2000 (final), step 1500 (comparison) |
| LoRA rank / alpha | 32 / 32, no dropout |
| Target modules | audio_attn1, audio_attn2 (to_k/q/v/out), audio_ff.net.0.proj, audio_ff.net.2 (all 48 blocks) |
| Learning rate | 2e-4, linear schedule, adamw8bit |
| Mixed precision | bf16 |
| Quantization | int8-quanto + block-swap (fits a single 24 GB 4090) |
| Batch size | 1 (gradient checkpointing) |
| Steps | 2000 |
| Training dataset | 292 audio-reference / audio-video-target pairs |
| Resolution | 256x256, 121 frames, 25 fps |
| Reference | 2-second voiced tone, appended clean at negative RoPE positions |
| First-frame conditioning | 0.0 (dropped, so the reference is the only non-caption input) |
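The target-module row and the "960 tensors" figure in the file list can be cross-checked with simple counting. The sketch below enumerates the adapted modules, assuming that to_k/q/v/out means four single linear projections per attention module and that each adapted module contributes one lora_A and one lora_B tensor. The block prefix transformer_blocks is a hypothetical name for illustration, not taken from the checkpoint.

```python
NUM_BLOCKS = 48
ATTN_MODULES = ("audio_attn1", "audio_attn2")
ATTN_PROJECTIONS = ("to_q", "to_k", "to_v", "to_out")
FF_LAYERS = ("audio_ff.net.0.proj", "audio_ff.net.2")


def target_module_names(num_blocks=NUM_BLOCKS, block_prefix="transformer_blocks"):
    """List every adapted module; block_prefix is a hypothetical naming choice."""
    names = []
    for i in range(num_blocks):
        for attn in ATTN_MODULES:
            for proj in ATTN_PROJECTIONS:
                names.append(f"{block_prefix}.{i}.{attn}.{proj}")
        for ff in FF_LAYERS:
            names.append(f"{block_prefix}.{i}.{ff}")
    return names


modules = target_module_names()
tensors = 2 * len(modules)  # one lora_A and one lora_B per adapted module
print(len(modules), tensors)  # 480 960
```

Under those assumptions each block has 2 x 4 attention projections plus 2 feed-forward layers, so 10 modules per block, 480 modules overall, and 960 LoRA tensors, which is consistent with the tensor count reported for both checkpoints.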
Dataset
For the audio reference to be in control (not the caption), the reference and target must differ in content and share only the attribute. So the reference is a bare voiced tone, the target is real talking-head speech shifted to that pitch, and the caption is a constant pitch-free string. The recipe: take real dialogue video clips with audio, measure each clip's natural pitch, shift its speech a few semitones (capped at ±7) while keeping timing so the lips still line up, render at 256x256, and synthesize a 2-second tone at that pitch, with timbre varied independently so the model cannot read timbre instead. The caption is always a person speaking. I kept a pair only if the shifted pitch re-measured within 30 Hz of target, which left 292 clean pairs spanning 76 to 353 Hz.
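The arithmetic behind that recipe is small enough to sketch. A shift of n semitones multiplies pitch by 2^(n/12), the shift is clamped to ±7, a pair is kept only if the re-measured pitch lands within 30 Hz of the target, and the reference is a 2-second tone whose harmonic mix (timbre) is chosen independently of its pitch. This is not the actual pipeline from data_recipe.md: the function names, the 16 kHz sample rate, and treating "within 30 Hz" as inclusive are my assumptions, and the timing-preserving pitch shift and the pitch measurement themselves are not shown.

```python
import math

MAX_SEMITONES = 7      # cap from the recipe
TOLERANCE_HZ = 30.0    # acceptance window from the recipe


def shifted_pitch_hz(natural_hz, semitones):
    """Target pitch after a shift clamped to +/- MAX_SEMITONES."""
    clamped = max(-MAX_SEMITONES, min(MAX_SEMITONES, semitones))
    return natural_hz * 2 ** (clamped / 12)


def keep_pair(target_hz, remeasured_hz):
    """Keep the pair only if the shifted speech re-measures close to the target."""
    return abs(target_hz - remeasured_hz) <= TOLERANCE_HZ


def tone_samples(pitch_hz, harmonic_gains, seconds=2.0, sample_rate=16000):
    """Harmonic tone at pitch_hz; harmonic_gains sets timbre independently of pitch.

    No anti-aliasing is applied, so keep pitch_hz * len(harmonic_gains)
    below sample_rate / 2.
    """
    total = sum(abs(g) for g in harmonic_gains) or 1.0  # normalize into [-1, 1]
    count = int(seconds * sample_rate)
    return [
        sum(
            g * math.sin(2 * math.pi * pitch_hz * (k + 1) * t / sample_rate)
            for k, g in enumerate(harmonic_gains)
        ) / total
        for t in range(count)
    ]
```

For example, a clip with a natural pitch of 200 Hz and a requested shift of +12 semitones would be clamped to +7, and two tones built with different harmonic_gains at the same pitch_hz share pitch but not timbre.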
The hole: the target pitch is a bounded shift off each clip's own pitch, so the two correlate at about 0.75, and the speaker's face also hints at their natural pitch. So the tone is not the only route to the answer. That does not wreck a positive result (sweep the tone with everything else fixed and watch the pitch follow), but it makes a null ambiguous. The fix is a decorrelated rerun (random target pitch, formant-preserving shift); that is the next dataset.
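A toy simulation shows why a bounded shift leaves the target pitch correlated with the speaker's natural pitch, and why drawing the target at random removes that shortcut. Everything here is synthetic: the 90 to 300 Hz natural-pitch range and the uniform distributions are my assumptions, and the numbers it produces are not the dataset's measured correlation.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(num_clips=2000, seed=0):
    """Return (corr with bounded shift, corr with random target) on synthetic pitches."""
    rng = random.Random(seed)
    # Assumed spread of natural speaking pitch, in Hz.
    natural = [rng.uniform(90.0, 300.0) for _ in range(num_clips)]
    # Current recipe: target is the natural pitch shifted by at most +/-7 semitones.
    bounded = [f * 2 ** (rng.uniform(-7.0, 7.0) / 12) for f in natural]
    # Proposed rerun: target drawn independently of the speaker.
    decorrelated = [rng.uniform(76.0, 353.0) for _ in natural]
    return pearson(natural, bounded), pearson(natural, decorrelated)


bounded_r, decorrelated_r = simulate()
```

In this toy setup bounded_r comes out strongly positive while decorrelated_r sits near zero, which is the property the decorrelated rerun is after: with the correlation gone, a pitch that follows the tone can only have come from the tone.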
Caveats
- The pitch-tracking eval is not a clean win. A controlled reference-swap (fixed prompt and seed, sweep the tone) came back noisy: the base arm leaks the reference tone into the measured audio and the LoRA arm often comes out unvoiced, so the with-vs-without pitch slope is not trustworthy yet. The transfer effect in the examples is obvious; the narrow pitch number is not.
- The base model is trained on audio and video jointly. By holding video back (audio-only reference, audio-only adaptation) I am deliberately isolating the audio path, and it is genuinely possible that is the wrong cut and pitch needs the video in the loop. A null would not prove audio-only is impossible; the next lever is the cross-modal bridge modules (audio_to_video_attn, video_to_audio_attn) the identity-transfer ID-LoRA uses.
- Where this could go if pursued: voice / accent / emotion transfer, audio-driven style and scene. This was just a POC to see if it works. It seems to?
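The "with-vs-without pitch slope" from the reference-swap caveat can be written down as a least-squares slope of measured output pitch against reference tone pitch, computed once per arm (base and LoRA) with prompt and seed fixed. The sketch below assumes unvoiced takes are recorded as None and skipped; the function names are hypothetical and this is not the evaluation script behind the numbers discussed above.

```python
def pitch_slope(reference_hz, measured_hz):
    """Least-squares slope of measured pitch vs reference pitch, skipping unvoiced takes.

    Returns None when fewer than two voiced takes remain or when the
    reference pitch does not vary among them.
    """
    pairs = [(r, m) for r, m in zip(reference_hz, measured_hz) if m is not None]
    if len(pairs) < 2:
        return None
    mean_r = sum(r for r, _ in pairs) / len(pairs)
    mean_m = sum(m for _, m in pairs) / len(pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0:
        return None
    cov = sum((r - mean_r) * (m - mean_m) for r, m in pairs)
    return cov / var_r


def unvoiced_fraction(measured_hz):
    """Share of takes with no measurable pitch."""
    if not measured_hz:
        return 0.0
    return sum(m is None for m in measured_hz) / len(measured_hz)
```

A slope near 1 would mean the output pitch tracks the tone and a slope near 0 would mean it ignores it. Reporting unvoiced_fraction alongside the slope matters here, since a slope fitted on only a handful of voiced takes says little, and a base arm that leaks the tone into the measured audio can show a high slope without any real steering.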
Citation
The trick of putting the audio reference at negative RoPE positions comes from Lightricks' LipDub and the ID-LoRA work below. My read on how the conditioning behaves is from poking at it in ComfyUI, not a clean ablation, so take the mechanism explanations as informed guesses.
@misc{dahan2026idlora,
title = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
author = {Dahan, Aviad and Yanuka, Moran and Kraicer, Noa and Wolf, Lior and Giryes, Raja},
year = {2026},
eprint = {2603.10256},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}
Acknowledgments
Thanks to WepeNerd (HF) for the whole idea of an audio-only IC-LoRA and for thinking through how an IC-LoRA learns (including "helium"); to the LTX-2 community trainer, Musubi Tuner, and Kijai's work for the infra; and to Throttlekitty for pulling me into LTX-2 in the first place.
License
Base model is by Lightricks under the LTX-2 community license. See https://github.com/Lightricks/LTX-2/blob/main/LICENSE for the full terms.