LTX-2.3-22B IC-LoRA: Audio-Only In-Context Context (experimental, proof of concept)

Last updated: 2026-06-01

An IC-LoRA for LTX-2.3-22B (distilled) where the in-context reference is only audio: no image, no video, no init frame. The model still generates audio and video jointly; the only thing you hand it as a reference is an audio file. We trained it on natural talking-head clips to map a speaker's voice to their identity, and in practice the audio reference visibly steers the generated speech and the speaker's on-screen mannerisms and energy. It is the successor to the pitch "Helium" proof-of-concept, this time on natural (un-perturbed) data and with the cross-modal bridge added.

Status: early. The transfer is clearly visible in single takes, but it has not been through a controlled, quantitative eval, and the narrow trained voice-to-identity mapping is not yet cleanly separated from broad emergent audio-to-AV transfer. Treat the examples as illustrative single takes, not benchmarks. If you want to test it and see what interesting things come out, please do, and share what you find. The point of releasing it is the process writeup below, so others can reproduce it, adapt it to a different task, or continue where we left off.

Model files (two checkpoints, same recipe, different cut)

Both are 1000-step adapters trained identically except for which modules they adapt:

  • cross_modal_step_01000.safetensors (~290 MB, 1728 bf16 tensors): adapts the audio stack and the cross-modal bridges (audio_to_video_attn, video_to_audio_attn). The bridge gives the audio reference a direct path into the video stream, so its effect on the video (mannerisms, energy, expression timing) is stronger. The default pick if you want the audio to move the picture.
  • audio_only_step_01000.safetensors (~164 MB, 960 tensors): adapts the audio stack only (audio_attn1/2, audio_ff), no cross-modal. The reference shapes the generated audio; the video follows via the frozen base coupling, so the video effect is subtler / more emergent and the base video path is perturbed the least.

Each seems to have its own strengths and weaknesses; we have not pinned down which is better for which use (that is one of the open questions below). Try both.
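
As a quick sanity check on the two tensor counts, the arithmetic below assumes that every adapted linear layer contributes exactly two tensors (lora_A and lora_B), that each attention module exposes four adapted projections (to_q, to_k, to_v, to_out), and that the feed-forward exposes two. These assumptions are our reading of the target-module list, not something read from the files. Under them, both counts imply the same whole number of blocks.

```python
# Assumed layout (not read from the checkpoints): every adapted linear
# contributes two tensors, lora_A and lora_B.
TENSORS_PER_LINEAR = 2
ATTN_LINEARS = 4  # to_q, to_k, to_v, to_out
FF_LINEARS = 2    # audio_ff.net.0.proj, audio_ff.net.2

# audio_attn1 + audio_attn2 + audio_ff
audio_linears = 2 * ATTN_LINEARS + FF_LINEARS
# audio_to_video_attn + video_to_audio_attn
bridge_linears = 2 * ATTN_LINEARS

audio_only_per_block = audio_linears * TENSORS_PER_LINEAR
cross_modal_per_block = (audio_linears + bridge_linears) * TENSORS_PER_LINEAR


def implied_blocks(total_tensors, tensors_per_block):
    """Return the block count implied by a tensor total, or fail if uneven."""
    blocks, remainder = divmod(total_tensors, tensors_per_block)
    if remainder:
        raise ValueError("tensor count does not divide evenly into blocks")
    return blocks


print(implied_blocks(960, audio_only_per_block))    # 48
print(implied_blocks(1728, cross_modal_per_block))  # 48
```

If a downloaded file reports a different tensor count, that is a cheap first hint that it is not one of these two cuts.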

What it is

  • Base: LTX-2.3-22B distilled (ltx-2.3-22b-distilled-1.1), a joint audio-video model (this matters for the caveats). It is guidance-distilled, so it runs at CFG = 1 (no classifier-free guidance; strength is the only inference knob).
  • Control: one in-context audio reference, appended clean at negative RoPE positions as out-of-timeline context (the ID-LoRA convention). No image, no video, no init frame. The model generates both modalities; only the reference is audio-only.
  • Format: plain LoRA safetensors, PEFT convention (lora_A/lora_B under diffusion_model., no alpha keys), bf16. ComfyUI's LTX IC-LoRA loader reads it directly, no conversion.
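
The format claims above can be checked without loading any weights. The sketch below reads only the JSON header of a .safetensors file (an 8-byte little-endian length followed by that many bytes of JSON) using the standard library, then checks the key convention. The helper names and the exact checks are ours for illustration; they are not part of ComfyUI or the trainer.

```python
import json
import struct


def read_safetensors_keys(path):
    """Return the tensor names stored in a .safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return sorted(header)


def check_peft_convention(keys, prefix="diffusion_model."):
    """Return a list of human-readable problems; empty means the keys look right."""
    problems = []
    for key in keys:
        if not key.startswith(prefix):
            problems.append(f"unexpected prefix: {key}")
        if key.endswith(".alpha"):
            problems.append(f"alpha key present: {key}")
        elif ".lora_A." not in key and ".lora_B." not in key:
            problems.append(f"not a lora_A/lora_B tensor: {key}")
    return problems
```

For example, `check_peft_convention(read_safetensors_keys("cross_modal_step_01000.safetensors"))` should return an empty list if the file matches the description above.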

Using in ComfyUI

  1. Copy the LoRA into models/loras.
  2. Install the custom nodes from ComfyUI-AudioLoopHelper (recommended). They add a key-match trust gate (a mis-bound LoRA fails loud instead of silently doing nothing), raw-audio convenience, a per-stream loader (separate strength for the cross-modal bridge vs the audio modules), and an Advanced guide (reference window + scale). They are not strictly required: the core path also runs on stock nodes (see "Using without the custom nodes" below).
  3. Load the LoRA, feed your audio through the guide node with a neutral caption, and generate audio and video together from an empty latent (text-to-video; no image or video input). Example workflow: audio-ic-lora_single-pass.json.
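
The idea behind the key-match trust gate in step 2 can be sketched in a few lines: count how many LoRA tensors point at a weight that actually exists in the model, and raise if none do. This is an illustration of the idea, not the custom node's code; the function names and the assumption that LoRA keys and model keys share the same prefix are hypothetical.

```python
def count_bound_keys(lora_keys, model_keys):
    """Count LoRA tensors whose target weight exists in the model."""
    model_keys = set(model_keys)
    bound = 0
    for key in lora_keys:
        for marker in (".lora_A.weight", ".lora_B.weight"):
            if key.endswith(marker):
                # Hypothetical naming: the target is the same path ending in ".weight".
                target = key[: -len(marker)] + ".weight"
                if target in model_keys:
                    bound += 1
    return bound


def require_binding(lora_keys, model_keys):
    """Fail loud when a LoRA binds to nothing, instead of silently doing nothing."""
    bound = count_bound_keys(lora_keys, model_keys)
    if bound == 0:
        raise RuntimeError("LoRA bound to 0 model weights: wrong base model?")
    return bound
```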

Usage notes

  • Strength: start ~0.5. Both checkpoints garble at high strength; the working band is roughly 0.3 to 0.75 and the ceiling depends on the reference and the generation. If it breaks up, back off.
  • With the per-stream loader you can push bridge_strength above audio_strength to amplify the audio-to-video coupling specifically.
  • Caption: keep it neutral and attribute-free, so the audio reference (not the caption) drives the attribute.
  • The reference audio matters a lot, in ways we do not fully understand yet: level/loudness, quality, and content all seem to change the result. Expect to experiment.
  • Audio VAE precision: nothing to set in ComfyUI. ComfyUI runs the LTX audio VAE in fp32 automatically (the weights ship bf16 and are upcast to float32), which matches the training encode, so the bf16 filename/loader widget doesn't change this. The video VAE runs bf16, also what training used. You only need to handle fp32 yourself if you encode the reference outside ComfyUI.
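
The per-stream strength idea from the notes above amounts to picking a strength per LoRA key, based on which module the key belongs to. The sketch below is an illustration only, not the loader's implementation; the function name and the example key layout are hypothetical, while the module names come from the target-module list.

```python
BRIDGE_MODULES = ("audio_to_video_attn", "video_to_audio_attn")


def strength_for_key(key, audio_strength=0.5, bridge_strength=0.5):
    """Pick the strength for one LoRA tensor from the module named in its key."""
    if any(module in key for module in BRIDGE_MODULES):
        return bridge_strength  # cross-modal bridge stream
    return audio_strength       # audio_attn1 / audio_attn2 / audio_ff stream


# Hypothetical key names, for illustration only.
bridge_key = "diffusion_model.blocks.0.audio_to_video_attn.to_q.lora_A.weight"
audio_key = "diffusion_model.blocks.0.audio_attn1.to_q.lora_A.weight"
print(strength_for_key(bridge_key, audio_strength=0.5, bridge_strength=0.7))  # 0.7
print(strength_for_key(audio_key, audio_strength=0.5, bridge_strength=0.7))   # 0.5
```

With the audio-only checkpoint there are no bridge keys, so only audio_strength has any effect.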

Using without the custom nodes

You do not strictly need this repo: our loader and guide are instrumented twins of stock nodes, so the core path runs on stock ComfyUI-LTXVideo + core ComfyUI. This stock path is untested for this model and may not behave the same way as the custom nodes. The attach and the LoRA load are the same code underneath, so it should match, but if it appears to do nothing (no change to the generated audio), the usual cause is that the native loader silently no-op'd a LoRA that bound nothing and gave no warning. So if you go this route: make sure the LoRA is loaded onto LTX-2.3-22B distilled (the exact base it was trained on) and confirm it actually has an effect. Our loader fails loud when nothing binds; the native one does not. With that said:

  1. Load the LoRA with LTX IC-LoRA Loader (Model Only) (LTXICLoRALoaderModelOnly), strength ~0.5–0.9. One strength applies to all modules, so you lose the cross-modal bridge's separate strength (the only thing the per-stream loader adds).
  2. Load the audio VAE (LTXV Audio VAE Loader) and encode the reference (LTXV Audio VAE Encode). Trim your clip to ~3.5 s first; stock has no windowing.
  3. Attach it with LTXV Set Audio Ref Tokens (LTXVSetAudioRefTokens) onto the positive + negative conditioning. This is byte-identical to our guide's attach; the model applies the negative-RoPE offset either way.
  4. Generate audio + video from empty latents (Empty LTXV Latent Video + LTXV Empty Latent Audio) with a neutral caption, CFG = 1, and the usual LTX-2 distilled sampler.

Two things to watch for parity: (a) the reference encode should run in fp32. ComfyUI does this automatically for the LTX audio VAE (it forces float32; the bf16 weights are upcast), so there is nothing to set in-graph; you only need to ensure fp32 yourself if you encode the reference outside ComfyUI. (b) Keep the reference ~3.5 s (the training length). What you give up vs the custom nodes: the trust gate (stock silently no-ops on the wrong base, so confirm the LoRA actually has an effect), the per-stream bridge strength, and the reference window/shaper.
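
Since the stock path has no windowing, the reference has to be trimmed before it reaches the graph. A minimal sketch using only Python's standard wave module is below; it assumes an uncompressed PCM WAV input, and the function name and the default start offset are ours. Other formats would need a different tool.

```python
import wave


def trim_wav(src_path, dst_path, seconds=3.5, start=0.0):
    """Copy `seconds` of audio starting at `start` from a PCM WAV file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(min(int(start * rate), src.getnframes()))
        frames = src.readframes(int(seconds * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # frame count in the header is fixed up on close
        dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

For example, `trim_wav("voice.wav", "voice_3p5s.wav")` keeps the first 3.5 s; pass `start` to pick a different window.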

Examples

Six single takes: three references (a dialogue clip, a music clip, and a spoken-word clip), each fed into both checkpoints. For the first two the seed is fixed across the pair, so the checkpoint (and its strength) is the only thing that changes, giving a like-for-like look; the spoken-word pair uses different seeds. Audio-only input (no image or video; the video is generated from noise). Each clip carries its generated audio (use the player controls). The references are kept generic on purpose.

T2V prompt (all six): a fixed restaurant-dialogue scene beginning "In a medium shot at a dim restaurant table, an animated man leans in, grinning, and says ...".

Reference 1: a dialogue clip (warning: profanity).

Audio-only checkpoint (audio_only_step_01000), audio strength 0.9:

Cross-modal checkpoint (cross_modal_step_01000), audio strength 0.9, bridge strength 0.5 (per-stream loader):

Reference 2: a music clip (warning: profanity).

Audio-only checkpoint (audio_only_step_01000), audio strength 0.75:

Cross-modal checkpoint (cross_modal_step_01000), strength 0.75:

Reference 3: a spoken-word clip (warning: profanity).

Audio-only checkpoint (audio_only_step_01000), audio strength 0.6:

Cross-modal checkpoint (cross_modal_step_01000), audio strength 0.6, bridge strength 0.9 (per-stream loader):

Training

| Parameter | Value |
| --- | --- |
| Base model | LTX-2.3-22B distilled (ltx-2.3-22b-distilled-1.1) |
| Framework | ltx-trainer (Lightricks), audio-only fork |
| Strategy | IC-LoRA (audio_reference) |
| LoRA rank / alpha / dropout | 32 / 32 / 0.05 |
| Target modules (cross-modal) | audio_attn1, audio_attn2 (to_k/q/v/out), audio_ff.net.0.proj, audio_ff.net.2, audio_to_video_attn, video_to_audio_attn (all blocks) |
| Target modules (audio-only) | the same, minus the two cross-modal bridges |
| Learning rate / schedule | 2e-4, cosine with 50-step warmup (eta_min 2e-5), adamw8bit |
| Mixed precision / quant | bf16, int8-quanto + block-swap (fits a single 24 GB 4090) |
| Steps / batch | 1000 / 1 (gradient checkpointing) |
| Resolution | 512x512, 121 frames, 25 fps |
| Reference | 3.5 s voice window from a different clip of the same speaker, appended clean at negative RoPE positions |
| First-frame conditioning | 0.0 (the reference is the only non-caption input) |
| Held-out validation | by identity; held-out loss + a reference-attribution gap monitored during training |
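
To make the learning-rate row concrete, here is a sketch of a cosine schedule with warmup using the numbers from the table. The shape of the warmup (linear, ending at the peak on step 49) and the choice to run the cosine over the remaining 950 steps are assumptions; the trainer's scheduler may differ in these details.

```python
import math


def lr_at(step, total_steps=1000, warmup_steps=50, peak=2e-4, eta_min=2e-5):
    """Learning rate at a given step: linear warmup, then cosine decay to eta_min."""
    if step < warmup_steps:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (peak - eta_min) * (1 + math.cos(math.pi * progress))


for step in (0, 49, 525, 1000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate starts at 4e-6, reaches 2e-4 at the end of warmup, passes 1.1e-4 halfway through the decay, and ends at 2e-5.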

Throughput is about 11.5 s/step on a 4090; batch_size > 1 OOMs at 512x512 (block-swap is already near its floor), so single-sample steps are the practical ceiling at this resolution.
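
For planning a run, the figures above translate into wall-clock time and clip length with simple arithmetic. The sketch below only multiplies the stated numbers; it ignores validation passes, checkpoint saves, and latent precompute, so treat the result as a lower bound.

```python
STEPS = 1000
SECONDS_PER_STEP = 11.5   # stated 4090 throughput
FRAMES = 121
FPS = 25

total_seconds = int(STEPS * SECONDS_PER_STEP)
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
clip_seconds = FRAMES / FPS

print(f"training: {hours} h {minutes} min {seconds} s")  # training: 3 h 11 min 40 s
print(f"target clip length: {clip_seconds:.2f} s")       # target clip length: 4.84 s
```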

Dataset (brief)

Natural talking-head clips with audio, un-perturbed (unlike the pitch model). For each target clip the reference is a 3.5 s voice window from a different clip of the same speaker, so the reference and target share the speaker but not the content. The caption is a constant neutral string, so the voice, not the caption, is meant to carry the identity. Split is held out by identity. Full recipe: data_recipe.md.
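
The pairing and split described above can be sketched as follows. This is an illustration, not the code from data_recipe.md: the clip records, the field names, the caption string, and the validation fraction are all hypothetical placeholders.

```python
import random
from collections import defaultdict

NEUTRAL_CAPTION = "A person is talking."  # placeholder, not the actual training caption


def split_by_identity(clips, val_fraction=0.1, seed=0):
    """Hold out whole speakers, so no validation identity appears in training."""
    speakers = sorted({clip["speaker"] for clip in clips})
    random.Random(seed).shuffle(speakers)
    n_val = max(1, int(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])
    train = [c for c in clips if c["speaker"] not in val_speakers]
    val = [c for c in clips if c["speaker"] in val_speakers]
    return train, val


def build_pairs(clips, seed=0):
    """Pair each target with a reference from a different clip of the same speaker."""
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip["speaker"]].append(clip)
    rng = random.Random(seed)
    pairs = []
    for target in clips:
        others = [c for c in by_speaker[target["speaker"]] if c["id"] != target["id"]]
        if not others:
            continue  # a speaker with one clip cannot supply a separate reference
        reference = rng.choice(others)
        pairs.append({"target": target["id"], "reference": reference["id"],
                      "caption": NEUTRAL_CAPTION})
    return pairs
```

The 3.5 s window would then be cut from each chosen reference clip before encoding.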

What we learned, what we are still unsure of, and how to take it further

What seems to work

  • An audio-only in-context reference does steer a distilled, joint audio-video model: at default strength, with no special knobs, the reference moves the generated speech and the speaker's mannerisms/energy. Confirmed only as single takes so far.
  • The cross-modal bridge is the path the audio reference uses to reach the video. Adapting it (the cross-modal checkpoint) couples the audio's influence into the video more than the audio-only cut, which leaves voice-to-video to the frozen base.
  • The base generates audio and video jointly, so the video tends to follow the generated audio. The reference's clearest, most reliable effect is on the audio; the video effect rides on that coupling.

What we are still unsure of

  • Whether the narrow trained mapping (this voice -> this person's face) works cleanly, versus the broad emergent audio-to-AV transfer you see when you feed it out-of-distribution audio (songs, movie clips). The examples are the emergent kind; a clean identity test (feed a held-out speaker's voice, check whether the generated face matches that actual person) is still to do.
  • The dependence on the reference audio itself: level/loudness, quality, and content all seem to matter, and we cannot yet say how. This deserves a systematic sweep.
  • Which checkpoint (audio-only vs cross-modal) is better for which use, and the exact strength band before garbling, both of which depend on the audio.

A measurement honesty note (useful if you build observability): during training we watch a held-out loss plus a "reference-attribution gap" (loss with the correct reference minus loss with a wrong one). For an identity task that gap reads ~neutral, and that is expected, not a failure: the target video already carries the identity, so reconstruction does not need the reference. That is exactly why the real test is generation from noise (swap only the reference and watch the output), not a reconstruction-loss metric.
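
The gap itself is a one-line computation. The sketch below assumes you already have per-sample held-out losses computed twice, once with the correct reference and once with a mismatched one; the function name and the sign convention in the docstring follow the definition above, and the numbers are made up for illustration.

```python
from statistics import mean


def reference_attribution_gap(loss_correct_ref, loss_wrong_ref):
    """Mean loss with the correct reference minus mean loss with a wrong one.

    Negative means the correct reference helps reconstruction; near zero
    means reconstruction does not depend on which reference is attached.
    """
    return mean(loss_correct_ref) - mean(loss_wrong_ref)


# Made-up numbers: a near-zero gap, as expected for the identity task.
print(reference_attribution_gap([0.41, 0.39, 0.40], [0.41, 0.40, 0.40]))
```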

Where we think it could be better next time

  • The target video leaks the attribute during training, so the model has little pressure to use the reference. The most promising next lever is to mask the target face region and/or bias training toward high noise levels, forcing the model to rely on the reference. (We did not do this yet.)
  • More identity / speaker diversity, and a quantitative swap eval: fixed prompt and seed, swap only the reference, measure the change against a seed-swap noise floor, and include non-celebrity / unknown voices so a positive result cannot be the base model recalling a famous face.
  • A real reference-strength dial would need a small model-side change (the audio reference is read with no scalar applied); a timestep-range gate (apply the reference only over the noise band where it bites) works through stock ComfyUI nodes and is the cheapest thing to try.
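
One way to score the swap eval from the list above is a ratio: how far the output moves when only the reference changes, divided by how far it moves when only the seed changes. The sketch below assumes each generation has already been reduced to a feature vector by some extractor (for example a face or voice embedding); that extractor, the function names, and the use of L2 distance are assumptions, not part of this release.

```python
import math
from statistics import mean


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def swap_effect_ratio(base, reference_swaps, seed_swaps):
    """Reference-swap change relative to the seed-swap noise floor.

    base:            features for (reference A, seed s)
    reference_swaps: features for (other references, same seed s)
    seed_swaps:      features for (reference A, other seeds)
    """
    effect = mean(l2_distance(base, f) for f in reference_swaps)
    noise_floor = mean(l2_distance(base, f) for f in seed_swaps)
    if noise_floor == 0:
        raise ValueError("seed swaps produced identical features; cannot normalize")
    return effect / noise_floor
```

A ratio well above the values seen for unrelated controls would suggest the reference is doing real work; where to draw that line is itself something the eval would need to establish.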

How to pick it up (the easy button)

  • Reproduce / continue: use the fork above (audio_reference strategy), precompute latents + the reference windows, and run train.py with train_config.yaml. The two checkpoints differ only by the target-module list.
  • Adapt to a different task: the recipe is general. Pick an attribute you want the audio to control, then build pairs where the reference and target share only that attribute and differ in content, with a neutral caption. Voice/accent/emotion/genre/era/scene are all candidates. Same strategy, new data.

Citation

The trick of placing the audio reference at negative RoPE positions comes from Lightricks' LipDub and the ID-LoRA work below. Our read of how the conditioning behaves is from inspecting it in ComfyUI and the model code, not a formal ablation.

@misc{dahan2026idlora,
  title  = {ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA},
  author = {Dahan, Aviad and Yanuka, Moran and Kraicer, Noa and Wolf, Lior and Giryes, Raja},
  year   = {2026},
  eprint = {2603.10256},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD}
}

Acknowledgments

Thanks to WepeNerd (HF) for the idea of an audio-only IC-LoRA and for thinking through how an IC-LoRA learns; to the LTX-2 community trainer, Musubi Tuner, and Kijai's work for the infrastructure; and to Throttlekitty for pulling me into LTX-2.

License

Base model is by Lightricks under the LTX-2 community license. See https://github.com/Lightricks/LTX-2/blob/main/LICENSE for the full terms.
