LTX-2.3 Whisper & Soft-Spoken Audio LoRA

Base model: LTX-2.3 dev (works on distilled) · Type: Audio-style LoRA · Rank: 32


What this does

LTX-2.3 can generate dialogue, multi-speaker scenes, and full dynamic range audio including screaming, but it cannot whisper. This LoRA adds two quiet vocal registers to the model:

  • Whispering: de-voiced, breathy, close-mic delivery
  • Soft-spoken: voiced but low-volume, intimate, relaxed

The LoRA targets only the three attention modules that write to the audio branch (audio_attn1, audio_attn2, video_to_audio_attn). Because no video-branch weights are modified, video output is provably unchanged: no visual fighting, no style drift.
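One way to sanity-check the audio-only claim on a downloaded checkpoint is to partition its weight names by target module. The sketch below is a minimal illustration: the three module names come from this card, while the key layout (`transformer_blocks.N.<module>...`) is a hypothetical example and may not match the real file.

```python
# Target module names taken from the model card.
AUDIO_TARGETS = ("audio_attn1", "audio_attn2", "video_to_audio_attn")


def split_by_target(keys):
    """Partition LoRA weight names into audio-branch keys and everything else."""
    audio, other = [], []
    for key in keys:
        # Match whole dot-separated parts so that e.g. "attn1" is not mistaken
        # for "audio_attn1".
        parts = key.split(".")
        if any(target in parts for target in AUDIO_TARGETS):
            audio.append(key)
        else:
            other.append(key)
    return audio, other


# Hypothetical key names for illustration; a real checkpoint may differ.
example_keys = [
    "transformer_blocks.0.audio_attn1.to_q.lora_A.weight",
    "transformer_blocks.0.audio_attn2.to_k.lora_B.weight",
    "transformer_blocks.0.video_to_audio_attn.to_v.lora_A.weight",
]

audio_keys, other_keys = split_by_target(example_keys)
print(len(audio_keys), len(other_keys))  # prints "3 0"
```

An empty second list would be consistent with the LoRA leaving the video branch alone.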


Usage

Load at strength 1.0. The register is controlled entirely by the manner keyword in your prompt, so no special strength tuning is needed.

Trigger words

Register       Female                      Male
Whispering     (woman, whispering)         (man, whispering quietly)
Soft-spoken    (woman, speaking softly)    (man, speaking softly)

Note: The male whisper needs the extra word "quietly" to tip the model into a true whisper. (man, whispering) alone produces soft-spoken delivery.

Prompt format

Follow the LTX-2.3 dialogue caption style:

a [scene description], ([gender], [manner]): "[what they say]"

Examples:

a woman sitting close to a microphone in warm dim lighting, (woman, whispering): "close your eyes and listen"

a man at a desk late at night, (man, speaking softly): "I've been thinking about this all day"

a woman doing a skincare routine, (woman, whispering quietly): "this is my favourite step"
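The caption format and trigger words above can be sketched as a small helper. This is only an illustration: `TRIGGERS` and `build_prompt` are hypothetical names, not part of any LTX tooling, and the keyword table simply mirrors the one in this card.

```python
# Manner keywords copied from the trigger-word table in this card.
TRIGGERS = {
    ("woman", "whispering"): "whispering",
    ("man", "whispering"): "whispering quietly",  # male whisper needs "quietly"
    ("woman", "soft-spoken"): "speaking softly",
    ("man", "soft-spoken"): "speaking softly",
}


def build_prompt(scene, gender, register, line):
    """Assemble a caption of the form: scene, (gender, manner): "line"."""
    manner = TRIGGERS[(gender, register)]  # KeyError for unknown combinations
    return f'{scene}, ({gender}, {manner}): "{line}"'


print(build_prompt(
    "a man at a desk late at night",
    "man",
    "soft-spoken",
    "I've been thinking about this all day",
))
```

Keeping the keywords in one table makes it harder to forget the extra "quietly" on the male whisper.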

Without manner keywords

Using the LoRA without any manner keyword defaults to soft-spoken: a subtle volume-softening effect on whatever the base model would have generated. Useful as a gentle "quieter audio" modifier.


Examples

[Sample clips: female, whispering and soft-spoken; male, soft-spoken]

What it can't do

  • No intra-clip register mixing. You can't have one character whisper and another speak normally in the same clip. The register applies to the whole generation. For mixed-register dialogue, generate each part separately and cut them together.
  • No magic above the vocoder ceiling. The audio chain passes through a mel-spectrogram bottleneck, so the high-frequency energy of a breathy whisper gets partially smoothed. Expect intimate and quiet, not studio-crisp ASMR.
  • Video is untouched by design. If you want the visuals to also feel ASMR (soft lighting, close-up framing), describe that in the scene prompt; the LoRA won't help or hurt.
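For the mixed-register workaround in the first point, separately generated clips can be joined with ffmpeg's concat demuxer. The helper below is a minimal sketch: `concat_command` and all file names are hypothetical, paths are assumed to contain no single quotes, and stream copy (`-c copy`) is assumed to work because the clips come from the same pipeline settings. If the clips differ in codec or resolution, re-encoding would be needed instead.

```python
from pathlib import Path


def concat_command(clips, list_path, output_path):
    """Write an ffmpeg concat-demuxer list file and return the ffmpeg argv."""
    # The concat demuxer expects one "file '<path>'" entry per line.
    lines = [f"file '{clip}'" for clip in clips]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return [
        "ffmpeg",
        "-f", "concat",      # read the input as a concat list
        "-safe", "0",        # allow absolute paths in the list file
        "-i", str(list_path),
        "-c", "copy",        # join without re-encoding
        str(output_path),
    ]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "clips.txt"
        # Hypothetical clip names: one generated with the LoRA, one without.
        cmd = concat_command(["whisper_part.mp4", "normal_part.mp4"], list_file, "scene.mp4")
        print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` once the clips exist. Relative entries in the list file are resolved relative to the list file's own directory.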

Training details

Base model             LTX-2.3 dev
Epochs                 ~27
Steps                  2000
Rank / Alpha           32 / 32
Target modules         audio_attn1, audio_attn2, video_to_audio_attn
Training resolution    192×192, 97 frames (~4s @ 24fps)
Dataset                74 clips, 8 voices (4F / 4M), 2 registers each
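The figures in the table are mutually consistent, which a few lines of arithmetic can confirm. Two things below are assumptions rather than statements from this card: a batch size of 1, and the common LoRA convention of scaling the update by alpha / rank.

```python
# Figures from the training table.
steps = 2000
clips = 74
frames = 97
fps = 24
rank, alpha = 32, 32

# Assumption (not stated in the card): batch size 1, so one step uses one clip.
batch_size = 1
epochs = steps * batch_size / clips

# Clip duration implied by the frame count and frame rate.
duration_s = frames / fps

# Assumption: the common LoRA convention of scaling the update by alpha / rank.
lora_scale = alpha / rank

print(f"epochs ~ {epochs:.1f}, duration ~ {duration_s:.2f} s, scale = {lora_scale}")
# prints "epochs ~ 27.0, duration ~ 4.04 s, scale = 1.0"
```

Under those assumptions, 2000 steps over 74 clips gives roughly 27 passes through the data, and a rank equal to alpha leaves the learned update unscaled at strength 1.0.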

Clips were 4-second segments sourced from ASMR content across 8 speakers: 4 female (2 soft-spoken, 2 whisper) and 4 male (2 soft-spoken, 2 whisper). Captions were built from Whisper ASR transcripts in (gender, manner): "transcript" format.

Model repository: MLXBits/ltx2.3-whisper-softly-spoken