LongCat-AudioDiT Env-TTS: augment (10,000-step fine-tune)

Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio, a reference speaker audio, and three text streams (env caption / speaker caption / target speech text), generate target speech that places the target text in the referenced environment with the referenced speaker timbre.

This augment variant adds environment-consistent augmentation so the generated target lives in the referenced acoustic scene.

Differences from the base model

Six learnable boundary tokens (three latent-space, three text-space):

latent sequence : [<boe>  z_env  <bos>  z_spk  <bon>  z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]

encode_multistream_text(...) is the entry point; AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent.
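To make the sequence layout above concrete, here is a minimal sketch that only tracks positions, not real embeddings. The helper name `build_layout` and the per-stream lengths are hypothetical and are not part of this repo's API. It assumes each boundary token occupies exactly one position ahead of its stream, as the diagram suggests.

```python
# Hypothetical helper: shows where each stream lands in the assembled sequence.
# It does not touch the model; real code would concatenate embedding tensors.

def build_layout(n_env, n_spk, n_target, tokens=("<boe>", "<bos>", "<bon>")):
    """Return (layout, spans) for [<boe> env <bos> spk <bon> target]."""
    layout = []
    spans = {}
    streams = (("env", n_env), ("spk", n_spk), ("target", n_target))
    for token, (name, length) in zip(tokens, streams):
        layout.append(token)              # one learnable boundary token
        start = len(layout)
        layout.extend([name] * length)    # placeholder for that stream's frames
        spans[name] = (start, start + length)
    return layout, spans


layout, spans = build_layout(n_env=2, n_spk=3, n_target=4)
print(layout)
print(spans)
```

The same function describes the text sequence if the token names are swapped for `<boe_t>`, `<bos_t>`, `<bon_t>`. The `target` span is the part of the sequence a caller would typically slice out as the generated speech.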

Training summary

| Field | Value |
|---|---|
| Steps | 10,000 |
| Hardware | 1× RTX PRO 6000 Blackwell (96 GB), bf16 |
| Effective batch | 16 × grad_accum 2 × 1 GPU = 32 rows / step |
| Learning rate | cosine 5e-5 (warmup 250) |
| AdamW | β₁=0.9, β₂=0.999, wd=0.01 |
| EMA | disabled |
| LoRA | r=32, alpha=32, target = attn + ffn |
| Full-train | boundary tokens + AdaLN + text_conv + latent / latent_cond / input embeds + output_proj + time_embed |
| Data | ChristianYang/Env-TTS-Clean |
| Audio | target ∈ [3, 15] s; three-stream RMS-norm to −23 dBFS; peak-clip at 0.5 |
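The audio normalisation in the last row can be sketched as follows, which may help when preparing reference clips at a level similar to the training data. This is an illustration under stated assumptions, not the training repo's code. The function names are hypothetical, samples are assumed to be floats in [−1, 1], and "peak-clip at 0.5" is read here as a hard clip to ±0.5 after the gain is applied. A real pipeline would use NumPy or PyTorch rather than Python lists.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)


def normalize_stream(samples, target_dbfs=-23.0, peak=0.5):
    """Scale to the target RMS level, then hard-clip (assumed reading)."""
    gain = 10.0 ** ((target_dbfs - rms_dbfs(samples)) / 20.0)
    return [max(-peak, min(peak, s * gain)) for s in samples]


# Usage: the same normalisation is applied to each of the three streams.
sr = 24_000
quiet_tone = [0.01 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
env = normalize_stream(quiet_tone)
print(rms_dbfs(env))
```

Clipping after the gain lowers the RMS of very peaky signals slightly below the target, so the measured level should be checked if exact loudness matters.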

Augmentation (the augment change)

Noise + RIR are streamed on-demand from ChristianYang/DNS-Noise (DNS-Challenge noise_fullband + impulse_responses, republished as 24 kHz mono):

  • Speaker ref: an independent 50/25/25 draw over clean / noise / noise+RIR, with SNR ∈ [−5, 15] dB.
  • Env + target (coupled): a separate 50/25/25 draw; the same noise clip and the same RIR are applied to both env and target, placing the generated target in one consistent acoustic scene. The RIR tail is kept; env/target are capped to 15 s.
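The two draws described above can be sketched as below. All function names are hypothetical and this is not the training repo's implementation. The sketch assumes the SNR is sampled uniformly, that the coupled env/target draw uses the same SNR range as the speaker draw, and that "RIR tail is kept" means the full convolution output is retained up to the length cap. Plain Python lists stand in for audio arrays.

```python
import math
import random

MODES = ("clean", "noise", "noise+rir")
WEIGHTS = (0.50, 0.25, 0.25)


def draw_plan(rng, n_noise, n_rir, snr_range=(-5.0, 15.0)):
    """One 50/25/25 draw: which mode, which noise clip / RIR, which SNR."""
    mode = rng.choices(MODES, weights=WEIGHTS, k=1)[0]
    plan = {"mode": mode, "noise_idx": None, "rir_idx": None, "snr_db": None}
    if mode != "clean":
        plan["noise_idx"] = rng.randrange(n_noise)
        plan["snr_db"] = rng.uniform(*snr_range)  # assumption: uniform in dB
    if mode == "noise+rir":
        plan["rir_idx"] = rng.randrange(n_rir)
    return plan


def draw_row_plans(rng, n_noise, n_rir):
    """Speaker gets its own draw; env and target share a single draw."""
    speaker = draw_plan(rng, n_noise, n_rir)
    scene = draw_plan(rng, n_noise, n_rir)
    return {"speaker": speaker, "env": scene, "target": scene}


def mix_at_snr(signal, noise, snr_db):
    """Add noise scaled so signal power / noise power matches snr_db.

    Assumes equal-length inputs; zip() truncates to the shorter one.
    """
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = max(sum(n * n for n in noise) / len(noise), 1e-12)
    gain = math.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


def apply_rir_keep_tail(signal, rir, max_len):
    """Full convolution (reverb tail kept), then capped to max_len samples."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out[:max_len]


rng = random.Random(0)
plans = draw_row_plans(rng, n_noise=100, n_rir=20)
print(plans["speaker"]["mode"], plans["env"]["mode"])
```

Returning the same plan object for `env` and `target` is what keeps the two streams in one acoustic scene. At 24 kHz the 15 s cap corresponds to `max_len = 15 * 24_000` samples.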

How to load

Uses custom code in this repo, so pass trust_remote_code=True:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-augment",
    trust_remote_code=True,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

See the training repo's tasks/inference.py for end-to-end env-tts inference.

License

Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.
