LongCat-AudioDiT Env-TTS: augment (10,000-step fine-tune)
Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio, a reference speaker audio, and three text streams (env caption / speaker caption / target speech text), generate target speech that places the target text in the referenced environment with the referenced speaker timbre.
This augment variant adds environment-consistent augmentation so the
generated target lives in the referenced acoustic scene.
Differences from the base model
Six learnable boundary tokens (three latent-space, three text-space):
latent sequence : [<boe> z_env <bos> z_spk <bon> z_target]
text sequence : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
encode_multistream_text(...) is the entry point; AudioDiTModel.forward(...)
also accepts a pre-assembled prompt_latent.
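
The layout above can be illustrated with a small, framework-free sketch. It only shows how one boundary token is placed in front of each stream. The helper name `assemble_stream` and the toy 2-dimensional vectors are assumptions made for illustration; they are not the repo's API, and the real model operates on learned tensors.

```python
from typing import List

Vector = List[float]


def assemble_stream(boundaries: List[Vector], segments: List[List[Vector]]) -> List[Vector]:
    """Place one boundary vector before each segment: [b0, *s0, b1, *s1, ...].

    Hypothetical helper for illustration only; not part of the model repo.
    """
    if len(boundaries) != len(segments):
        raise ValueError("need exactly one boundary token per segment")
    sequence: List[Vector] = []
    for boundary, segment in zip(boundaries, segments):
        sequence.append(boundary)
        sequence.extend(segment)
    return sequence


# Toy stand-ins for the three latent-space boundary tokens.
boe, bos, bon = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Toy latent frames for each stream (lengths are arbitrary).
z_env = [[0.1, 0.1]] * 3
z_spk = [[0.2, 0.2]] * 2
z_target = [[0.3, 0.3]] * 4

# [<boe> z_env <bos> z_spk <bon> z_target]
latent_sequence = assemble_stream([boe, bos, bon], [z_env, z_spk, z_target])
```

The text sequence follows the same pattern with `<boe_t>`, `<bos_t>`, `<bon_t>` and the three text embeddings in place of the latents.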
Training summary
| Field | Value |
|---|---|
| Steps | 10,000 |
| Hardware | 1× RTX PRO 6000 Blackwell (96 GB), bf16 |
| Effective batch | 16 × grad_accum 2 × 1 GPU = 32 rows / step |
| Learning rate | cosine 5e-5 (warmup 250) |
| AdamW | β₁=0.9, β₂=0.999, wd=0.01 |
| EMA | disabled |
| LoRA | r=32, alpha=32, target = attn + ffn |
| Full-train | boundary tokens + AdaLN + text_conv + latent / latent_cond / input embeds + output_proj + time_embed |
| Data | ChristianYang/Env-TTS-Clean |
| Audio | target ∈ [3, 15] s; three-stream RMS-norm to −23 dBFS; peak-clip at 0.5 |
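
As a worked example of the batch and learning-rate rows, the sketch below computes the effective batch size and a cosine schedule with warmup. The table gives only the peak rate (5e-5), the warmup length (250 steps), and the total step count. The linear warmup shape and the decay to zero are assumptions, so the actual training schedule may differ.

```python
import math

PER_DEVICE_BATCH = 16
GRAD_ACCUM = 2
NUM_GPUS = 1
PEAK_LR = 5e-5
WARMUP_STEPS = 250
TOTAL_STEPS = 10_000

# 16 x 2 x 1 = 32 rows per optimizer step, as in the table.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS


def learning_rate(step: int) -> float:
    """Cosine schedule with warmup.

    Assumptions (not stated in the model card): warmup is linear from 0,
    and the cosine phase decays to 0 at TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 125, 250, 5_125, 10_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at step 250 and falls to half the peak midway through the cosine phase (step 5,125).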
Augmentation (the augment change)
Noise and room impulse responses (RIRs) are streamed on demand from
ChristianYang/DNS-Noise
(DNS-Challenge noise_fullband + impulse_responses, republished as 24 kHz mono):
- Speaker ref: an independent 50/25/25 draw: clean / noise / noise+RIR, SNR ∈ [−5, 15] dB.
- Env + target (coupled): a separate 50/25/25 draw whose same noise clip and same RIR are applied to both env and target, placing the generated target in one consistent acoustic scene. The RIR tail is kept; env/target are capped to 15 s.
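
The two draws described above can be sketched as follows. This is a minimal illustration, not the training code: `AugmentPlan`, `draw_plan`, and `noise_gain` are hypothetical names, clips are identified by index only, and it assumes that the coupled draw uses the same SNR range as the speaker draw and shares one SNR between env and target (the card states only that the noise clip and RIR are shared).

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional

MODES = ["clean", "noise", "noise+rir"]
MODE_WEIGHTS = [0.50, 0.25, 0.25]
SNR_RANGE_DB = (-5.0, 15.0)


@dataclass(frozen=True)
class AugmentPlan:
    """Which augmentation to apply; hypothetical container for illustration."""
    mode: str
    noise_index: Optional[int]
    rir_index: Optional[int]
    snr_db: Optional[float]


def draw_plan(rng: random.Random, num_noise: int, num_rir: int) -> AugmentPlan:
    """One 50/25/25 draw over clean / noise / noise+RIR."""
    mode = rng.choices(MODES, weights=MODE_WEIGHTS, k=1)[0]
    if mode == "clean":
        return AugmentPlan(mode, None, None, None)
    noise_index = rng.randrange(num_noise)
    snr_db = rng.uniform(*SNR_RANGE_DB)
    rir_index = rng.randrange(num_rir) if mode == "noise+rir" else None
    return AugmentPlan(mode, noise_index, rir_index, snr_db)


def noise_gain(speech: List[float], noise: List[float], snr_db: float) -> float:
    """Gain for `noise` so that speech power / scaled noise power equals snr_db."""
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent")
    return math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))


rng = random.Random(0)
# Speaker reference: its own independent draw.
speaker_plan = draw_plan(rng, num_noise=100, num_rir=50)
# Env + target: one separate draw, reused for both streams so they share a scene.
scene_plan = draw_plan(rng, num_noise=100, num_rir=50)
env_plan = target_plan = scene_plan
```

Reusing a single plan object for env and target is what makes the augmentation coupled: both streams receive the same noise clip and the same RIR, while the speaker reference is perturbed independently.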
How to load
Uses custom code in this repo, so pass trust_remote_code=True:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-augment",
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)
See the training repo's tasks/inference.py for end-to-end env-tts inference.
License
Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.