🤖 RoboVoice Converter
Turn any speech into a clean, characterful robot voice by translating between two different neural audio codecs and decoding through the "wrong" one on purpose.
It chains three models into one audio-to-audio pipeline:
input audio ──► DramaBox / LTX-2.3 audio VAE encoder ──► DramaBox latent (25 Hz, 128-d)
──► translator (CNN, this repo) ──► DACVAE latent (25 Hz, 128-d)
──► DACVAE decoder ──► 48 kHz "robo voice" output
The robo timbre is emergent. The translator learns to map one codec's latent space to the other's, but the DramaBox side is band-limited (16 kHz sampling rate, so at most 8 kHz of audio bandwidth) while DACVAE renders at 48 kHz. The decoder therefore reconstructs a clean, slightly synthetic, instantly recognizable robot voice.
Samples (original → robo)
| # | Input (original speech) | 🤖 Robo output |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | | |
(Source clips: held-out English podcast speech.)
How it was made
The idea. Two open neural audio codecs happen to share an identical latent grid: 25 Hz, 128 dimensions per frame.
- DramaBox (Resemble AI's expressive TTS, an IC-LoRA on LTX-2.3's audio branch) uses a 16 kHz mel-spectrogram VAE.
- DACVAE (facebook/dacvae-watermarked) is a 48 kHz waveform VAE with a high-quality decoder.
Because the two latent spaces are frame-aligned, you can train a small network to translate one into the other, frame-for-frame.
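To make the frame alignment concrete, the sketch below works out how many audio samples one 25 Hz latent frame covers on each side. It uses only the rates quoted above; the helper names are illustrative and not part of this repo, and any padding the codecs apply at clip boundaries is ignored.

```python
LATENT_RATE_HZ = 25      # shared latent frame rate of both codecs
DRAMABOX_SR = 16_000     # DramaBox / LTX-2.3 audio VAE input rate
DACVAE_SR = 48_000       # DACVAE output rate


def samples_per_frame(sample_rate: int, latent_rate: int = LATENT_RATE_HZ) -> int:
    """Audio samples covered by one latent frame."""
    if sample_rate % latent_rate:
        raise ValueError("sample rate is not a multiple of the latent rate")
    return sample_rate // latent_rate


def latent_frames(duration_s: float, latent_rate: int = LATENT_RATE_HZ) -> int:
    """Whole latent frames in a clip (ignores codec padding)."""
    return int(duration_s * latent_rate)


print(samples_per_frame(DRAMABOX_SR))  # 640
print(samples_per_frame(DACVAE_SR))    # 1920
print(latent_frames(4.0))              # 100
```

Both codecs emit the same number of frames for the same clip, so frame i on one side describes the same 40 ms of audio as frame i on the other.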
The data. Starting from TTS-AGI/podcast-tokenized-bg3.5-enj5 (podcast audio pre-encoded as DACVAE latents), each sample is turned into a training pair:
DACVAE latent (target) ──DACVAE.decode──► 48 kHz wav ──DramaBox VAE encode──► DramaBox latent (input)
20 k paired latents were built this way (the pipeline runs at ~9 samples/s on 2× RTX 3090). The pairs are released at TTS-AGI/podcast-dramabox-dacvae-pairs.
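The pair-building step can be sketched as a small generator. `dacvae_decode` and `dramabox_encode` are hypothetical callables standing in for the two codecs; their real APIs live in the upstream repos and are not reproduced here. Skipping pairs whose frame counts differ is also an assumption of this sketch, not a documented step.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Latent = Sequence[Sequence[float]]   # frames x 128, framework-agnostic stand-in
Audio = Sequence[float]              # mono waveform


def build_pairs(
    dacvae_latents: Iterable[Latent],
    dacvae_decode: Callable[[Latent], Audio],    # hypothetical: latent -> 48 kHz wav
    dramabox_encode: Callable[[Audio], Latent],  # hypothetical: wav -> DramaBox latent
) -> Iterator[Tuple[Latent, Latent]]:
    """Yield (input, target) = (DramaBox latent, DACVAE latent) pairs."""
    for target in dacvae_latents:
        wav = dacvae_decode(target)        # render the stored latent to audio
        source = dramabox_encode(wav)      # re-encode with the other codec
        if len(source) != len(target):     # both grids are 25 Hz, so counts should match
            continue                       # assumption: drop pairs that are not aligned
        yield source, target
```

Passing the codecs in as callables keeps the sketch independent of either library and makes the loop easy to test with stubs.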
The training trick. Regressing the target latent with MSE alone reached a log-mel correlation of only ≈ 0.69: the output was recognizable but distorted, because small latent errors blow up into audible magnitude errors. The fix that mattered was a decode-space loss: decode the predicted latent through DACVAE inside the training loop and apply a multi-resolution STFT + mel loss against the target audio (plus a light latent-L1 anchor). That raised the log-mel correlation to ≈ 0.97.
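The shape of such a decode-space loss can be illustrated without any framework. The sketch below is a dependency-free toy: it uses a naive DFT, tiny FFT sizes, and a hypothetical `decode` callable, and it omits the mel term. The FFT sizes and the `latent_weight` value are assumptions, not the settings used for the shipped model, and a real training loop would need a differentiable STFT from its training framework.

```python
import cmath
import math
from typing import List, Sequence


def stft_magnitudes(x: Sequence[float], n_fft: int, hop: int) -> List[List[float]]:
    """Magnitude STFT with a Hann window (naive DFT, for illustration only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def multires_stft_loss(pred, target, fft_sizes=(16, 32, 64)) -> float:
    """Mean L1 distance between log-magnitude spectrograms over several FFT sizes."""
    total = 0.0
    for n_fft in fft_sizes:
        p = stft_magnitudes(pred, n_fft, n_fft // 4)
        t = stft_magnitudes(target, n_fft, n_fft // 4)
        diffs = [abs(math.log(a + 1e-5) - math.log(b + 1e-5))
                 for fp, ft in zip(p, t) for a, b in zip(fp, ft)]
        total += sum(diffs) / len(diffs)
    return total / len(fft_sizes)


def decode_space_loss(pred_latent, target_latent, target_audio, decode,
                      latent_weight=0.1) -> float:
    """STFT loss on decoded audio plus a light latent-L1 anchor."""
    audio_term = multires_stft_loss(decode(pred_latent), target_audio)
    flat_p = [v for frame in pred_latent for v in frame]
    flat_t = [v for frame in target_latent for v in frame]
    latent_term = sum(abs(a - b) for a, b in zip(flat_p, flat_t)) / len(flat_p)
    return audio_term + latent_weight * latent_term
```

The point of the structure is that the audio term penalizes errors where they are audible, while the small latent term keeps predictions close to the target codec's latent distribution.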
Architecture. A dilated 1-D residual CNN (≈16 M params) beat both a small Transformer and a larger Transformer trained from scratch under the decode-space loss. The mapping is mostly a local, per-frame spectral transform, which large dilated convolutions capture well.
| translator | log-mel corr |
|---|---|
| latent-MSE (transformer) | 0.69 |
| decode-loss transformer (small) | 0.96 |
| decode-loss CNN (shipped here) | 0.97 |
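One way to relate the "local mapping" observation to the CNN is to compute how much context a dilated stack sees per output frame. The kernel size and dilation schedule below are illustrative assumptions, not the shipped configuration.

```python
def receptive_field(kernel_size: int, dilations) -> int:
    """Frames seen by one output frame of stacked stride-1 dilated 1-D convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


LATENT_RATE_HZ = 25
dilations = [1, 2, 4, 8, 16]             # hypothetical schedule, one conv per block
frames = receptive_field(3, dilations)   # 1 + 2 * (1 + 2 + 4 + 8 + 16) = 63
print(frames, "frames =", frames / LATENT_RATE_HZ, "s of context")
```

Under these assumed settings, five layers already cover about 2.5 s of audio around each frame, which is plenty for a transform that depends mainly on nearby frames.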
Usage
pip install -r requirements.txt
git clone https://github.com/kadirnar/fast-dacvae # DACVAE codec
git clone https://github.com/resemble-ai/DramaBox # DramaBox / LTX-2.3 audio VAE
# place the DramaBox audio components checkpoint at weights/dramabox-audio-components.safetensors
# (auto-downloads facebook/dacvae-watermarked on first run)
python convert.py input.wav robo_output.wav
from robovoice import RoboVoice
import soundfile as sf
rv = RoboVoice() # loads CNN translator + DACVAE + DramaBox encoder
wav, sr = sf.read("input.wav")
out, osr = rv.convert(wav, sr) # -> 48 kHz robo voice
sf.write("robo.wav", out, osr)
Components & licenses
- Translator (this repo, weights/cnn_dec.pt): the only component trained here.
- DACVAE decoder: facebook/dacvae-watermarked.
- DramaBox / LTX-2.3 audio VAE encoder: ResembleAI/Dramabox (LTX-2 Community License).
Respect the upstream licenses of both codecs.
Limitations
- In-domain input is mono English speech (the translator was trained on podcasts). Music, noise, and other languages may degrade the output.
- The "robo" character is intrinsic to the method (band gap + cross-codec translation): it cleans up vocoder-style artifacts but does not recover true >8 kHz detail.