🤖 RoboVoice Converter

Turn any speech into a clean, characterful robot voice — by translating between two different neural audio codecs and decoding through the "wrong" one on purpose.

It chains three models into one audio-to-audio pipeline:

input audio ──► DramaBox / LTX-2.3 audio VAE encoder ──► DramaBox latent (25 Hz, 128-d)
            ──► translator (CNN, this repo)            ──► DACVAE latent  (25 Hz, 128-d)
            ──► DACVAE decoder                         ──► 48 kHz "robo voice" output

The robo timbre is emergent. The translator learns to map one codec's latent space to the other's, but the DramaBox side is band-limited (16 kHz sample rate, so at most 8 kHz of audio bandwidth) while DACVAE renders at 48 kHz. The decoder therefore reconstructs a clean, slightly synthetic, instantly recognizable robot voice.
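The band gap behind the robo timbre is plain Nyquist arithmetic. A minimal sketch, assuming only the two sample rates quoted above (the helper name is illustrative):

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


encoder_limit = nyquist_hz(16_000)   # DramaBox side
decoder_limit = nyquist_hz(48_000)   # DACVAE side
seen_fraction = encoder_limit / decoder_limit  # share of the output band with input evidence

print(f"encoder sees up to {encoder_limit:.0f} Hz")
print(f"decoder renders up to {decoder_limit:.0f} Hz")
print(f"band with no input evidence: {encoder_limit:.0f} to {decoder_limit:.0f} Hz")
```

Only about a third of the decoder's output band is backed by information from the input, so whatever appears above 8 kHz is supplied by the decoder rather than recovered from the speech.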

🔊 Samples (original → robo)

[Audio table: five sample pairs (1 to 5), each an input clip of original speech and its 🤖 robo output.]

(Source clips: held-out English podcast speech.)

How it was made

The idea. Two open neural audio codecs happen to share an identical latent grid — 25 Hz, 128 dimensions per frame:

DramaBox (Resemble AI's expressive TTS, an IC-LoRA on LTX-2.3's audio branch) uses a 16 kHz mel-spectrogram VAE.
DACVAE (facebook/dacvae-watermarked) is a 48 kHz waveform VAE with a high-quality decoder.

Because the two latent spaces are frame-aligned, you can train a small network to translate one into the other, frame-for-frame.
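The shared grid can be made concrete with a little arithmetic. This sketch assumes the rates quoted above (25 latent frames per second, 128 dimensions, 16 kHz and 48 kHz audio); the helper names are illustrative and not part of either codec's API, and rounding a partial frame up is an assumption about padding.

```python
import math

LATENT_RATE_HZ = 25      # latent frames per second, shared by both codecs
LATENT_DIM = 128         # values per latent frame, shared by both codecs


def samples_per_frame(sample_rate_hz: int) -> int:
    """Audio samples spanned by one latent frame at the given sample rate."""
    if sample_rate_hz % LATENT_RATE_HZ:
        raise ValueError("sample rate is not a multiple of the latent rate")
    return sample_rate_hz // LATENT_RATE_HZ


def latent_shape(duration_s: float) -> tuple[int, int]:
    """(frames, dims) for a clip; a partial last frame is rounded up (assumption)."""
    return math.ceil(duration_s * LATENT_RATE_HZ), LATENT_DIM


print(samples_per_frame(16_000))  # 640 samples of 16 kHz audio per frame
print(samples_per_frame(48_000))  # 1920 samples of 48 kHz audio per frame
print(latent_shape(10.0))         # (250, 128) on either side of the translator
```

Because both sides produce the same (frames, 128) shape for the same clip, the translator needs no resampling or alignment step along the time axis.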

The data. Starting from TTS-AGI/podcast-tokenized-bg3.5-enj5 (podcast audio pre-encoded as DACVAE latents), each sample is turned into a training pair:

DACVAE latent (target) ──DACVAE.decode──► 48 kHz wav ──DramaBox VAE encode──► DramaBox latent (input)

20,000 latent pairs were built this way (the pipeline runs at about 9 samples/s on 2× RTX 3090). The pairs are released at TTS-AGI/podcast-dramabox-dacvae-pairs.
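The pair-building step above can be sketched as a small generator. The two callables are hypothetical stand-ins for the real DACVAE decoder and DramaBox encoder (their actual APIs are not shown here), any resampling is assumed to happen inside `dramabox_encode`, and trimming both latents to a common frame count is an assumption rather than the documented pipeline.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Latent = Sequence[Sequence[float]]   # frames x 128 values
Audio = Sequence[float]              # mono samples


def build_pairs(
    dacvae_latents: Iterable[Latent],
    dacvae_decode: Callable[[Latent], Audio],      # hypothetical: latent -> 48 kHz audio
    dramabox_encode: Callable[[Audio], Latent],    # hypothetical: audio -> DramaBox latent
) -> Iterator[Tuple[Latent, Latent]]:
    """Yield (input, target) pairs: (DramaBox latent, DACVAE latent)."""
    for target in dacvae_latents:
        wav_48k = dacvae_decode(target)            # render the stored latent to audio
        source = dramabox_encode(wav_48k)          # re-encode with the other codec
        n = min(len(source), len(target))          # guard against off-by-one frame counts
        yield source[:n], target[:n]
```

Passing the codecs in as callables keeps the loop testable with stubs and independent of how each model is loaded.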

The training trick. Regressing the target latent with MSE alone reached a log-mel correlation of only ≈ 0.69: the output was recognizable but distorted, because small latent errors blow up into audible magnitude errors after decoding. The fix that mattered was a decode-space loss: decode the predicted latent through DACVAE inside the training loop and apply a multi-resolution STFT + mel loss against the target audio (plus a light latent-L1 anchor). That raised the log-mel correlation to ≈ 0.97.
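To show what a decode-space loss computes, here is a dependency-free sketch of the arithmetic. It is not the training code: a real loop would use a differentiable framework so gradients flow through the decoder, and this version omits the mel term. The FFT sizes, the spectral-convergence plus log-magnitude formulation, the `latent_weight` value, and the `decode` callable are all assumptions chosen for illustration.

```python
import cmath
import math


def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram (frames x bins) with a Hann window and a naive DFT."""
    if len(x) < n_fft:
        raise ValueError("signal shorter than one analysis frame")
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(s * cmath.exp(-2j * math.pi * k * n / n_fft) for n, s in enumerate(seg)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def multires_stft_loss(pred, target, resolutions=((32, 8), (64, 16), (128, 32)), eps=1e-7):
    """Mean over resolutions of spectral convergence + mean log-magnitude L1."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for n_fft, hop in resolutions:
        p, t = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        pairs = [(a, b) for fp, ft in zip(p, t) for a, b in zip(fp, ft)]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in pairs))
        ref = math.sqrt(sum(b ** 2 for _, b in pairs))
        log_l1 = sum(abs(math.log(a + eps) - math.log(b + eps)) for a, b in pairs) / len(pairs)
        total += diff / (ref + eps) + log_l1
    return total / len(resolutions)


def decode_space_loss(pred_latent, target_latent, target_audio, decode, latent_weight=0.1):
    """STFT loss on the decoded prediction plus a light latent-L1 anchor."""
    pred_audio = decode(pred_latent)   # hypothetical decoder callable: latent -> audio
    flat_p = [v for frame in pred_latent for v in frame]
    flat_t = [v for frame in target_latent for v in frame]
    latent_l1 = sum(abs(a - b) for a, b in zip(flat_p, flat_t)) / len(flat_t)
    return multires_stft_loss(pred_audio, target_audio) + latent_weight * latent_l1
```

The point of the design is that the error is measured where the listener hears it: two latents that are close in L1 can still decode to very different spectra, and only the audio-domain term penalizes that.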

Architecture. A dilated 1-D residual CNN (≈16 M params) beat both a small Transformer and a larger one trained from scratch under the decode-space loss. The mapping is mostly a local per-frame spectral transform, which big dilated convolutions capture well.

translator	log-mel corr
latent-MSE (transformer)	0.69
decode-loss transformer (small)	0.96
decode-loss CNN (shipped here)	0.97
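For a sense of how much context a dilated CNN sees per output frame, the receptive field of stacked stride-1 dilated convolutions is 1 + Σ (kernel − 1) · dilation. The kernel size and dilation schedule below are hypothetical, since the shipped translator's configuration is not given here; only the 25 Hz frame rate comes from the text above.

```python
LATENT_RATE_HZ = 25  # latent frames per second


def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Input frames that influence one output frame of stacked stride-1 dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical stack: kernel size 3, dilations doubling per layer, two cycles.
dilations = [1, 2, 4, 8] * 2
frames = receptive_field(3, dilations)
print(frames, "frames =", frames / LATENT_RATE_HZ, "s of context")  # 61 frames = 2.44 s of context
```

Even a shallow stack like this covers a couple of seconds of speech while every layer stays a cheap local operation, which fits a mapping that is mostly a per-frame spectral transform.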

Usage

pip install -r requirements.txt
git clone https://github.com/kadirnar/fast-dacvae      # DACVAE codec
git clone https://github.com/resemble-ai/DramaBox       # DramaBox / LTX-2.3 audio VAE
# place the DramaBox audio components checkpoint at weights/dramabox-audio-components.safetensors
# (auto-downloads facebook/dacvae-watermarked on first run)

python convert.py input.wav robo_output.wav

from robovoice import RoboVoice
import soundfile as sf
rv = RoboVoice()                 # loads CNN translator + DACVAE + DramaBox encoder
wav, sr = sf.read("input.wav")
out, osr = rv.convert(wav, sr)   # -> 48 kHz robo voice
sf.write("robo.wav", out, osr)

Components & licenses

Translator (this repo, weights/cnn_dec.pt): the only trained-here component.
DACVAE decoder: facebook/dacvae-watermarked.
DramaBox / LTX-2.3 audio VAE encoder: ResembleAI/Dramabox (LTX-2 Community License). Respect the upstream licenses of both codecs.

Limitations

In-domain input is mono English speech (the translator was trained on podcasts). Music, noise, and other languages may degrade.
The "robo" character is intrinsic to the method (band gap + cross-codec translation) — it cleans up vocoder-style artifacts but does not recover true >8 kHz detail.
