🤖 RoboVoice Converter

Turn any speech into a clean, characterful robot voice by translating between two different neural audio codecs and decoding through the "wrong" one on purpose.

It chains three models into one audio-to-audio pipeline:

input audio ──► DramaBox / LTX-2.3 audio VAE encoder ──► DramaBox latent (25 Hz, 128-d)
            ──► translator (CNN, this repo)            ──► DACVAE latent  (25 Hz, 128-d)
            ──► DACVAE decoder                         ──► 48 kHz "robo voice" output

The robo timbre is emergent: the translator learns to map one codec's latent space to the other's, but the DramaBox side is band-limited (16 kHz audio, so content only up to 8 kHz) while DACVAE renders at 48 kHz. The decoder therefore reconstructs a clean, slightly synthetic, instantly recognizable robot voice.

🔊 Samples (original → robo)

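#   Input (original speech)   🤖 Robo output
1   (audio clip)              (audio clip)
2   (audio clip)              (audio clip)
3   (audio clip)              (audio clip)
4   (audio clip)              (audio clip)
5   (audio clip)              (audio clip)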

(Source clips: held-out English podcast speech.)

How it was made

The idea. Two open neural audio codecs happen to share an identical latent grid of 25 Hz and 128 dimensions per frame:

  • DramaBox (Resemble AI's expressive TTS, an IC-LoRA on LTX-2.3's audio branch) uses a 16 kHz mel-spectrogram VAE.
  • DACVAE (facebook/dacvae-watermarked) is a 48 kHz waveform VAE with a high-quality decoder.

Because the two latent spaces are frame-aligned, you can train a small network to translate one into the other, frame-for-frame.
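As a quick check on what "frame-aligned" means in practice, the arithmetic below works out latent shapes and samples per frame from the 25 Hz, 128-d grid. The function names are illustrative, and rounding the frame count up is an assumption; the real codecs may pad or trim differently.

```python
import math

LATENT_RATE_HZ = 25   # latent frames per second (from the text above)
LATENT_DIM = 128      # channels per latent frame (from the text above)


def latent_shape(duration_s):
    """Approximate (frames, dims) of a latent for a clip of this length.

    Rounding up is an assumption; a real codec may pad differently.
    """
    return (math.ceil(duration_s * LATENT_RATE_HZ), LATENT_DIM)


def samples_per_frame(sample_rate_hz):
    """Number of audio samples covered by one latent frame."""
    return sample_rate_hz // LATENT_RATE_HZ


print(latent_shape(4.0))          # (100, 128)
print(samples_per_frame(48_000))  # 1920 samples per frame on the DACVAE side
print(samples_per_frame(16_000))  # 640 samples per frame on the DramaBox side
```

Both codecs therefore describe the same 40 ms of audio with each frame, which is why a frame-for-frame mapping is possible without any resampling in latent space.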

The data. Starting from TTS-AGI/podcast-tokenized-bg3.5-enj5 (podcast audio pre-encoded as DACVAE latents), each sample is turned into a training pair:

DACVAE latent (target) ──DACVAE.decode──► 48 kHz wav ──DramaBox VAE encode──► DramaBox latent (input)

20 k paired latents were built this way (pipeline runs at ~9 samples/s on 2× RTX 3090). The pairs are released at TTS-AGI/podcast-dramabox-dacvae-pairs.
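The pair construction can be sketched as a small function. This is a minimal sketch, not the repo's script: `dacvae_decode` and `dramabox_encode` are hypothetical callables standing in for the real codec wrappers, any resampling to 16 kHz is assumed to happen inside the encoder callable, and trimming both latents to a common length is an assumption.

```python
def build_pair(target_latent, dacvae_decode, dramabox_encode):
    """Turn one DACVAE latent into an (input, target) training pair.

    target_latent   : sequence of latent frames (the regression target)
    dacvae_decode   : hypothetical callable, latent -> 48 kHz waveform
    dramabox_encode : hypothetical callable, waveform -> DramaBox latent
    """
    wav_48k = dacvae_decode(target_latent)      # render the target as audio
    input_latent = dramabox_encode(wav_48k)     # re-encode with the other codec
    # Assumption: the two encoders may differ by a frame at the edges,
    # so both sequences are trimmed to the shorter length.
    n = min(len(input_latent), len(target_latent))
    return input_latent[:n], target_latent[:n]
```

At roughly 9 samples/s, building 20,000 pairs takes about 20,000 / 9 ≈ 2,200 s, or around 37 minutes.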

The training trick. Regressing the target latent with MSE only reached log-mel correlation ≈ 0.69. The output was recognizable but distorted, because small latent errors blow up into audible magnitude errors. The fix that mattered was a decode-space loss: decode the predicted latent through DACVAE inside the training loop and apply a multi-resolution STFT + mel loss against the target audio (plus a light latent-L1 anchor). That raised quality to log-mel correlation ≈ 0.97.
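To make the multi-resolution STFT term concrete, here is a NumPy illustration of one common formulation (spectral convergence plus log-magnitude L1, averaged over several FFT sizes). The resolutions, the exact terms, and the function names are assumptions rather than the values used in training. NumPy has no autograd, so a real training loop would need the same computation in a differentiable framework, along with the mel term and the latent-L1 anchor, which are omitted here.

```python
import numpy as np


def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multires_stft_loss(
    pred, target, resolutions=((512, 128), (1024, 256), (2048, 512)), eps=1e-7
):
    """Average of spectral-convergence + log-magnitude L1 over resolutions.

    Both signals must be at least as long as the largest n_fft.
    """
    total = 0.0
    for n_fft, hop in resolutions:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        spectral_convergence = np.linalg.norm(t - p) / (np.linalg.norm(t) + eps)
        log_magnitude = np.mean(np.abs(np.log(t + eps) - np.log(p + eps)))
        total += spectral_convergence + log_magnitude
    return total / len(resolutions)
```

Because the loss is measured on decoded audio rather than on latents, a latent error that is small numerically but large perceptually is penalized in proportion to what it does to the spectrum.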

Architecture. A dilated 1-D residual CNN (≈16 M params) beat both a small and a larger (from-scratch) Transformer under the decode-space loss. The mapping is mostly a local per-frame spectral transform, which big dilated convolutions capture well.
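For readers unfamiliar with the building block, the following NumPy sketch shows one dilated residual 1-D convolution and how the receptive field grows with dilation. The kernel size, dilation schedule, channel count, and ReLU placement are all illustrative assumptions; they are not the shipped model's hyperparameters.

```python
import numpy as np


def dilated_conv1d(x, w, dilation):
    """'Same'-padded dilated 1-D convolution over the time axis.

    x: (frames, c_in); w: (kernel, c_in, c_out) with an odd kernel size.
    """
    k = w.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for j in range(k):
        # Tap j reads the input shifted by j * dilation frames.
        out += xp[j * dilation : j * dilation + x.shape[0]] @ w[j]
    return out


def residual_block(x, w, dilation):
    """ReLU -> dilated conv, added back onto the input (skip connection)."""
    return x + dilated_conv1d(np.maximum(x, 0.0), w, dilation)


def receptive_field(kernel, dilations):
    """Frames of context seen by a stack of dilated convolutions."""
    return 1 + sum((kernel - 1) * d for d in dilations)


print(receptive_field(3, (1, 2, 4, 8)))  # 31 frames, about 1.24 s at 25 Hz
```

A handful of such blocks already covers more than a second of context while each output frame is still dominated by nearby input frames, which matches the "mostly local" character of the mapping described above.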

translator                        log-mel corr
latent-MSE (transformer)          0.69
decode-loss transformer (small)   0.96
decode-loss CNN (shipped here)    0.97
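The card does not spell out how log-mel correlation is computed. One plausible reading, sketched here in NumPy, is the Pearson correlation between the flattened log-mel spectrograms of the converted and reference audio. The filterbank construction, FFT size, hop, and number of mel bands are assumptions, so the numbers this produces are not guaranteed to match the table.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel(x, sr=48000, n_fft=2048, hop=512, n_mels=80, eps=1e-5):
    """Log-mel power spectrogram, shape (frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + eps)


def log_mel_corr(a, b, **kwargs):
    """Pearson correlation between two signals' flattened log-mel spectrograms."""
    n = min(len(a), len(b))  # compare only the overlapping part
    la = log_mel(a[:n], **kwargs).ravel()
    lb = log_mel(b[:n], **kwargs).ravel()
    return float(np.corrcoef(la, lb)[0, 1])
```

A value near 1 means the two signals have nearly the same time-frequency energy pattern; it says nothing about phase, which suits a metric for decoder output.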

Usage

pip install -r requirements.txt
git clone https://github.com/kadirnar/fast-dacvae      # DACVAE codec
git clone https://github.com/resemble-ai/DramaBox       # DramaBox / LTX-2.3 audio VAE
# place the DramaBox audio components checkpoint at weights/dramabox-audio-components.safetensors
# (auto-downloads facebook/dacvae-watermarked on first run)

python convert.py input.wav robo_output.wav

Or from Python:

from robovoice import RoboVoice
import soundfile as sf
rv = RoboVoice()                 # loads CNN translator + DACVAE + DramaBox encoder
wav, sr = sf.read("input.wav")
out, osr = rv.convert(wav, sr)   # -> 48 kHz robo voice
sf.write("robo.wav", out, osr)

Components & licenses

  • Translator (this repo, weights/cnn_dec.pt): the only trained-here component.
  • DACVAE decoder: facebook/dacvae-watermarked.
  • DramaBox / LTX-2.3 audio VAE encoder: ResembleAI/Dramabox (LTX-2 Community License). Respect the upstream licenses of both codecs.

Limitations

  • Mono, English speech in-domain (trained on podcasts). Music / noise / other languages may degrade.
  • The "robo" character is intrinsic to the method (band gap + cross-codec translation): it cleans up vocoder-style artifacts but does not recover true >8 kHz detail.
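Since the pipeline is in-domain for mono speech, stereo files are probably best downmixed before conversion. The helper below is a small sketch of that step: `to_mono` is a hypothetical name, not part of this repo, and whether `RoboVoice.convert` already handles multi-channel input is not stated here. It relies on `soundfile.read` returning multi-channel audio as a (frames, channels) array.

```python
import numpy as np


def to_mono(wav):
    """Average the channels of a (frames, channels) array.

    One-dimensional (already mono) audio is returned unchanged apart
    from the cast to float32.
    """
    wav = np.asarray(wav, dtype=np.float32)
    return wav if wav.ndim == 1 else wav.mean(axis=1)
```

With the usage snippet above, this would slot in as `out, osr = rv.convert(to_mono(wav), sr)`.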