Balacoon Discrete Vocoder

This discrete vocoder consists of both analysis and synthesis components.

  • Analysis: Converts audio into audio tokens. Each frame is encoded as four tokens, one from each of four parallel codebooks of 2,048 entries.
  • Synthesis: Converts audio tokens back into audio.

The vocoder operates on 24 kHz audio at 50 frames per second. It is designed as a middle ground between the high bitrate of EnCodec and lower-bitrate alternatives such as Mimi (12.5 frames per second) or WavTokenizer (which uses a single codebook).
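
To make the operating point concrete: 24 kHz audio at 50 frames per second, with four codebooks of 2,048 entries, implies 480 audio samples per frame and a raw token bitrate of 2.2 kbps. The sketch below reproduces that arithmetic. It assumes each token is stored with log2(2048) = 11 bits and no entropy coding. The function names are illustrative and are not part of the model's API.

```python
import math

# Figures taken from the description above.
SAMPLE_RATE_HZ = 24_000
FRAMES_PER_SECOND = 50
NUM_CODEBOOKS = 4
CODEBOOK_SIZE = 2048


def samples_per_frame(sample_rate_hz: int, frames_per_second: int) -> int:
    """Number of audio samples covered by one token frame."""
    return sample_rate_hz // frames_per_second


def bitrate_bps(frames_per_second: int, num_codebooks: int, codebook_size: int) -> float:
    """Raw token bitrate, assuming log2(codebook_size) bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frames_per_second * num_codebooks * bits_per_token


hop = samples_per_frame(SAMPLE_RATE_HZ, FRAMES_PER_SECOND)
rate = bitrate_bps(FRAMES_PER_SECOND, NUM_CODEBOOKS, CODEBOOK_SIZE)
print(f"{hop} samples per frame, {rate / 1000:.1f} kbps")
```

The same helper can be used to compare other configurations, for example a single codebook at the same frame rate, as long as the codebook size is known.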

How to Use the Vocoder:

import torch
import soundfile as sf
from huggingface_hub import hf_hub_download

# the example assumes a CUDA-capable GPU
device = torch.device('cuda')

# download and load the TorchScript encoder (analysis) and decoder (synthesis)
encoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="analysis.jit")
decoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="synthesis.jit")
encoder = torch.jit.load(encoder_path, map_location=device)
decoder = torch.jit.load(decoder_path, map_location=device)

# read the audio as 16-bit PCM
path = "input.wav"  # placeholder: replace with the path to a mono 24 kHz audio file
orig_audio_npy, sr = sf.read(path, dtype="int16")
assert sr == 24000, "the vocoder expects 24 kHz audio"
orig_audio = torch.tensor(orig_audio_npy).to(device).unsqueeze(0)  # batch x samples
# extract audio tokens from the audio
tokens = encoder(orig_audio)  # batch x frames x 4
# synthesize audio from audio tokens
resynthesized_audio = decoder(tokens)  # batch x samples

See how the codec performs on the vocoder leaderboard: TTSLeaderboard
