Balacoon Discrete Vocoder
This discrete vocoder consists of both analysis and synthesis components.
- Analysis: Converts audio into audio tokens drawn from four parallel codebooks, each containing 2,048 entries.
- Synthesis: Converts audio tokens back into audio.
The vocoder operates on 24 kHz audio at 50 frames per second. It is designed as a middle ground between the high bitrate of EnCodec and lower-bitrate alternatives such as Mimi (12.5 frames per second) or WavTokenizer (which uses a single codebook).
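To make the bitrate comparison concrete, the raw token bitrate can be worked out from the figures above (4 codebooks, 2,048 entries each, 50 frames per second, 24 kHz audio). This is a minimal sketch: the helper name `token_bitrate_bps` is made up for illustration, and the result is the uncompressed token bitrate, assuming every codebook index is stored with a fixed number of bits.

```python
import math


def token_bitrate_bps(num_codebooks: int, codebook_size: int, frames_per_second: float) -> float:
    """Raw bitrate of a discrete token stream, in bits per second.

    Each token needs log2(codebook_size) bits, and every frame carries
    one token per codebook.
    """
    bits_per_token = math.log2(codebook_size)
    return num_codebooks * bits_per_token * frames_per_second


# Figures stated in this model card.
SAMPLE_RATE = 24_000
FRAME_RATE = 50
NUM_CODEBOOKS = 4
CODEBOOK_SIZE = 2_048

bitrate = token_bitrate_bps(NUM_CODEBOOKS, CODEBOOK_SIZE, FRAME_RATE)
samples_per_frame = SAMPLE_RATE // FRAME_RATE

print(f"{bitrate / 1000:.1f} kbps")           # 2.2 kbps
print(f"{samples_per_frame} samples/frame")   # 480 samples/frame
```

Each frame therefore covers 480 audio samples (20 ms) and is described by 4 tokens of 11 bits each.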
How to Use the Vocoder:
import torch
import soundfile as sf
from huggingface_hub import hf_hub_download
device = torch.device('cuda')
# load the model
encoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="analysis.jit")
decoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="synthesis.jit")
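# map_location places the TorchScript modules on the same device as the input audio
encoder = torch.jit.load(encoder_path, map_location=device)
decoder = torch.jit.load(decoder_path, map_location=device)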
# read the audio (`path` is a placeholder; point it at your own 24 kHz WAV file)
path = "input.wav"
orig_audio_npy, sr = sf.read(path, dtype="int16")
assert sr == 24000
orig_audio = torch.tensor(orig_audio_npy).to(device).unsqueeze(0) # batch x samples
# extract audio tokens from the audio
tokens = encoder(orig_audio) # batch x frames x 4
# synthesize audio from audio tokens
resynthesized_audio = decoder(tokens) # batch x samples
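The snippet above passes the audio to the encoder as a `batch x samples` tensor, which implies a single-channel signal. For multi-channel files, `soundfile` returns an array of shape `samples x channels`, so a downmix step may be needed first. The following is a sketch under that assumption; `downmix_to_mono_int16` is a hypothetical helper, not part of the model's API, and averaging channels is only one possible way to downmix.

```python
import numpy as np


def downmix_to_mono_int16(audio: np.ndarray) -> np.ndarray:
    """Average the channels of int16 audio into a single mono channel.

    Accepts arrays shaped (samples,) or (samples, channels), the layouts
    produced by soundfile.read(..., dtype="int16").
    """
    if audio.dtype != np.int16:
        raise TypeError(f"expected int16 audio, got {audio.dtype}")
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
    # Widen before averaging so intermediate values cannot overflow int16.
    mono = audio.astype(np.int32).mean(axis=1)
    return np.round(mono).astype(np.int16)


# Small synthetic stereo signal: 3 samples, 2 channels.
stereo = np.array([[100, 200], [-300, 300], [32767, 32767]], dtype=np.int16)
mono = downmix_to_mono_int16(stereo)
print(mono.tolist())  # [150, 0, 32767]
```

In the usage example, this would be applied to `orig_audio_npy` before building the tensor.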
See how the codec performs on the vocoder leaderboard: TTSLeaderboard