VocalNet-8B Model Card
VocalNet-8B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon LLaMA-3.1-8B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs.
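The speedup comes from MTP emitting several speech tokens per decoder pass instead of one. As a rough, model-independent illustration (the token count and group size below are illustrative assumptions, not VocalNet's actual configuration), the number of forward passes shrinks roughly linearly with the number of tokens predicted per step:

```python
import math

def decoding_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Forward passes needed to emit num_tokens when each pass
    predicts tokens_per_step tokens (multi-token prediction)."""
    return math.ceil(num_tokens / tokens_per_step)

# Illustrative numbers only: a 500-token speech response.
standard = decoding_steps(500, 1)  # next-token decoding: 500 passes
mtp = decoding_steps(500, 5)       # 5 tokens per pass: 100 passes
print(standard, mtp)
```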
Paper, Code and Model Access
- arXiv: VocalNet Report
- GitHub: VocalNet Repository
- HuggingFace: VocalNet/VocalNet-8B
- ModelScope: VocalNet/VocalNet-8B
Repository Download and Environment Setup
To get started with VocalNet-8B, clone the repository and set up the environment as follows.
Clone the Repository:
git clone https://github.com/SJTU-OmniAgent/VocalNet.git
cd VocalNet
Create and Activate Environment:
conda create -n vocalnet python=3.10
conda activate vocalnet
Install Dependencies:
pip install --upgrade pip
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
Optional: Install Training Packages: If you plan to train the model, install additional packages:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
Download Instructions
Via the Hugging Face CLI:
pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-8B --local-dir ./checkpoints/
Via Snapshot Download:
pip install -U huggingface_hub
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="VocalNet/VocalNet-8B",
local_dir="./checkpoints/",
resume_download=True
)
Via Git:
git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-8B
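Whichever route you use, it is worth sanity-checking the download before running inference. The helper below is a hypothetical convenience, not part of the VocalNet repo, and the default file list is only an assumption about a typical checkpoint layout; adjust it to the actual repository contents.

```python
from pathlib import Path

def missing_files(ckpt_dir: str, required=("config.json",)) -> list:
    """Return the required files that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in required if not (root / name).is_file()]

# Hypothetical check against an assumed file list.
gaps = missing_files("./checkpoints/")
if gaps:
    print("Incomplete download, missing:", gaps)
```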
Dependencies
- Speech Encoder: Whisper-large-v3
- Vocoder: CosyVoice2-0.5B for converting speech tokens to audio waveforms.
Local Inference
To perform inference with VocalNet-8B, follow these steps to set up and run the model locally.
Model Preparation:
- Download VocalNet-8B from HuggingFace or ModelScope.
- Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the ./models/speech_encoder/ directory.
CosyVoice Preparation:
- VocalNet-8B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace.
Path Modification:
- Update the paths in omni_speech/infer/vocalnet.py to point to the downloaded models:
COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
VOCALNET_MODEL=""   # Path to VocalNet-8B, e.g., ./checkpoints/VocalNet-8B
Run Inference:
- For speech-to-text (S2T) inference:
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
- For speech-to-speech (S2S) inference:
python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
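To batch S2S inference over a directory of recordings, one option is to assemble the same command line programmatically. This wrapper is a sketch of that idea, not a script shipped with the repo; the audio directory name is a placeholder, and the commands should be run from the VocalNet repo root.

```python
import subprocess
from pathlib import Path

def s2s_command(audio_path: str, save_dir: str = "./") -> list:
    """Build the S2S inference command shown above for one audio file."""
    return [
        "python3", "omni_speech/infer/vocalnet.py",
        "--query_audio", audio_path,
        "--s2s", "--save_dir", save_dir,
    ]

# Placeholder directory; each file gets its own inference run.
for wav in sorted(Path("./my_audio").glob("*.wav")):
    subprocess.run(s2s_command(str(wav)), check=True)
```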
Performance Evaluation
VocalNet-8B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.
Overall Performance
| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | 5.59 | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | **5.67** | 6.16 |
Response Alignment and Acoustic Quality
| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
|---|---|---|---|---|---|---|---|---|---|---|
| **Tiny Models** | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| **Base Models** | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | 2.65 | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | 4.21 | 4.485 | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | **3.11** | **4.492** | 3.56 | 4.489 |
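In this table, WER measures how closely the ASR transcript of the synthesized speech matches the model's text response (lower is better), while UTMOS is a predicted mean-opinion score of acoustic quality (higher is better). For reference, a minimal word-level WER implementation, written here from the standard definition rather than taken from the VocalNet evaluation code, looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```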
Citation
If you find our work useful, please cite:
@article{wang2025vocalnet,
title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
journal={arXiv preprint arXiv:2504.04060},
year={2025}
}