🎧 VocalNet-8B Model Card

VocalNet-8B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon LLaMA-3.1-8B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. 🚀
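
As a rough intuition for the speed gain from MTP: with k prediction heads, each forward pass emits k speech tokens instead of one, so the number of decoding steps drops roughly k-fold. The loop below is an illustrative sketch only, assuming a hypothetical predict_next_k interface; it is not VocalNet's actual decoding code.

    # Illustrative MTP decoding loop; `model` and `predict_next_k` are
    # hypothetical stand-ins, not VocalNet's real API.
    def mtp_decode(model, prompt_tokens, k=5, max_tokens=512):
        tokens = list(prompt_tokens)
        while len(tokens) < max_tokens:
            step = model.predict_next_k(tokens, k)  # k speech tokens per forward pass
            tokens.extend(step)
            if model.eos_id in step:  # stop once the end-of-speech token appears
                break
        return tokens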

📂 Paper, Code and Model Access

• Paper: https://arxiv.org/abs/2504.04060
• Code: https://github.com/SJTU-OmniAgent/VocalNet
• Model: https://huggingface.co/VocalNet/VocalNet-8B

🔧 Repository Download and Environment Setup

To get started with VocalNet-8B, clone the repository and set up the environment as follows. 🛠️

  1. Clone the Repository:

    git clone https://github.com/SJTU-OmniAgent/VocalNet.git
    cd VocalNet
    
  2. Create and Activate Environment:

    conda create -n vocalnet python==3.10
    conda activate vocalnet
    
  3. Install Dependencies:

    pip install --upgrade pip
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
    pip install -e .
    
  4. Install Training Packages (optional): if you plan to train the model, install the additional packages:

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    

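After installing the dependencies, you can sanity-check the environment with a short Python snippet (a minimal sketch; the expected versions match the pinned installs above):

    import torch
    import torchaudio

    print(torch.__version__)          # expect 2.1.2
    print(torchaudio.__version__)     # expect 2.1.2
    print(torch.cuda.is_available())  # True if the CUDA 12.1 build sees a GPU
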
📥 Download Instructions

Via the Hugging Face CLI:

pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-8B --local-dir ./checkpoints/

Via Snapshot Download:

With the huggingface_hub package installed (pip install -U huggingface_hub), run the following in Python:

from huggingface_hub import snapshot_download

snapshot_download(
  repo_id="VocalNet/VocalNet-8B",
  local_dir="./checkpoints/",
  resume_download=True  # optional; recent huggingface_hub versions always resume
)

Via Git:

git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-8B

๐Ÿ› ๏ธ Dependencies

🔄 Local Inference

To perform inference with VocalNet-8B, follow these steps to set up and run the model locally. 📡

  1. Model Preparation:

    • Download VocalNet-8B from HuggingFace or ModelScope (a scripted download sketch follows this list). 📦
    • Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the ./models/speech_encoder/ directory. 🎤
  2. CosyVoice Preparation:

    • VocalNet-8B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace. 🔊
  3. Path Modification:

    • Update the paths in omni_speech/infer/vocalnet.py to point to the downloaded models:
      COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
      VOCALNET_MODEL=""  # Path to VocalNet-8B, e.g., ./checkpoints/VocalNet-8B
      
  4. Run Inference:

    • For speech-to-text (S2T) inference:
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
      
    • For speech-to-speech (S2S) inference:
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
      

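The model downloads in steps 1 and 2 can also be scripted. Below is a minimal sketch using snapshot_download; openai/whisper-large-v3 is the standard Whisper repository, while the CosyVoice2 repo ID is an assumption inferred from the example path above, so verify it against the VocalNet repository:

    from huggingface_hub import snapshot_download

    # VocalNet-8B weights (Model Preparation, step 1)
    snapshot_download(repo_id="VocalNet/VocalNet-8B",
                      local_dir="./checkpoints/VocalNet-8B")

    # Whisper-large-v3 speech encoder (Model Preparation, step 1)
    snapshot_download(repo_id="openai/whisper-large-v3",
                      local_dir="./models/speech_encoder/whisper-large-v3")

    # CosyVoice2 vocoder (CosyVoice Preparation, step 2); the repo ID is an
    # assumption, so check the VocalNet README for the exact checkpoint
    snapshot_download(repo_id="FunAudioLLM/CosyVoice2-0.5B",
                      local_dir="./pretrained_models/CosyVoice2-0.5B")
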
📊 Performance Evaluation

VocalNet-8B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.

Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
| --- | --- | --- | --- | --- | --- | --- |
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | 5.59 | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | **5.67** | 6.16 |

Response Alignment and Acoustic Quality

For WER, lower is better; for UTMOS, higher is better.

| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Tiny Models** | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| **Base Models** | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | 2.65 | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | 4.21 | 4.485 | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | **3.11** | **4.492** | 3.56 | 4.489 |

โœ๏ธ Citation

If you find our work useful, please cite:

@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}