🎧 VocalNet-8B Model Card

VocalNet-8B is a high-performance, low-latency speech large language model (LLM) optimized for real-time voice interaction. Built upon LLaMA-3.1-8B-Instruct, it employs multi-token prediction (MTP) to significantly enhance generation speed and quality, surpassing most mainstream speech and omni-modal LLMs. 🚀
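
As a rough intuition for the speed gain from MTP: with k prediction heads, each forward pass emits k speech tokens instead of one, so the number of decoding steps drops roughly k-fold. The loop below is an illustrative sketch only, assuming a hypothetical predict_next_k interface; it is not VocalNet's actual decoding code.

    # Illustrative MTP decoding loop; `model` and `predict_next_k` are
    # hypothetical stand-ins, not VocalNet's real API.
    def mtp_decode(model, prompt_tokens, k=5, max_tokens=512):
        tokens = list(prompt_tokens)
        while len(tokens) < max_tokens:
            step = model.predict_next_k(tokens, k)  # k speech tokens per forward pass
            tokens.extend(step)
            if model.eos_id in step:  # stop once the end-of-speech token appears
                break
        return tokens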

📂 Paper, Code and Model Access

• Paper: https://arxiv.org/abs/2504.04060
• Code: https://github.com/SJTU-OmniAgent/VocalNet
• Model: https://huggingface.co/VocalNet/VocalNet-8B

🔧 Repository Download and Environment Setup

To get started with VocalNet-8B, clone the repository and set up the environment as follows. 🛠️

  1. Clone the Repository:

    git clone https://github.com/SJTU-OmniAgent/VocalNet.git
    cd VocalNet
    
  2. Create and Activate Environment:

    conda create -n vocalnet python==3.10
    conda activate vocalnet
    
  3. Install Dependencies:

    pip install --upgrade pip
    conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
    pip install -e .
    
  4. Install Training Packages (optional): if you plan to train the model, install the additional packages:

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    

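After installing the dependencies, you can sanity-check the environment with a short Python snippet (a minimal sketch; the expected versions match the pinned installs above):

    import torch
    import torchaudio

    print(torch.__version__)          # expect 2.1.2
    print(torchaudio.__version__)     # expect 2.1.2
    print(torch.cuda.is_available())  # True if the CUDA 12.1 build sees a GPU
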
📥 Download Instructions

Via the Hugging Face CLI:

pip install -U huggingface_hub
huggingface-cli download VocalNet/VocalNet-8B --local-dir ./checkpoints/

Via Snapshot Download:

With the huggingface_hub package installed (pip install -U huggingface_hub), run the following in Python:

from huggingface_hub import snapshot_download

snapshot_download(
  repo_id="VocalNet/VocalNet-8B",
  local_dir="./checkpoints/",
  resume_download=True  # optional; recent huggingface_hub versions always resume
)

Via Git:

git lfs install
git clone https://huggingface.co/VocalNet/VocalNet-8B

๐Ÿ› ๏ธ Dependencies

🔄 Local Inference

To perform inference with VocalNet-8B, follow these steps to set up and run the model locally. 📡

  1. Model Preparation:

    • Download VocalNet-8B from HuggingFace or ModelScope (a scripted download sketch follows this list). 📦
    • Download the Whisper-large-v3 speech encoder from HuggingFace and place it in the ./models/speech_encoder/ directory. 🎤
  2. CosyVoice Preparation:

    • VocalNet-8B uses CosyVoice2-0.5B to convert generated speech tokens into audio waveforms. Download it from HuggingFace. 🔊
  3. Path Modification:

    • Update the paths in omni_speech/infer/vocalnet.py to point to the downloaded models:
      COSYVOICE_MODEL=""  # Path to CosyVoice2-0.5B, e.g., /workspace/CosyVoice/pretrained_models/CosyVoice2-0.5B-VocalNet
      VOCALNET_MODEL=""  # Path to VocalNet-8B, e.g., ./checkpoints/VocalNet-8B
      
  4. Run Inference:

    • For speech-to-text (S2T) inference:
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav
      
    • For speech-to-speech (S2S) inference:
      python3 omni_speech/infer/vocalnet.py --query_audio ./omni_speech/infer/llama_questions_42.wav --s2s --save_dir ./
      

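The model downloads in steps 1 and 2 can also be scripted. Below is a minimal sketch using snapshot_download; openai/whisper-large-v3 is the standard Whisper repository, while the CosyVoice2 repo ID is an assumption inferred from the example path above, so verify it against the VocalNet repository:

    from huggingface_hub import snapshot_download

    # VocalNet-8B weights (Model Preparation, step 1)
    snapshot_download(repo_id="VocalNet/VocalNet-8B",
                      local_dir="./checkpoints/VocalNet-8B")

    # Whisper-large-v3 speech encoder (Model Preparation, step 1)
    snapshot_download(repo_id="openai/whisper-large-v3",
                      local_dir="./models/speech_encoder/whisper-large-v3")

    # CosyVoice2 vocoder (CosyVoice Preparation, step 2); the repo ID is an
    # assumption, so check the VocalNet README for the exact checkpoint
    snapshot_download(repo_id="FunAudioLLM/CosyVoice2-0.5B",
                      local_dir="./pretrained_models/CosyVoice2-0.5B")
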
📊 Performance Evaluation

VocalNet-8B was evaluated on OpenAudioBench, covering AlpacaEval, LLaMA Questions, TriviaQA, and Web Questions. Bold indicates the optimal result in each subgroup.

Overall Performance

| Model | LLM Size | Modality | AlpacaEval | LLaMA Questions | TriviaQA | Web Questions |
| --- | --- | --- | --- | --- | --- | --- |
| **Tiny Models** | | | | | | |
| Mini-Omni | 0.5B | s→t | 1.84 | 2.7 | 0.12 | 0.22 |
| | | s→s | 1.80 | 2.7 | 0.08 | 0.20 |
| SLAM-Omni | 0.5B | s→t | 3.50 | 29.4 | 0.39 | 0.84 |
| | | s→s | 3.01 | 26.7 | 0.34 | 0.69 |
| VocalNet-1B (VA) | 1B | s→t | 5.38 | 70.3 | 3.38 | 4.93 |
| | | s→s | 4.83 | 61.0 | 2.78 | 4.47 |
| VocalNet-1B | 1B | s→t | **5.79** | **71.7** | **3.60** | **5.16** |
| | | s→s | **5.03** | **63.7** | **3.06** | **4.68** |
| **Base Models** | | | | | | |
| LLaMA-Omni | 8B | s→t | 5.31 | 69.7 | 4.44 | 5.44 |
| | | s→s | 3.89 | 55.1 | 2.44 | 4.00 |
| Freeze-Omni | 7B | s→t | 4.51 | 77.7 | 5.32 | 6.41 |
| | | s→s | 2.99 | 60.2 | 3.53 | 4.78 |
| GLM-4-Voice | 9B | s→t | 5.86 | 77.4 | 4.95 | 5.56 |
| | | s→s | 5.27 | 64.3 | 4.63 | 5.40 |
| Baichuan-Omni-1.5 | 7B | s→t | 5.20 | 77.6 | 5.72 | 6.12 |
| | | s→s | 4.10 | 61.2 | 4.13 | 5.18 |
| MiniCPM-o | 8B | s→t | 6.13 | 77.2 | **6.43** | **7.16** |
| | | s→s | 4.95 | 65.8 | 4.99 | 6.22 |
| Minmo* | 8B | s→t | - | 78.9 | 4.83 | 5.50 |
| | | s→s | **6.48** | 64.1 | 3.75 | 3.99 |
| Qwen2.5-Omni | 8B | s→t | 6.01 | 79.0 | 5.89 | 6.88 |
| | | s→s | 5.73 | **76.3** | 5.59 | **6.70** |
| VocalNet-8B (VA) | 8B | s→t | 7.05 | 77.1 | 6.15 | 6.34 |
| | | s→s | 6.30 | 71.4 | 5.24 | 5.81 |
| VocalNet-8B | 8B | s→t | **7.12** | **79.5** | 6.24 | 6.48 |
| | | s→s | 6.37 | 73.1 | **5.67** | 6.16 |

Response Alignment and Acoustic Quality

For WER, lower is better; for UTMOS, higher is better.

| Model | AlpacaEval WER | AlpacaEval UTMOS | LLaMA Questions WER | LLaMA Questions UTMOS | TriviaQA WER | TriviaQA UTMOS | Web Questions WER | Web Questions UTMOS | Avg WER | Avg UTMOS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Tiny Models** | | | | | | | | | | |
| Mini-Omni | 20.78 | 4.429 | 5.20 | 4.428 | 7.43 | 4.428 | 8.51 | 4.433 | 8.66 | 4.430 |
| SLAM-Omni | 5.52 | 4.439 | 5.55 | 4.467 | 6.16 | 4.470 | 6.50 | 4.461 | 6.17 | 4.464 |
| VocalNet-1B (VA) | **3.43** | **4.495** | 3.65 | **4.498** | **5.97** | **4.499** | 6.40 | 4.489 | 5.66 | **4.495** |
| VocalNet-1B | **3.43** | 4.491 | **3.27** | 4.497 | 6.73 | 4.486 | **4.88** | **4.493** | **5.31** | 4.491 |
| **Base Models** | | | | | | | | | | |
| LLaMA-Omni | 6.00 | 3.942 | 10.00 | 4.003 | 20.93 | 3.965 | 14.60 | 3.935 | 15.90 | 3.956 |
| Freeze-Omni | 14.33 | 4.377 | 14.20 | 4.417 | 20.39 | 4.404 | 18.25 | 4.398 | 18.31 | 4.401 |
| GLM-4-Voice | 18.71 | 4.025 | 14.45 | 4.152 | 8.33 | 4.306 | 6.08 | 4.214 | 8.99 | 4.228 |
| Baichuan-Omni-1.5 | 20.84 | 4.082 | 22.82 | 4.332 | 22.36 | 4.401 | 23.29 | 4.350 | 22.67 | 4.347 |
| MiniCPM-o | 15.35 | 4.102 | 5.73 | 4.228 | 8.08 | 4.128 | 8.94 | 4.125 | 8.72 | 4.137 |
| Qwen2.5-Omni | **2.41** | 4.299 | **0.93** | 4.315 | **1.13** | 4.339 | 4.68 | 4.363 | **2.63** | 4.342 |
| VocalNet-8B (VA) | 2.65 | **4.490** | 3.00 | **4.503** | 5.02 | **4.499** | 4.21 | 4.485 | 4.26 | **4.493** |
| VocalNet-8B | 4.71 | 4.489 | 2.68 | 4.500 | 4.04 | 4.482 | **3.11** | **4.492** | 3.56 | 4.489 |

โœ๏ธ Citation

If you find our work useful, please cite:

@article{wang2025vocalnet,
  title={VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation},
  author={Wang, Yuhao and Liu, Heyang and Cheng, Ziyang and Wu, Ronghua and Gu, Qunshan and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2504.04060},
  year={2025}
}