---
title: YingMusic-Singer-Plus
emoji: 🎤
colorFrom: pink
colorTo: blue
sdk: gradio
python_version: '3.10'
app_file: app.py
tags:
  - singing-voice-synthesis
  - lyric-editing
  - diffusion-model
  - reinforcement-learning
short_description: Edit lyrics, keep the melody
fullWidth: true
---

# 🎤 YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

English | 中文


arXiv Paper · GitHub · Demo Page · HuggingFace Space · HuggingFace Model · LyricEditBench Dataset · Discord · WeChat · Lab

Chunbo Hao 1,2 · Junjie Zheng 2 · Guobin Ma 1 · Yuepeng Jiang 1 · Huakang Chen 1 · Wenjie Tian 1 · Gongyu Chen 2 · Zihao Chen 2 · Lei Xie 1

1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
2 AI Lab, GiantNetwork, China

YingMusic-Singer Architecture

Overall architecture of YingMusic-Singer-Plus. Left: SFT training pipeline. Right: GRPO training pipeline.

## 📖 Introduction

YingMusic-Singer-Plus is a fully diffusion-based singing voice synthesis model that enables melody-controllable singing voice editing with flexible lyric manipulation, requiring no manual alignment or precise phoneme annotation.

Given only three inputs (an optional timbre reference, a melody-providing singing clip, and the modified lyrics), YingMusic-Singer-Plus synthesizes high-fidelity singing voices at 44.1 kHz while faithfully preserving the original melody.

## ✨ Key Features

- **Annotation-free**: No manual lyric-MIDI alignment required at inference
- **Flexible lyric manipulation**: Supports 6 editing types: partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
- **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
- **Bilingual**: Unified IPA tokenizer for both Chinese and English
- **High fidelity**: 44.1 kHz stereo output via the Stable Audio 2 VAE

## 🚀 Quick Start

### Option 1: Install from Scratch

```bash
conda create -n YingMusic-Singer-Plus python=3.10
conda activate YingMusic-Singer-Plus

# uv is much faster than pip
pip install uv
uv pip install -r requirements.txt
```

### Option 2: Pre-built Conda Environment

1. Download and install Miniconda from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
2. Download the pre-built environment package for your setup from the table below.
3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer-Plus`.
4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>`.

| CPU Architecture | GPU | OS | Download |
| --- | --- | --- | --- |
| ARM | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Linux | Coming soon |
| AMD64 | NVIDIA | Windows | Coming soon |

### Option 3: Docker

Build the image (Docker repository names must be lowercase):

```bash
docker build -t yingmusic-singer-plus .
```

Run inference:

```bash
docker run --gpus all -it yingmusic-singer-plus
```

## 🎵 Inference

### Option 1: Online Demo (HuggingFace Space)

Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer-Plus to try the model instantly in your browser.

### Option 2: Local Gradio App (same as online demo)

```bash
python app_local.py
```

### Option 3: Command-line Inference

```bash
python infer_api.py \
    --ref_audio path/to/ref.wav \
    --melody_audio path/to/melody.wav \
    --ref_text "该体谅的不执着|如果那天我" \
    --target_text "好多天|看不完你" \
    --output output.wav
```
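If you want to drive the CLI from Python instead of the shell, a thin wrapper can assemble the same command. This is a sketch: `run_inference` and its `dry_run` flag are our own names, not part of the repository, and the audio paths are placeholders.

```python
import subprocess

def run_inference(ref_audio, melody_audio, ref_text, target_text, output,
                  dry_run=False):
    """Assemble (and optionally execute) the infer_api.py command line.

    Flag names mirror the README; set dry_run=True to inspect the argv
    list without launching the model.
    """
    cmd = [
        "python", "infer_api.py",
        "--ref_audio", ref_audio,
        "--melody_audio", melody_audio,
        "--ref_text", ref_text,
        "--target_text", target_text,
        "--output", output,
    ]
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if inference fails
    return cmd

# Preview the command that would be executed:
cmd = run_inference("ref.wav", "melody.wav", "该体谅的不执着|如果那天我",
                    "好多天|看不完你", "output.wav", dry_run=True)
```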

Enable vocal separation and accompaniment mixing (`--separate_vocals` separates vocals from the input before processing; `--mix_accompaniment` mixes the synthesized vocal back with the accompaniment; comments after a `\` line continuation would break the command, so the flags are listed bare here):

```bash
python infer_api.py \
    --ref_audio ref.wav \
    --melody_audio melody.wav \
    --ref_text "..." \
    --target_text "..." \
    --separate_vocals \
    --mix_accompaniment \
    --output mixed_output.wav
```

### Option 4: Batch Inference

Note: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.

The input JSONL file should contain one JSON object per line, formatted as follows:

```json
{"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
```
Then launch the batch job:

```bash
python batch_infer.py \
    --input_type jsonl \
    --input_path /path/to/input.jsonl \
    --output_dir /path/to/output \
    --ckpt_path /path/to/ckpts \
    --num_gpus 4
```

Multi-process inference on LyricEditBench (melody control); the test set is downloaded automatically:

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_melody_control \
    --output_dir path/to/LyricEditBench_melody_control \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

Multi-process inference on LyricEditBench (singing edit):

```bash
python inference_mp.py \
    --input_type lyric_edit_bench_sing_edit \
    --output_dir path/to/LyricEditBench_sing_edit \
    --ckpt_path ASLP-lab/YingMusic-Singer-Plus \
    --num_gpus 8
```

๐Ÿ—๏ธ Model Architecture

YingMusic-Singer-Plus consists of four core components:

Component Description
VAE Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048ร—
Melody Extractor Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information
IPA Tokenizer Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment
DiT-based CFM Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024)

Total parameters: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
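The numbers above can be sanity-checked with a little arithmetic (values in millions, taken directly from the breakdown; the latent frame rate is our own derived figure, not a documented spec):

```python
# Parameter breakdown quoted above, in millions.
cfm, vae, melody = 453.6, 156.1, 117.6
total = cfm + vae + melody  # ~727.3M in total

# The Stable Audio 2 VAE downsamples 44.1 kHz stereo audio by 2048x,
# so the latent sequence runs at roughly 21.5 frames per second.
latent_rate = 44100 / 2048
```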

## 📊 LyricEditBench

We introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation, built on GTSinger. The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.

### Results

Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics (P: PER, S: SIM, F: F0-CORR, V: VS) are detailed in Section 3. Best results in bold.

LyricEditBench Results

๐Ÿ™ Acknowledgements

This work builds upon the following open-source projects:

## 📄 License

The code and model weights in this project are licensed under CC BY 4.0, except for the following:

The VAE model weights and inference code (in `src/YingMusic-Singer-Plus/utils/stable-audio-tools`) are derived from Stable Audio Open by Stability AI, and are licensed under the Stability AI Community License.
