docs/README-SVS.md · justyoung/DiffSinger at c99418df68696cfe75f8d0a008b229b62b9e04b7

DiffSinger (SVS version)

PART1. Run DiffSinger on PopCS

In this part, we only focus on spectrum modeling (acoustic model) and assume the ground-truth (GT) F0 to be given as the pitch information following these papers [1][2][3].

Thus, the pipeline of this part can be summarized as:

[lyrics] -> [linguistic representation] (Frontend)
[linguistic representation] + [GT F0] + [GT phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] + [GT F0] -> [waveform] (Vocoder)

[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.

[2] SEQUENCE-TO-SEQUENCE SINGING SYNTHESIS USING THE FEED-FORWARD TRANSFORMER. ICASSP 2020.

[3] DeepSinger : Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

PART2. Run DiffSinger on Opencpop

Thanks Opencpop team for releasing their SVS dataset with MIDI label, Jan.20, 2022. (Also thanks to my co-author Yi Ren, who applied for the dataset and did some preprocessing works for this part).

Since there are elaborately annotated MIDI labels, we are able to supplement the pipeline in PART 1 by adding a naive melody frontend.

2.1

Thus, the pipeline of this part can be summarized as:

[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration] (Melody frontend)
[linguistic representation] + [predicted F0] + [predicted phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)

2.2

In 2.1, we find that if we predict F0 explicitly in the melody frontend, there will be many bad cases of uv/v prediction. Then, we abandon the explicit prediction of the F0 curve in the melody frontend but make a joint prediction with spectrograms.

Thus, the pipeline of this part can be summarized as:

[lyrics] + [MIDI] -> [linguistic representation] + [predicted phoneme duration] (Melody frontend)
[linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram]  (Acoustic model)
[mel-spectrogram] -> [predicted F0]  (Pitch extractor)
[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)