	## DiffSinger (SVS version)

	### PART1. [Run DiffSinger on PopCS](README-SVS-popcs.md)
	This part focuses only on spectrum modeling (the acoustic model) and, following [1][2][3], assumes that the ground-truth (GT) F0 is given as the pitch information.

	Thus, the pipeline of this part can be summarized as:

	```
	[lyrics] -> [linguistic representation] (Frontend)
	[linguistic representation] + [GT F0] + [GT phoneme duration] -> [mel-spectrogram] (Acoustic model)
	[mel-spectrogram] + [GT F0] -> [waveform] (Vocoder)
	```
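
The acoustic-model step above combines phoneme-level inputs with a frame-level F0 curve, so the two have to be brought to the same length. A minimal sketch of one common way to do this (repeating each phoneme for its duration in frames) is shown below. The function name, the toy phonemes, and the convention that 0.0 Hz marks an unvoiced frame are assumptions for illustration and are not taken from the DiffSinger code.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme durations[i] times to obtain a frame-level sequence."""
    if len(phonemes) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[str] = []
    for phoneme, dur in zip(phonemes, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * dur)
    return frames


# Hypothetical toy input: three phonemes with GT durations given in frames.
phonemes = ["n", "i", "SP"]
gt_durations = [2, 4, 1]
# Hypothetical GT F0 in Hz, one value per frame; 0.0 marks an unvoiced frame (assumed convention).
gt_f0 = [220.0, 220.0, 223.5, 225.0, 224.0, 221.0, 0.0]

frame_phonemes = expand_by_duration(phonemes, gt_durations)
# After expansion, the linguistic sequence and the F0 curve are frame-aligned.
assert len(frame_phonemes) == len(gt_f0)
acoustic_inputs = list(zip(frame_phonemes, gt_f0))
```

Each element of `acoustic_inputs` pairs one frame's phoneme with its F0 value, which is the kind of aligned input the acoustic model line in the pipeline describes.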


	[1] Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. Interspeech 2020.

	[2] Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. ICASSP 2020.

	[3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

	### PART2. [Run DiffSinger on Opencpop](README-SVS-opencpop-cascade.md)
	Thanks to the [Opencpop team](https://wenet.org.cn/opencpop/) for releasing their SVS dataset with MIDI labels (Jan. 20, 2022). Thanks also to my co-author [Yi Ren](https://github.com/RayeRen), who applied for the dataset and did some of the preprocessing work for this part.

	Because the dataset provides carefully annotated MIDI labels, we can extend the pipeline in PART 1 by adding a naive melody frontend.

	#### 2.1
	The pipeline of [this part](README-SVS-opencpop-cascade.md) can be summarized as:

	```
	[lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration] (Melody frontend)
	[linguistic representation] + [predicted F0] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
	[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
	```
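
To make the role of the MIDI input concrete, the sketch below converts MIDI note numbers to frequencies with the standard equal-temperament formula and builds a flat, note-level pitch contour. This is only an illustration of the raw pitch information that MIDI labels carry; it is not the melody frontend used in this repository, which predicts the F0 curve with a model. The function names and the use of `None` for a rest (mapped to 0.0 Hz) are assumptions.

```python
from typing import List, Optional, Sequence


def midi_to_hz(note: float, a4_hz: float = 440.0) -> float:
    """Convert a MIDI note number to Hz (equal temperament, A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69.0) / 12.0)


def flat_contour(notes: Sequence[Optional[int]], frames_per_note: Sequence[int]) -> List[float]:
    """Hold each note's frequency for its frame count; a rest (None) becomes 0.0 Hz."""
    contour: List[float] = []
    for note, n_frames in zip(notes, frames_per_note):
        hz = 0.0 if note is None else midi_to_hz(note)
        contour.extend([hz] * n_frames)
    return contour


# Hypothetical score: A4 for 2 frames, a rest for 1 frame, A5 for 2 frames.
contour = flat_contour([69, None, 81], [2, 1, 2])
print(contour)  # [440.0, 440.0, 0.0, 880.0, 880.0]
```

A real sung F0 curve deviates from this staircase (vibrato, glides, onsets), which is why the pipeline uses a predicted F0 rather than the note frequencies directly.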

	#### 2.2
	In 2.1, we find that predicting F0 explicitly in the melody frontend leads to many bad cases of unvoiced/voiced (uv/v) prediction. We therefore abandon the explicit prediction of the F0 curve in the melody frontend and instead predict it jointly with the spectrograms.

	Thus, the pipeline of [this part](README-SVS-opencpop-e2e.md) can be summarized as:
	```
	[lyrics] + [MIDI] -> [linguistic representation] + [predicted phoneme duration] (Melody frontend)
	[linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
	[mel-spectrogram] -> [predicted F0] (Pitch extractor)
	[mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
	```
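
The uv/v problem that motivates 2.2 can be quantified by comparing the voicing decisions of a predicted F0 curve against a reference. The sketch below is one simple way to do that, assuming F0 is a per-frame list in Hz where 0.0 marks an unvoiced frame. The function names, the convention, and the toy values are assumptions and do not come from the DiffSinger code.

```python
from typing import List, Sequence


def voiced_flags(f0: Sequence[float]) -> List[bool]:
    """A frame counts as voiced when its F0 is above 0 Hz (assumed convention)."""
    return [hz > 0.0 for hz in f0]


def uv_error_rate(f0_ref: Sequence[float], f0_pred: Sequence[float]) -> float:
    """Fraction of frames whose voiced/unvoiced decision differs from the reference."""
    if len(f0_ref) != len(f0_pred):
        raise ValueError("reference and prediction must have the same number of frames")
    if not f0_ref:
        raise ValueError("at least one frame is required")
    ref = voiced_flags(f0_ref)
    pred = voiced_flags(f0_pred)
    mismatches = sum(r != p for r, p in zip(ref, pred))
    return mismatches / len(ref)


# Hypothetical toy curves: frame 2 is wrongly unvoiced, frame 4 is wrongly voiced.
f0_ref = [0.0, 220.0, 221.0, 0.0, 0.0, 330.0]
f0_pred = [0.0, 219.0, 0.0, 0.0, 180.0, 331.0]
rate = uv_error_rate(f0_ref, f0_pred)  # 2 of 6 frames disagree
```

A metric like this only checks the voicing decision, not the pitch accuracy inside voiced regions, so it isolates the kind of error described in 2.2.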