Spaces:

amphion
/

Text-to-Speech

Running

App Files Files Community

Text-to-Speech / egs /metrics /README.md

zyingt

Upload 685 files

0d80816 8 months ago

preview code

raw

history blame contribute delete

No virus

3.88 kB

	# Amphion Evaluation Recipe

	## Supported Evaluation Metrics

	Until now, Amphion Evaluation has supported the following objective metrics:

	- F0 Modeling:
	- F0 Pearson Coefficients (FPC)
	- F0 Periodicity Root Mean Square Error (PeriodicityRMSE)
	- F0 Root Mean Square Error (F0RMSE)
	- Voiced/Unvoiced F1 Score (V/UV F1)
	- Energy Modeling:
	- Energy Root Mean Square Error (EnergyRMSE)
	- Energy Pearson Coefficients (EnergyPC)
	- Intelligibility:
	- Character Error Rate (CER) based on [Whipser](https://github.com/openai/whisper)
	- Word Error Rate (WER) based on [Whipser](https://github.com/openai/whisper)
	- Spectrogram Distortion:
	- Frechet Audio Distance (FAD)
	- Mel Cepstral Distortion (MCD)
	- Multi-Resolution STFT Distance (MSTFT)
	- Perceptual Evaluation of Speech Quality (PESQ)
	- Short Time Objective Intelligibility (STOI)
	- Scale Invariant Signal to Distortion Ratio (SISDR)
	- Scale Invariant Signal to Noise Ratio (SISNR)
	- Speaker Similarity:
	- Cosine similarity based on [Rawnet3](https://github.com/Jungjee/RawNet)
	- Cosine similarity based on [WeSpeaker](https://github.com/wenet-e2e/wespeaker) (👨‍💻 developing)

	We provide a recipe to demonstrate how to objectively evaluate your generated audios. There are three steps in total:

	1. Pretrained Models Preparation
	2. Audio Data Preparation
	3. Evaluation

	## 1. Pretrained Models Preparation

	If you want to calculate `RawNet3` based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).

	## 2. Aduio Data Preparation

	Prepare reference audios and generated audios in two folders, the `ref_dir` contains the reference audio and the `gen_dir` contains the generated audio. Here is an example.

	```plaintext
	┣ {ref_dir}
	┃ ┣ sample1.wav
	┃ ┣ sample2.wav
	┣ {gen_dir}
	┃ ┣ sample1.wav
	┃ ┣ sample2.wav
	```

	You have to make sure that the pairwise reference audio and generated audio are named the same, as illustrated above (sample1 to sample1, sample2 to sample2).

	## 3. Evaluation

	Run the `run.sh` with specified refenrece folder, generated folder, dump folder and metrics.

	```bash
	cd Amphion
	sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics [The metrics you need] \
	```

	As for the metrics, an example is provided below:

	```bash
	--metrics "mcd pesq fad"
	```

	All currently available metrics keywords are listed below:

	\| Keys \| Description \|
	\| --------------------- \| ------------------------------------------ \|
	\| `fpc` \| F0 Pearson Coefficients \|
	\| `f0_periodicity_rmse` \| F0 Periodicity Root Mean Square Error \|
	\| `f0rmse` \| F0 Root Mean Square Error \|
	\| `v_uv_f1` \| Voiced/Unvoiced F1 Score \|
	\| `energy_rmse` \| Energy Root Mean Square Error \|
	\| `energy_pc` \| Energy Pearson Coefficients \|
	\| `cer` \| Character Error Rate \|
	\| `wer` \| Word Error Rate \|
	\| `speaker_similarity` \| Cos Similarity based on RawNet3 \|
	\| `fad` \| Frechet Audio Distance \|
	\| `mcd` \| Mel Cepstral Distortion \|
	\| `mstft` \| Multi-Resolution STFT Distance \|
	\| `pesq` \| Perceptual Evaluation of Speech Quality \|
	\| `si_sdr` \| Scale Invariant Signal to Distortion Ratio \|
	\| `si_snr` \| Scale Invariant Signal to Noise Ratio \|
	\| `stoi` \| Short Time Objective Intelligibility \|