	---
	license: mit
	datasets:
	- s3prl/mini_voxceleb1
	language:
	- en
	metrics:
	- accuracy
	pipeline_tag: audio-classification
	tags:
	- speech
	- text
	- cross-modal
	- unified model
	- self-supervised learning
	- SpeechT5
	- Speaker Identification
	- Speaker Recognition
	---

	## SpeechT5 SID

	| [GitHub](https://github.com/microsoft/SpeechT5) | [Hugging Face](https://huggingface.co/mechanicalsea/speecht5-sid) |

	This manifest is an attempt to recreate the Speaker Identification recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), which contains over 100,000 utterances from 1,251 celebrities. The identification split is given as follows.

	|                     |   train | valid |  test |
	| ------------------- | ------: | ----: | ----: |
	| # of speakers       |   1,251 | 1,251 | 1,251 |
	| # of utterances     | 138,361 | 6,904 | 8,251 |

	### Requirements

	- [Fairseq](https://github.com/facebookresearch/fairseq)

	### Tools

	- `manifest/utils` is used to produce the manifest and to run training, validation, and evaluation.
	- `manifest/iden_split.txt` and `manifest/vox1_meta.csv` are the officially released VoxCeleb1 files.
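As a rough illustration of how the split counts in the table above could be derived from `iden_split.txt`, here is a minimal sketch. It assumes each line has the form `<split_id> <speaker_id>/<video_id>/<utterance>.wav`, with `1`, `2`, and `3` denoting train, valid, and test. The function names `parse_iden_split` and `summarize` are hypothetical and are not part of `manifest/utils`.

```python
# Assumed mapping from the leading integer on each line to a split name.
SPLIT_NAMES = {1: "train", 2: "valid", 3: "test"}


def parse_iden_split(lines):
    """Group utterance paths by split.

    Each non-empty line is assumed to look like
    "<split_id> <speaker_id>/<video_id>/<utterance>.wav".
    """
    splits = {name: [] for name in SPLIT_NAMES.values()}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        split_id, path = line.split(maxsplit=1)
        splits[SPLIT_NAMES[int(split_id)]].append(path)
    return splits


def summarize(splits):
    """Return {split_name: (num_speakers, num_utterances)}."""
    summary = {}
    for name, paths in splits.items():
        # The speaker ID is assumed to be the first path component.
        speakers = {path.split("/")[0] for path in paths}
        summary[name] = (len(speakers), len(paths))
    return summary


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from the real file.
    sample = [
        "1 id10001/vidA/00001.wav",
        "1 id10002/vidB/00001.wav",
        "2 id10001/vidA/00002.wav",
        "3 id10002/vidB/00002.wav",
    ]
    print(summarize(parse_iden_split(sample)))
```

To run it on the real file, the lines could be read with `open("manifest/iden_split.txt")` and passed to `parse_iden_split`, provided the file follows the assumed format.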

	### Model and Results

	- [`speecht5_sid.pt`](./speecht5_sid.pt) is a reimplementation of Speaker Identification fine-tuning on the released manifest, using a smaller batch size (to check that the manifest is correct).
	- `results` were reproduced with the released fine-tuned model, which reaches an accuracy of $96.194\%$.
	- `log` is the TensorBoard log of fine-tuning the released model.
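For reference, accuracy on a speaker identification task is usually the share of utterances whose predicted speaker matches the true speaker. The sketch below shows that calculation under the assumption that predictions and labels are available as parallel lists of speaker IDs. The function `top1_accuracy` is a hypothetical helper and not the evaluation code used to produce `results`.

```python
def top1_accuracy(predictions, labels):
    """Return the percentage of predictions that match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one utterance is required")
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # Hypothetical toy values for illustration only.
    preds = ["id10001", "id10002", "id10003", "id10001"]
    truth = ["id10001", "id10002", "id10003", "id10004"]
    print(f"{top1_accuracy(preds, truth):.3f}%")
```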

	### Reference

	If you find our work useful in your research, please cite the following paper:

	```bibtex
	@inproceedings{ao-etal-2022-speecht5,
	title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
	author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	month = {May},
	year = {2022},
	pages={5723--5738},
	}
	```