---
language: All languages
datasets: ISML datasets (80 thousand hours of unlabeled data) + Babel datasets (2 thousand hours of unlabeled data)
---

# Chinese W2v-conformer

## Model description

This is the set of Speech W2v-conformer models pre-trained by UER-py. You can download the model from the [UER-py Github page](https://github.com/dbiir/UER-py/).

## How to use

You can build the model and load the pre-trained checkpoint for speech recognition as follows (an illustrative `configs` layout and a minimal decoding sketch are given at the end of this card):

```python
>>> import yaml
>>> from wenet.transformer.asr_model import ASRModel
>>> from wenet.transformer.encoder import ConformerEncoder
>>> from wenet.transformer.decoder import TransformerDecoder
>>> from wenet.transformer.ctc import CTC
>>> from wenet.utils.checkpoint import load_checkpoint
>>> # Load the model configuration first: it supplies the keyword
>>> # arguments for the encoder, decoder and model constructors.
>>> with open(args.config, 'r') as fin:
...     configs = yaml.load(fin, Loader=yaml.FullLoader)
>>> # input_dim is the acoustic feature dimension and vocab_size the size
>>> # of the output vocabulary; both come from your data preparation.
>>> encoder = ConformerEncoder(input_dim, **configs['encoder_conf'])
>>> decoder = TransformerDecoder(vocab_size, encoder.output_size(),
...                              **configs['decoder_conf'])
>>> ctc = CTC(vocab_size, encoder.output_size())
>>> model = ASRModel(
...     vocab_size=vocab_size,
...     encoder=encoder,
...     decoder=decoder,
...     ctc=ctc,
...     **configs['model_conf'],
... )
>>> infos = load_checkpoint(model, args.checkpoint)
```

## Training data

ISML datasets (80 thousand hours of unlabeled data) and Babel datasets (2 thousand hours of unlabeled data) are used as training data.

## Training procedure

The model is pre-trained with the wav2vec 2.0 objective in [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 70 epochs with a batch size of 128, and use the same hyper-parameters across the different model sizes. The downstream models are then fine-tuned:

Stage 1:

```
python wenet/bin/train.py --gpu 0,1,2,3,4,5,6,7 \
    --config $train_config \
    --train_data train.data \
    --cv_data dev.data \
    ${checkpoint:+--checkpoint $checkpoint} \
    --model_dir $dir \
    --ddp.init_method $init_method \
    --ddp.world_size 7 \
    --ddp.dist_backend nccl \
    --num_workers 2
```

### BibTeX entry and citation info

```
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
@article{zhang2020pushing,
  title={Pushing the limits of semi-supervised learning for automatic speech recognition},
  author={Zhang, Yu and Qin, James and Park, Daniel S and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V and Wu, Yonghui},
  journal={arXiv preprint arXiv:2010.10504},
  year={2020}
}
@article{zhang2021wenet,
  title={WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit},
  author={Zhang, Binbin and Wu, Di and Yang, Chao and Chen, Xiaoyu and Peng, Zhendong and Wang, Xiangming and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Xie, Lei and others},
  journal={arXiv preprint arXiv:2102.01547},
  year={2021}
}
```
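
## Illustrative configuration

The snippet in *How to use* reads `configs['encoder_conf']`, `configs['decoder_conf']` and `configs['model_conf']` from a WeNet-style YAML file. As a hedged illustration only, the sketch below shows the general shape of such a configuration as the Python dict that `yaml.load` would return; the field values are placeholders, not the actual settings of the released checkpoints.

```python
>>> # Illustrative shape of a WeNet-style config; the values are placeholders,
>>> # not the hyper-parameters used for this model.
>>> configs = {
...     'encoder_conf': {
...         'output_size': 256,        # encoder output dimension
...         'attention_heads': 4,
...         'linear_units': 2048,      # feed-forward inner dimension
...         'num_blocks': 12,          # number of conformer blocks
...         'dropout_rate': 0.1,
...     },
...     'decoder_conf': {
...         'attention_heads': 4,
...         'linear_units': 2048,
...         'num_blocks': 6,
...         'dropout_rate': 0.1,
...     },
...     'model_conf': {
...         'ctc_weight': 0.3,         # weight of the CTC loss vs. attention
...         'lsm_weight': 0.1,         # label smoothing
...     },
... }
```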
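
## Decoding sketch

The snippet in *How to use* constructs and loads the model but does not run recognition. Below is a minimal sketch of CTC greedy decoding, assuming the `ctc_greedy_search` helper on WeNet's `ASRModel`; the feature shapes and the `char_dict` id-to-token mapping are placeholders, not part of this card.

```python
>>> import torch
>>> model.eval()
>>> # feats: (batch, frames, input_dim) acoustic features; feats_lengths:
>>> # (batch,) valid frame counts. Random values stand in for real features.
>>> feats = torch.randn(1, 1000, input_dim)
>>> feats_lengths = torch.tensor([1000])
>>> with torch.no_grad():
...     # CTC greedy search over the encoder output; note that newer WeNet
...     # versions return (hyps, scores) instead of hyps alone.
...     hyps = model.ctc_greedy_search(feats, feats_lengths)
>>> # Map token ids back to symbols with your vocabulary (char_dict is an
>>> # assumed id -> token mapping built during data preparation).
>>> text = ''.join(char_dict[t] for t in hyps[0])
```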