---
language: All languages
datasets: ISML datasets (80 thousand hours of unlabeled data) + Babel datasets (2 thousand hours of unlabeled data)
---

# Chinese W2v-conformer
## Model description
This is a W2v-conformer speech model pre-trained with UER-py. You can download the model from the [UER-py GitHub page](https://github.com/dbiir/UER-py/).

## How to use
You can load the model with the WeNet toolkit and use it for speech recognition:
```python
>>> import yaml
>>> from wenet.transformer.asr_model import ASRModel
>>> from wenet.transformer.encoder import ConformerEncoder
>>> from wenet.transformer.decoder import TransformerDecoder
>>> from wenet.transformer.ctc import CTC
>>> from wenet.utils.checkpoint import load_checkpoint
>>> # Load the training configuration first; the model hyper-parameters
>>> # (encoder_conf, decoder_conf, model_conf) are read from it below.
>>> with open(args.config, 'r') as fin:
...     configs = yaml.load(fin, Loader=yaml.FullLoader)
>>> # input_dim is the acoustic feature dimension and vocab_size the size
>>> # of the output token dictionary; both depend on your data preparation.
>>> encoder = ConformerEncoder(input_dim, **configs['encoder_conf'])
>>> decoder = TransformerDecoder(vocab_size, encoder.output_size(), **configs['decoder_conf'])
>>> ctc = CTC(vocab_size, encoder.output_size())
>>> model = ASRModel(
...     vocab_size=vocab_size,
...     encoder=encoder,
...     decoder=decoder,
...     ctc=ctc,
...     **configs['model_conf'],
... )
>>> infos = load_checkpoint(model, args.checkpoint)
```
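
Once the checkpoint is loaded, the model can decode acoustic features. The following is a minimal sketch, assuming 80-dimensional fbank features and wenet's `ctc_greedy_search` method; the tensors shown are dummy placeholders, and real features should come from your data pipeline:

```python
>>> import torch
>>> model.eval()
>>> # Dummy inputs for illustration: feats is (batch, frames, feature_dim),
>>> # feats_lengths holds the number of valid frames per utterance.
>>> feats = torch.randn(1, 200, 80)   # assumes 80-dim fbank features
>>> feats_lengths = torch.tensor([200])
>>> with torch.no_grad():
...     hyps = model.ctc_greedy_search(feats, feats_lengths)
>>> # hyps is a list of token-id sequences; map the ids back to characters
>>> # with your dictionary to obtain the transcript.
```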

## Training data
ISML datasets (80 thousand hours of unlabeled data) and Babel datasets (2 thousand hours of unlabeled data) are used as training data.
## Training procedure
The model is pre-trained with wav2vec 2.0 (https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 70 epochs with a batch size of 128 and use the same hyper-parameters across the different model sizes.

The downstream models are then fine-tuned.

Stage 1:
```
python wenet/bin/train.py --gpu 0,1,2,3,4,5,6,7 \
    --config $train_config \
    --train_data train.data \
    --cv_data dev.data \
    ${checkpoint:+--checkpoint $checkpoint} \
    --model_dir $dir \
    --ddp.init_method $init_method \
    --ddp.world_size 8 \
    --ddp.dist_backend nccl \
    --num_workers 2
```
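
Here `$train_config` is a wenet-style YAML file that supplies the `encoder_conf`, `decoder_conf`, and `model_conf` sections consumed in the snippet above. The skeleton below is purely illustrative, using values common in wenet's conformer examples rather than the actual configuration of this model:

```yaml
# Illustrative skeleton only; values follow common wenet conformer examples.
encoder: conformer
encoder_conf:
    output_size: 256        # attention dimension
    attention_heads: 4
    linear_units: 2048
    num_blocks: 12
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
model_conf:
    ctc_weight: 0.3         # weight of the CTC branch vs. the attention decoder
    lsm_weight: 0.1         # label smoothing
```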

### BibTeX entry and citation info
```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}

@article{zhang2020pushing,
  title={Pushing the limits of semi-supervised learning for automatic speech recognition},
  author={Zhang, Yu and Qin, James and Park, Daniel S and Han, Wei and Chiu, Chung-Cheng and Pang, Ruoming and Le, Quoc V and Wu, Yonghui},
  journal={arXiv preprint arXiv:2010.10504},
  year={2020}
}

@article{zhang2021wenet,
  title={WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit},
  author={Zhang, Binbin and Wu, Di and Yang, Chao and Chen, Xiaoyu and Peng, Zhendong and Wang, Xiangming and Yao, Zhuoyuan and Wang, Xiong and Yu, Fan and Xie, Lei and others},
  journal={arXiv preprint arXiv:2102.01547},
  year={2021}
}
```