[UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)

Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

**Abstract**
*In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.*

The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech.
This is a multi-lingually pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be used for inference. The model was pre-trained on English, Spanish, French, and Italian and should therefore perform well only on those or similar languages.

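As a rough sketch of what preparing the model for fine-tuning can look like with 🤗 Transformers: the checkpoint id and the vocabulary size below are placeholder assumptions, not values taken from this card.

```python
# Minimal sketch: load the pre-trained encoder with a fresh CTC head
# for fine-tuning. The checkpoint id and vocab_size are assumptions --
# replace them with this repository's id and your phoneme vocabulary size.
import torch
from transformers import UniSpeechForCTC, Wav2Vec2FeatureExtractor

checkpoint = "microsoft/unispeech-large-multi-lingual-1500h-cv"  # assumed id

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = UniSpeechForCTC.from_pretrained(checkpoint, vocab_size=42, pad_token_id=0)

# Sanity check with one second of dummy 16 kHz audio.
inputs = feature_extractor(torch.randn(16000).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, time, vocab_size)
```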
**Note**: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence of phonemes before fine-tuning.
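One way to do this is with the open-source `phonemizer` package; this is only an illustrative sketch, since the card does not specify the exact phonemization setup used during pre-training.

```python
# Illustrative sketch: convert transcripts to phoneme strings with the
# `phonemizer` package (espeak backend). The exact phonemizer used for
# pre-training is not specified in this card, so treat this as one option.
from phonemizer import phonemize

transcripts = ["hello world", "speech recognition"]

phoneme_transcripts = phonemize(
    transcripts,
    language="en-us",   # pick the language code matching each transcript
    backend="espeak",
    strip=True,
)
print(phoneme_transcripts)
```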
 
 