[UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)

Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

**Abstract**
*In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.*

The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech.
This is a multi-lingually pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be used for inference. The model was pre-trained on English, Spanish, French, and Italian and should therefore perform well only on those or similar languages.

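As a rough sketch of what preparing the model for fine-tuning can look like with 🤗 Transformers: the checkpoint id and the vocabulary size below are placeholder assumptions, not values taken from this card.

```python
# Minimal sketch: load the pre-trained encoder with a fresh CTC head
# for fine-tuning. The checkpoint id and vocab_size are assumptions --
# replace them with this repository's id and your phoneme vocabulary size.
import torch
from transformers import UniSpeechForCTC, Wav2Vec2FeatureExtractor

checkpoint = "microsoft/unispeech-large-multi-lingual-1500h-cv"  # assumed id

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = UniSpeechForCTC.from_pretrained(checkpoint, vocab_size=42, pad_token_id=0)

# Sanity check with one second of dummy 16 kHz audio.
inputs = feature_extractor(torch.randn(16000).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, time, vocab_size)
```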
**Note**: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence of phonemes before fine-tuning.
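One way to do this is with the open-source `phonemizer` package; this is only an illustrative sketch, since the card does not specify the exact phonemization setup used during pre-training.

```python
# Illustrative sketch: convert transcripts to phoneme strings with the
# `phonemizer` package (espeak backend). The exact phonemizer used for
# pre-training is not specified in this card, so treat this as one option.
from phonemizer import phonemize

transcripts = ["hello world", "speech recognition"]

phoneme_transcripts = phonemize(
    transcripts,
    language="en-us",   # pick the language code matching each transcript
    backend="espeak",
    strip=True,
)
print(phoneme_transcripts)
```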
 
 