patrickvonplaten committed 1382b96 (parent: 24f5df8): Update README.md
@@ -23,7 +23,7 @@ with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)
 Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang
 
 **Abstract**
-In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach
+*In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.*
 
 The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech.
 
@@ -32,8 +32,6 @@ The original model can be found under https://github.com/microsoft/UniSpeech/tre
 This is a multi-lingually pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be
 used in inference. The model was pre-trained in English, Spanish, French, and Italian and should therefore perform well only in those or similar languages.
 
-See the following sections for more details on how to fine-tune the model.
-
 **Note**: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence
 of phonemes before fine-tuning.
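
The **Note** above says the label text must be converted to phonemes before fine-tuning. The sketch below is purely illustrative of the shape of that preprocessing step: the `TOY_LEXICON` and `to_phonemes` helper are hypothetical names invented here, and a real pipeline would use a grapheme-to-phoneme tool (for example the `phonemizer` package with an espeak backend) rather than a hand-written lexicon.

```python
# Illustrative sketch only. A real fine-tuning pipeline would run a proper
# grapheme-to-phoneme converter (e.g. the `phonemizer` package); this toy
# lexicon exists just to show that transcripts are mapped from characters
# to a space-separated phoneme sequence before being used as CTC labels.
TOY_LEXICON = {
    "hello": "h ə l oʊ",
    "world": "w ɜː l d",
}

def to_phonemes(text: str) -> str:
    """Map each word of a transcript to its phoneme string via the toy lexicon."""
    return " ".join(TOY_LEXICON[word] for word in text.lower().split())

print(to_phonemes("Hello world"))  # → h ə l oʊ w ɜː l d
```

With the transcripts phonemized this way, the tokenizer's vocabulary is built over phoneme symbols instead of characters, matching what the model saw during pre-training.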