nguyenvulebinh committed
Commit 1b7c81a
1 parent: 1d0fb34

Update README.md

Files changed (1):
  1. README.md +10 -8
README.md CHANGED
@@ -22,31 +22,32 @@ widget:
 
 ### Model description
 
- [Our model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) was pre-trained on 13k hours of youtube (un-label data) and fine-tuned on 250 hours labeled of [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr) on 16kHz sampled speech audio.
+ [Our models](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled 16kHz speech audio from the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr).
 
 We use the wav2vec2 architecture for the pre-trained model. Following the wav2vec2 paper:
 
 >We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
 
- For fine-tuning phase, wav2Vec2 is fine-tuned using Connectionist Temporal Classification (CTC), which is an algorithm that is used to train neural networks for sequence-to-sequence problems and mainly in Automatic Speech Recognition and handwriting recognition.
+ For the fine-tuning phase, wav2vec2 is trained with Connectionist Temporal Classification (CTC), an algorithm used to train neural networks on sequence-to-sequence problems, mainly automatic speech recognition and handwriting recognition.
 
 | Model | #params | Pre-training data | Fine-tuning data |
 |---|---|---|---|
 | [base](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 13k hours | 250 hours |
 
- In a formal ASR system, two components are required: acoustic model and language model. Here ctc-wav2vec fine-tuned model working as an acoustic model. For the language model, we provide a [4-grams model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/blob/main/vi_lm_4grams.bin.zip) trained on 2GB of spoken text.
+ In a complete ASR system, two components are required: an acoustic model and a language model. Here, the CTC fine-tuned wav2vec model serves as the acoustic model. For the language model, we provide a [4-gram model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/blob/main/vi_lm_4grams.bin.zip) trained on 2GB of spoken text.
 
 
- ### Benchmark WER result (with 4-grams LM):
+ ### Benchmark WER results:
 
- | [VIVOS](https://ailab.hcmus.edu.vn/vivos) | [VLSP-T1](https://vlsp.org.vn/vlsp2020/eval/asr) | [VLSP-T2](https://vlsp.org.vn/vlsp2020/eval/asr) |
- |---|---|---|
- | 6.1 | 9.1 | 40.8 |
+ | | [VIVOS](https://ailab.hcmus.edu.vn/vivos) | [VLSP-T1](https://vlsp.org.vn/vlsp2020/eval/asr) | [VLSP-T2](https://vlsp.org.vn/vlsp2020/eval/asr) |
+ |---|---|---|---|
+ | without LM | 10.77 | 13.33 | 51.45 |
+ | with 4-gram LM | 6.15 | 9.11 | 40.81 |
 
 
 ### Example usage
 
- When using the model make sure that your speech input is also sampled at 16Khz. Following Colab link below to use a combination of CTC-wav2vec and 4-grams LM.
+ When using the model, make sure that your speech input is sampled at 16kHz and that each audio segment is shorter than 10 seconds. Follow the Colab link below to use the combination of the CTC wav2vec model and the 4-gram LM.
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pVBY46gSoWer2vDf0XmZ6uNV3d8lrMxx?usp=sharing)
 
@@ -82,6 +83,7 @@ logits = model(input_values).logits
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
 ```
+
 # License
 
 This model is released under the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is freely available for academic purposes or individual research, but restricted for commercial use.
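To make the fine-tuning paragraph in the diff concrete: CTC lets the network map a long sequence of acoustic frames to a much shorter character transcript without any frame-level alignment. The following is a minimal, self-contained PyTorch sketch of that objective, not the authors' training code; the vocabulary size, sequence lengths, and blank index are assumed purely for illustration.

```python
# Illustrative CTC objective: unaligned frame-level log-probabilities vs. a character transcript.
# All sizes below (vocabulary, frame count, transcript length) are assumptions for this sketch.
import torch
import torch.nn as nn

vocab_size = 110                                    # assumed character vocabulary (index 0 = CTC blank)
time_steps = 200                                    # acoustic frames emitted by wav2vec2 for one utterance
transcript = torch.randint(1, vocab_size, (1, 25))  # 25 character ids for the reference text

# Frame-level log-probabilities from the CTC head: shape (time, batch, vocab)
log_probs = torch.randn(time_steps, 1, vocab_size).log_softmax(dim=-1)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(
    log_probs,
    transcript,
    input_lengths=torch.tensor([time_steps]),
    target_lengths=torch.tensor([transcript.shape[1]]),
)
print(loss.item())
```

The loss marginalizes over every alignment between the 200 frames and the 25 characters, which is why no manual segmentation of the audio is needed during fine-tuning.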
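The benchmark table reports results with and without the 4-gram language model, so the LM is applied at decoding time on top of the acoustic model's CTC outputs. Below is a minimal sketch of that combination using the `pyctcdecode` beam-search decoder; the decoder choice and the unzipped file name `vi_lm_4grams.bin` are assumptions (the linked Colab notebook may wire the LM differently).

```python
# Hedged sketch: beam-search decoding of CTC logits with the released 4-gram KenLM model.
# pyctcdecode and the file name vi_lm_4grams.bin are assumptions, not taken from the README.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

model_id = "nguyenvulebinh/wav2vec2-base-vietnamese-250h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Vocabulary ordered by token id so the decoder's columns line up with the model's logit columns
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm_4grams.bin")

def transcribe_with_lm(speech, sampling_rate=16_000):
    """speech: 1-D float array of 16 kHz audio samples."""
    inputs = processor(speech, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0].cpu().numpy()  # (time, vocab)
    return decoder.decode(logits)
```

Beam search with the LM is what separates the "without LM" and "with 4-gram LM" rows of the benchmark table.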
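The usage note asks for 16kHz input shorter than 10 seconds. The sketch below wraps the README's greedy-decoding snippet with that preprocessing, assuming `librosa` for loading and resampling and a naive fixed-length chunking strategy; both are illustrative choices, not the repository's own loader.

```python
# Hedged sketch: resample to 16 kHz, cut into <10 s segments, then greedy CTC decoding (no LM).
# librosa and the fixed-length chunking are assumptions made for this example.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "nguyenvulebinh/wav2vec2-base-vietnamese-250h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

SAMPLE_RATE = 16_000
MAX_SAMPLES = SAMPLE_RATE * 10   # keep each segment under the 10 s limit

def transcribe(path):
    speech, _ = librosa.load(path, sr=SAMPLE_RATE)    # resamples to 16 kHz regardless of the file's rate
    texts = []
    for start in range(0, len(speech), MAX_SAMPLES):  # naive fixed-length chunking
        segment = speech[start:start + MAX_SAMPLES]
        inputs = processor(segment, sampling_rate=SAMPLE_RATE, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding, same as the README snippet
        texts.append(processor.batch_decode(predicted_ids)[0])
    return " ".join(texts)

print(transcribe("example.wav"))
```

Fixed-length chunking can cut through words at segment boundaries; splitting on silences is a better fit for long recordings.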