dragonSwing committed
Commit 4ec0637
Parent: 6818035

Update README and model file

Files changed (7)
  1. 4gram.zip +2 -2
  2. README.md +38 -21
  3. example.mp3 +0 -0
  4. example.wav +0 -0
  5. example2.mp3 +0 -0
  6. hyperparams.yaml +1 -1
  7. model.ckpt +1 -1
4gram.zip CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b44b6f17af4baf24dcd93f2d411d1664e32b1c3cdd2a1f07458c53fd02e6f487
-size 2773083070
+oid sha256:e6e5c67796f2399c116073286a0870f141b4ddf1b6a75723c139c77d21114d55
+size 2481196955
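The 4-gram language model in `4gram.zip` is what drives the "with 4-grams LM" rows in the README benchmark below, but the commit does not show how it is wired up. A minimal sketch of one common approach, shallow fusion with [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode); the archive member name `4gram.arpa` and the label list are assumptions, not taken from this commit:

```python
# Hedged sketch only: pyctcdecode shallow fusion with a KenLM n-gram.
# The labels must match the model's CTC vocabulary, with "" as the blank;
# the placeholder values here are illustrative.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = ["", "a", "b", "c"]  # placeholder vocabulary
decoder = build_ctcdecoder(labels, kenlm_model_path="4gram.arpa")  # hypothetical path

log_probs = np.log(np.full((10, len(labels)), 1.0 / len(labels)))  # dummy CTC frames
print(decoder.decode(log_probs))
```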
README.md CHANGED
@@ -13,12 +13,10 @@ tags:
 - Transformer
 license: cc-by-nc-4.0
 widget:
-- example_title: VLSP ASR 2020 test T1
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
-- example_title: VLSP ASR 2020 test T1
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
-- example_title: VLSP ASR 2020 test T2
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
+- example_title: Example 1
+  src: https://huggingface.co/dragonSwing/wav2vec2-base-vn-270h/raw/main/example.mp3
+- example_title: Example 2
+  src: https://huggingface.co/dragonSwing/wav2vec2-base-vn-270h/raw/main/example2.mp3
 model-index:
 - name: Wav2vec2 Base Vietnamese 270h
   results:
@@ -33,6 +31,28 @@ model-index:
     - name: Test WER
       type: wer
       value: 9.66
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 7.0
+      type: mozilla-foundation/common_voice_7_0
+      args: vi
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 5.57
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 8.0
+      type: mozilla-foundation/common_voice_8_0
+      args: vi
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 5.76
   - task:
       name: Speech Recognition
       type: automatic-speech-recognition
@@ -43,7 +63,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 4.04
+      value: 3.70
 ---
 # Wav2Vec2-Base-Vietnamese-270h
 Fine-tuned Wav2Vec2 model on Vietnamese Speech Recognition task using about 270h labelled data combined from multiple datasets including [Common Voice](https://huggingface.co/datasets/common_voice), [VIVOS](https://huggingface.co/datasets/vivos), [VLSP2020](https://vlsp.org.vn/vlsp2020/eval/asr). The model was fine-tuned using SpeechBrain toolkit with a custom tokenizer. For a better experience, we encourage you to learn more about [SpeechBrain](https://speechbrain.github.io/).
@@ -51,19 +71,15 @@ When using this model, make sure that your speech input is sampled at 16kHz.
 Please refer to [huggingface blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) or [speechbrain](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) on how to fine-tune Wav2Vec2 model on a specific language.
 
 ### Benchmark WER result:
-| | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE VI](https://huggingface.co/datasets/common_voice) |
-|---|---|---|
-|without LM| 8.41 | 17.82 |
-|with 4-grams LM| 4.04 | 9.66 |
+| | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE 7.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0) | [COMMON VOICE 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) |
+|---|---|---|---|
+|without LM| 8.23 | 12.15 | 12.15 |
+|with 4-grams LM| 3.70 | 5.57 | 5.76 |
 
 The language model was trained using [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) dataset on about 32GB of crawled text.
 
 ### Install SpeechBrain
-To use this model, you should install speechbrain from source. This is not required for speechbrain version > 0.5.10
-
-```bash
-pip install git+https://github.com/speechbrain/speechbrain.git@develop
-```
+To use this model, you should install speechbrain > 0.5.10
 
 ### Usage
 The model can be used directly (without a language model) as follows:
@@ -71,14 +87,15 @@ The model can be used directly (without a language model) as follows:
 from speechbrain.pretrained import EncoderASR
 
 model = EncoderASR.from_hparams(source="dragonSwing/wav2vec2-base-vn-270h", savedir="pretrained_models/asr-wav2vec2-vi")
-model.transcribe_file('dragonSwing/wav2vec2-base-vn-270h/example.wav')
+model.transcribe_file('dragonSwing/wav2vec2-base-vn-270h/example.mp3')
+# Output: được hồ chí minh coi là một động lực lớn của sự phát triển đất nước
 ```
 
 ### Inference on GPU
 To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
 
 ### Evaluation
-The model can be evaluated as follows on the Vietnamese test data of Common Voice.
+The model can be evaluated as follows on the Vietnamese test data of Common Voice 8.0.
 ```python
 import torch
 import torchaudio
@@ -86,7 +103,7 @@ from datasets import load_dataset, load_metric, Audio
 from transformers import Wav2Vec2FeatureExtractor
 from speechbrain.pretrained import EncoderASR
 import re
-test_dataset = load_dataset("common_voice", "vi", split="test")
+test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token=True)
 test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16_000))
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 wer = load_metric("wer")
@@ -116,10 +133,10 @@ def evaluate(batch):
     batch["pred_strings"] = pred_str
 
     return batch
-result = test_dataset.map(evaluate, batched=True, batch_size=4)
+result = test_dataset.map(evaluate, batched=True, batch_size=1)
 print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["target_text"])))
 ```
-**Test Result**: 17.817680%
+**Test Result**: 12.155553%
 
 #### Citation
 ```
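Putting the README's "Inference on GPU" tip together with its Usage block, a minimal sketch; the repo id and paths are taken from the diff above, and speechbrain > 0.5.10 is assumed per the Install section:

```python
# Sketch combining the Usage block with the documented run_opts GPU option.
from speechbrain.pretrained import EncoderASR

model = EncoderASR.from_hparams(
    source="dragonSwing/wav2vec2-base-vn-270h",
    savedir="pretrained_models/asr-wav2vec2-vi",
    run_opts={"device": "cuda"},  # omit to run on CPU
)
print(model.transcribe_file("dragonSwing/wav2vec2-base-vn-270h/example.mp3"))
```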
example.mp3 ADDED
Binary file (11.8 kB)
example.wav DELETED
Binary file (49.6 kB)
example2.mp3 ADDED
Binary file (10.5 kB)
hyperparams.yaml CHANGED
@@ -1,5 +1,5 @@
 # ################################
-# Model: wav2vec2 + DNN + CTC/Attention
+# Model: wav2vec2 + CTC
 # Augmentation: SpecAugment
 # Authors: Le Do Thanh Binh 2021
 # ################################
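The comment change above narrows the architecture description to plain CTC. For readers unfamiliar with what that implies at inference time, a minimal illustration of greedy CTC decoding (argmax per frame, collapse repeats, drop blanks); the token ids are illustrative only:

```python
# Greedy CTC decoding: collapse runs of identical ids, then remove the blank.
from itertools import groupby

def ctc_greedy_decode(frame_ids, blank_id=0):
    return [i for i, _ in groupby(frame_ids) if i != blank_id]

print(ctc_greedy_decode([0, 7, 7, 0, 0, 3, 3, 3, 0]))  # -> [7, 3]
```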
model.ckpt CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e315a64b704fff992630eccd824c2780ec79c346b2c64518ee9b7845af03a65c
+oid sha256:8f28211bbcf163899adc748d90c1b40b481a6c785b1e71785f90e7e2a95c8e78
 size 379749523
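Both `4gram.zip` and `model.ckpt` are stored as Git LFS pointers, so the diffs above only swap the SHA-256 oid (the checkpoint size is unchanged). A minimal sketch for verifying a downloaded checkpoint against the new pointer; the local path is an assumption:

```python
# Verify a downloaded artifact against the LFS pointer's size and SHA-256 oid.
import hashlib
from pathlib import Path

path = Path("model.ckpt")  # hypothetical local download location
h = hashlib.sha256()
with path.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)

assert path.stat().st_size == 379749523
assert h.hexdigest() == "8f28211bbcf163899adc748d90c1b40b481a6c785b1e71785f90e7e2a95c8e78"
```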