wannaphong committed on
Commit
6d6c57f
1 Parent(s): 7196b1b

Update README.md

Files changed (1)
  1. README.md +48 -4
README.md CHANGED
@@ -11,22 +11,66 @@ metrics:
  - cer
  ---

- # Thai CommonVoice V8 (newmm tokenizer)

  This model was trained on the Common Voice V8 dataset, which extends the Common Voice V7 data used by [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th). It was fine-tuned from [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).

  ## Datasets

  The dataset adds the new data from Common Voice V8 to the Common Voice V7 dataset. That is, all Common Voice V7 data is removed from Common Voice V8 before splitting, and the Common Voice V7 dataset is then added back.

  It uses the [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) script to split the Common Voice dataset.

  ## Models

- This model was fine-tuned from the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model on the Thai Common Voice V8 dataset, with text pre-tokenized using pythainlp.tokenize.word_tokenize.

  **Links:**
  - GitHub Dataset: [https://github.com/wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset)
-
- [WIP]

  - cer
  ---

+ # Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer)

  This model was trained on the Common Voice V8 dataset, which extends the Common Voice V7 data used by [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th). It was fine-tuned from [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).

+ GitHub: [https://github.com/wannaphong/thai-wav2vec2-cv-v8](https://github.com/wannaphong/thai-wav2vec2-cv-v8)
+
  ## Datasets

  The dataset adds the new data from Common Voice V8 to the Common Voice V7 dataset. That is, all Common Voice V7 data is removed from Common Voice V8 before splitting, and the Common Voice V7 dataset is then added back.

  It uses the [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) script to split the Common Voice dataset.

+ You can read more at [wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset).
+
  ## Models

+ This model was fine-tuned from the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model on the Thai Common Voice V8 dataset, with text pre-tokenized using `pythainlp.tokenize.word_tokenize`.
+
+ ## Training
+
+ Much of the code comes from [vistec-AI/wav2vec2-large-xlsr-53-th](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th), and I fixed a bug in its training code in [vistec-AI/wav2vec2-large-xlsr-53-th#2](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th/pull/2).
+
+ ## Evaluation
+
+ **Test on the Common Voice V8 test set**
+
+ | Model | WER by newmm (%) | WER by deepcut (%) | CER (%) | URL |
+ |-----------------------|------------------|--------------------|----------|-------------------------------------------------------------|
+ | wav2vec2 with deepcut | 16.354521 | 11.424476 | 3.684060 | https://github.com/wannaphong/th-cv-v8-wav2vev2-deepcut |
+ | wav2vec2 with newmm | 16.698299 | 11.436941 | 3.737407 | https://github.com/wannaphong/thai-wav2vec2-cv-v8 |
+ | CV v7 | 17.414503 | 11.923089 | 3.854153 | https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th |

+ **Test on the Common Voice V7 test set (the same test set used by the CV V7 model)**
+
+ | Model | WER by newmm (%) | WER by deepcut (%) | CER (%) | URL |
+ |-----------------------|------------------|--------------------|----------|-------------------------------------------------------------|
+ | wav2vec2 with deepcut | 12.776381 | 8.773006 | 2.628882 | https://github.com/wannaphong/th-cv-v8-wav2vev2-deepcut |
+ | wav2vec2 with newmm | 12.750596 | 8.672616 | 2.623341 | https://github.com/wannaphong/thai-wav2vec2-cv-v8 |
+ | CV v7 | 13.936698 | 2.804787 | 2.804787 | https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th |
+
+
+ This uses the same test set as [https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th).
+
+ Benchmark source code: https://github.com/wannaphong/thai-asr-benchmark/tree/main/commonvoice
+
+
+ ## Files
+ - `0-download-unzip.ipynb` - notebook for downloading and unzipping Common Voice V8
+ - `1-convert-mp3-wav.ipynb` - notebook for converting mp3 files to wav files
+ - `1-preprocessing-thai-cv-v8-wav2vev2.ipynb` - notebook for preprocessing Common Voice V8 (old file)
+ - `2-gen-val-json.py` - Python file for generating the manifest for NVIDIA NeMo ASR
+ - `2-preprocessing-thai-cv-v8-wav2vev2.ipynb` - notebook for preprocessing Common Voice V8
+ - `4-gen-manifest.ipynb` - notebook for generating the manifest for NVIDIA NeMo ASR
+ - `build-lm.ipynb` - notebook for building the ASR language model (LM)
+ - `test-ai4thai.ipynb` - notebook for testing AI For Thai.
+ - `test-google.ipynb` - notebook for testing Google ASR.
+ - `test-v7.ipynb` - notebook for testing the [vistec-AI/wav2vec2-large-xlsr-53-th](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th) model.
+ - `test-wav2vec2-lm.ipynb` - notebook for testing our model with the LM.
+ - `test-wav2vec2.ipynb` - notebook for testing our model without the LM.
+ - `train-wav2vec2.py` - Python file for training the model.

  **Links:**
  - GitHub Dataset: [https://github.com/wannaphong/thai_commonvoice_dataset](https://github.com/wannaphong/thai_commonvoice_dataset)
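
For readers who want to sanity-check numbers like the ones in the Evaluation tables, here is a minimal sketch of how WER and CER are commonly computed. It assumes the standard Levenshtein-based definition (total edits divided by total reference length, in percent); the benchmark code linked in the README may differ in details. `edit_distance` and `error_rate` are hypothetical helper names, not functions from that repository, and the sentences are hand-tokenized here, so no tokenizer such as newmm or deepcut is actually called.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from the reference
                curr[j - 1] + 1,     # insert an item from the hypothesis
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[Sequence], hyps: Sequence[Sequence]) -> float:
    """Corpus-level error rate in percent: total edits / total reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of items")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("reference is empty")
    return 100.0 * total_edits / total_len


if __name__ == "__main__":
    # Hand-tokenized example (assumption: a real run would tokenize both
    # sides with the same tokenizer, e.g. newmm or deepcut).
    ref_tokens = [["ฉัน", "กิน", "ข้าว"]]
    hyp_tokens = [["ฉัน", "กิน", "ขาว"]]
    print("WER (%):", error_rate(ref_tokens, hyp_tokens))

    # CER compares the raw strings character by character.
    ref_text = ["".join(ref_tokens[0])]
    hyp_text = ["".join(hyp_tokens[0])]
    print("CER (%):", error_rate(ref_text, hyp_text))
```

Passing lists of tokens gives a word error rate and passing plain strings gives a character error rate. That is why the same transcripts can score differently in the "WER by newmm" and "WER by deepcut" columns: the two tokenizers split the text into different words.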