jonatasgrosman committed on
Commit
ea576d0
1 Parent(s): 30c5623

update README

Files changed (1)
  1. README.md +16 -5
README.md CHANGED
@@ -49,7 +49,7 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

  LANG_ID = "fr"
  MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
- SAMPLES = 5
+ SAMPLES = 10

  test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

@@ -86,6 +86,11 @@ for i, predicted_sentence in enumerate(predicted_sentences):
  | "J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES." | JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGES SUR LES AUTRES |
  | LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS. | LE PAYS-BAS AN REMPORTAIT TOUTES LES ÉDITIONS |
  | IL Y A MAINTENANT UNE GARE ROUTIÈRE. | IL A MA ANDIN GARD DETIRON |
+ | HUIT | HUIT |
+ | DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION | DANS L'ATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE D'UNE VIVE ÉMOTION |
+ | LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES. | LA PREMIÈRE SAISON EST COMPOSÉE DE DOUX ÉPISODES |
+ | ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES. | ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES |
+ | ZÉRO | ZÉRO |

  ## Evaluation

@@ -102,9 +107,11 @@ LANG_ID = "fr"
  MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
  DEVICE = "cuda"

- CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
+ CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
  "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
- "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。"]
+ "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
+ "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
+ "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

  test_dataset = load_dataset("common_voice", LANG_ID, split="test")

@@ -152,11 +159,15 @@ print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_

  **Test Result**:

- In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-21). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.
+ In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-05-16). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.

  | Model | WER | CER |
  | ------------- | ------------- | ------------- |
  | jonatasgrosman/wav2vec2-large-xlsr-53-french | **16.86%** | **5.65%** |
- | Ilyes/wav2vec2-large-xlsr-53-french | 24.76% | 8.06% |
+ | Ilyes/wav2vec2-large-xlsr-53-french | 19.67% | 6.70% |
+ | jonatasgrosman/wav2vec2-large-fr-voxpopuli-french | 19.80% | 6.89% |
+ | Nhut/wav2vec2-large-xlsr-french | 24.09% | 8.42% |
  | facebook/wav2vec2-large-xlsr-53-french | 25.45% | 10.35% |
  | MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-French | 28.22% | 9.70% |
+ | Ilyes/wav2vec2-large-xlsr-53-french_punctuation | 29.80% | 11.79% |
+ | facebook/wav2vec2-base-10k-voxpopuli-ft-fr | 61.06% | 33.31% |
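
A small worked example may help when reading the `CHARS_TO_IGNORE` change together with the WER/CER table. The following is a minimal, standalone sketch and not the README's evaluation script, which calls `cer.compute(...)` from a metrics library as the last hunk header shows. It assumes the ignore list is turned into a regex character class and stripped from the reference sentence before scoring. The helper names `normalize`, `edit_distance`, `wer`, and `cer` are hypothetical, and the list below is only a short subset of the one in the diff.

```python
import re

# Illustrative subset of the CHARS_TO_IGNORE list from the diff (assumption:
# the full list would be used the same way).
CHARS_TO_IGNORE = [",", "?", ".", "!", ";", ":", '"', "«", "»",
                   "(", ")", "[", "]", "{", "}", "…", "/", "\\", "^"]

# re.escape keeps characters such as "]", "\\" and "^" literal inside the class.
CHARS_TO_IGNORE_REGEX = re.compile("[" + re.escape("".join(CHARS_TO_IGNORE)) + "]")


def normalize(sentence: str) -> str:
    """Remove ignored characters, collapse whitespace, and uppercase."""
    stripped = CHARS_TO_IGNORE_REGEX.sub("", sentence)
    return " ".join(stripped.split()).upper()


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, prediction: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def cer(reference: str, prediction: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, prediction) / len(reference)


# One row from the sample table: DOUZE was transcribed as DOUX.
reference = normalize("La première saison est composée de douze épisodes.")
prediction = "LA PREMIÈRE SAISON EST COMPOSÉE DE DOUX ÉPISODES"
print(f"WER: {wer(reference, prediction):.2%}")  # one wrong word out of eight: 12.50%
print(f"CER: {cer(reference, prediction):.2%}")
```

This scores a single sentence. The figures in the table come from running the evaluation over the full `test` split, so they are not comparable to per-sentence values like these.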