bofenghuang committed on
Commit
fe8883a
1 Parent(s): 9f3e1fa

updt README.md

Files changed (1)
  1. README.md +67 -104
README.md CHANGED
@@ -60,110 +60,73 @@ model-index:
  # Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
 
  This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.1430
- - Wer: 0.1245
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 16
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 128
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 10.0
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer |
- |:-------------:|:-----:|:-----:|:---------------:|:------:|
- | 0.9229 | 0.14 | 500 | 0.5049 | 0.4008 |
- | 0.3823 | 0.28 | 1000 | 0.2831 | 0.2297 |
- | 0.3079 | 0.42 | 1500 | 0.2385 | 0.1951 |
- | 0.2899 | 0.55 | 2000 | 0.2273 | 0.1978 |
- | 0.2795 | 0.69 | 2500 | 0.2329 | 0.1983 |
- | 0.2863 | 0.83 | 3000 | 0.2289 | 0.1991 |
- | 0.3063 | 0.97 | 3500 | 0.2370 | 0.2046 |
- | 0.2766 | 1.11 | 4000 | 0.2322 | 0.2021 |
- | 0.2749 | 1.25 | 4500 | 0.2332 | 0.2055 |
- | 0.2769 | 1.39 | 5000 | 0.2322 | 0.2035 |
- | 0.2628 | 1.53 | 5500 | 0.2242 | 0.1948 |
- | 0.2614 | 1.66 | 6000 | 0.2303 | 0.1962 |
- | 0.2547 | 1.8 | 6500 | 0.2238 | 0.1920 |
- | 0.2458 | 1.94 | 7000 | 0.2186 | 0.1894 |
- | 0.231 | 2.08 | 7500 | 0.2169 | 0.1895 |
- | 0.2309 | 2.22 | 8000 | 0.2131 | 0.1870 |
- | 0.2258 | 2.36 | 8500 | 0.2133 | 0.1818 |
- | 0.2278 | 2.5 | 9000 | 0.2176 | 0.1878 |
- | 0.2263 | 2.63 | 9500 | 0.2030 | 0.1813 |
- | 0.2262 | 2.77 | 10000 | 0.2077 | 0.1824 |
- | 0.2228 | 2.91 | 10500 | 0.2115 | 0.1840 |
- | 0.2118 | 3.05 | 11000 | 0.2093 | 0.1782 |
- | 0.2073 | 3.19 | 11500 | 0.2004 | 0.1756 |
- | 0.2015 | 3.33 | 12000 | 0.1988 | 0.1748 |
- | 0.214 | 3.47 | 12500 | 0.2088 | 0.1816 |
- | 0.2075 | 3.61 | 13000 | 0.1976 | 0.1746 |
- | 0.2039 | 3.74 | 13500 | 0.1958 | 0.1744 |
- | 0.2003 | 3.88 | 14000 | 0.1931 | 0.1693 |
- | 0.1886 | 4.02 | 14500 | 0.1964 | 0.1686 |
- | 0.1943 | 4.16 | 15000 | 0.1986 | 0.1746 |
- | 0.1919 | 4.3 | 15500 | 0.1957 | 0.1700 |
- | 0.1857 | 4.44 | 16000 | 0.1907 | 0.1671 |
- | 0.1834 | 4.58 | 16500 | 0.1877 | 0.1641 |
- | 0.18 | 4.71 | 17000 | 0.1828 | 0.1600 |
- | 0.1774 | 4.85 | 17500 | 0.1863 | 0.1605 |
- | 0.1755 | 4.99 | 18000 | 0.1833 | 0.1595 |
- | 0.1692 | 5.13 | 18500 | 0.1814 | 0.1569 |
- | 0.1674 | 5.27 | 19000 | 0.1819 | 0.1566 |
- | 0.1664 | 5.41 | 19500 | 0.1805 | 0.1572 |
- | 0.1677 | 5.55 | 20000 | 0.1803 | 0.1560 |
- | 0.1637 | 5.68 | 20500 | 0.1750 | 0.1525 |
- | 0.1628 | 5.82 | 21000 | 0.1774 | 0.1532 |
- | 0.1645 | 5.96 | 21500 | 0.1744 | 0.1527 |
- | 0.1551 | 6.1 | 22000 | 0.1778 | 0.1543 |
- | 0.1505 | 6.24 | 22500 | 0.1754 | 0.1528 |
- | 0.1499 | 6.38 | 23000 | 0.1743 | 0.1500 |
- | 0.1491 | 6.52 | 23500 | 0.1684 | 0.1473 |
- | 0.1477 | 6.66 | 24000 | 0.1661 | 0.1472 |
- | 0.1456 | 6.79 | 24500 | 0.1654 | 0.1440 |
- | 0.1415 | 6.93 | 25000 | 0.1654 | 0.1448 |
- | 0.136 | 7.07 | 25500 | 0.1616 | 0.1407 |
- | 0.132 | 7.21 | 26000 | 0.1625 | 0.1410 |
- | 0.1323 | 7.35 | 26500 | 0.1604 | 0.1404 |
- | 0.1338 | 7.49 | 27000 | 0.1574 | 0.1386 |
- | 0.13 | 7.63 | 27500 | 0.1576 | 0.1384 |
- | 0.1291 | 7.76 | 28000 | 0.1551 | 0.1366 |
- | 0.1277 | 7.9 | 28500 | 0.1542 | 0.1356 |
- | 0.1241 | 8.04 | 29000 | 0.1545 | 0.1350 |
- | 0.1198 | 8.18 | 29500 | 0.1536 | 0.1322 |
- | 0.1204 | 8.32 | 30000 | 0.1547 | 0.1337 |
- | 0.1195 | 8.46 | 30500 | 0.1494 | 0.1309 |
- | 0.1169 | 8.6 | 31000 | 0.1490 | 0.1300 |
- | 0.1159 | 8.74 | 31500 | 0.1485 | 0.1305 |
- | 0.1142 | 8.87 | 32000 | 0.1479 | 0.1292 |
- | 0.1087 | 9.01 | 32500 | 0.1471 | 0.1284 |
- | 0.1076 | 9.15 | 33000 | 0.1467 | 0.1270 |
- | 0.1078 | 9.29 | 33500 | 0.1467 | 0.1270 |
- | 0.1073 | 9.43 | 34000 | 0.1447 | 0.1256 |
- | 0.108 | 9.57 | 34500 | 0.1447 | 0.1257 |
- | 0.106 | 9.71 | 35000 | 0.1438 | 0.1255 |
- | 0.1052 | 9.84 | 35500 | 0.1428 | 0.1247 |
- | 0.1044 | 9.98 | 36000 | 0.1430 | 0.1245 |
-
- ### Framework versions
-
- - Transformers 4.22.0.dev0
- - Pytorch 1.12.0+cu113
- - Datasets 2.4.0
- - Tokenizers 0.12.1
 
 
  ## Evaluation
 
  # Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
 
  This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.
+
+
+ ## Usage
+
+ 1. To use on a local audio file without the language model
+
+ ```python
+ import torch
+ import torchaudio
+
+ from transformers import AutoModelForCTC, Wav2Vec2Processor
+
+ processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
+ model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
+
+ # path to your audio file
+ wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
+ waveform, sample_rate = torchaudio.load(wav_path)
+ waveform = waveform.squeeze(axis=0) # mono
+
+ # resample
+ if sample_rate != 16_000:
+     resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
+     waveform = resampler(waveform)
+
+ # normalize
+ input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")
+
+ with torch.inference_mode():
+     logits = model(input_dict.input_values.to("cuda")).logits
+
+ # decode
+ predicted_ids = torch.argmax(logits, dim=-1)
+ predicted_sentence = processor.batch_decode(predicted_ids)[0]
+ ```
+
+ 2. To use on a local audio file with the language model
+
+ ```python
+ import torch
+ import torchaudio
+
+ from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
+
+ processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
+ model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
+
+ model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate
+
+ # path to your audio file
+ wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
+ waveform, sample_rate = torchaudio.load(wav_path)
+ waveform = waveform.squeeze(axis=0) # mono
+
+ # resample
+ if sample_rate != 16_000:
+     resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
+     waveform = resampler(waveform)
+
+ # normalize
+ input_dict = processor_with_lm(waveform, sampling_rate=16_000, return_tensors="pt")
+
+ with torch.inference_mode():
+     logits = model(input_dict.input_values.to("cuda")).logits
+
+ predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
+ ```
 
 
  ## Evaluation
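
Note on the `# decode` step in usage example 1 of the diff above: taking `torch.argmax` over the logits and passing the ids to `batch_decode` amounts to greedy CTC decoding. The sketch below illustrates that idea in plain Python so it can be run without downloading the model. It is not the library's implementation; the function name `ctc_greedy_decode`, the toy vocabulary, and the choice of id 0 as the blank token are all assumptions made for illustration (in Wav2Vec2-style tokenizers the padding token usually plays the role of the CTC blank).

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists (one score per vocab entry).
    vocab: list mapping token id -> character (hypothetical toy vocabulary).
    blank_id: id of the CTC blank token (assumed to be 0 here).
    """
    # Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Merge consecutive duplicates: the same id on adjacent frames is one emission.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # Blanks only separate emissions, so they are removed last.
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)


# Toy example, not the real model vocabulary.
vocab = ["<pad>", "o", "u", "i"]
frame_scores = [
    [0.1, 0.7, 0.1, 0.1],    # "o"
    [0.1, 0.6, 0.2, 0.1],    # "o" again, merged with the previous frame
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # "u"
    [0.1, 0.1, 0.1, 0.7],    # "i"
]
decoded = ctc_greedy_decode(frame_scores, vocab)
print(decoded)
```

Usage example 2 replaces this greedy step with beam search scored by a language model, which is why it passes the raw logits (as a NumPy array) to `processor_with_lm.batch_decode` instead of argmax ids.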