bofenghuang committed
Commit 2f4ae8e
1 Parent(s): 9ae2cb9

updt README.md

Files changed (1)
  1. README.md +68 -45
README.md CHANGED
@@ -79,55 +79,78 @@ model-index:
  # Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French

  This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the POLINAETERNA/VOXPOPULI - FR dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.2906
- - Wer: 0.1093
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 16
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 128
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 12.0
- - mixed_precision_training: Native AMP
-
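The hyperparameters listed above map closely onto the `transformers` Trainer API. As a minimal, hedged sketch (not the actual training script from this repository; `output_dir` is a placeholder and the model/data wiring is omitted), they could be expressed as:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration listed above; not the original script.
training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-1b-voxpopuli-fr",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,  # 16 * 8 = 128 effective train batch size
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=12.0,
    fp16=True,  # "Native AMP" mixed-precision training
)
```

The Adam betas and epsilon listed above are the default optimizer settings used by `Trainer`, so they need no explicit argument here.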
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer    |
- |:-------------:|:-----:|:----:|:---------------:|:------:|
- | 0.4628        | 0.93  | 500  | 0.3834          | 0.1625 |
- | 0.3577        | 1.85  | 1000 | 0.3231          | 0.1367 |
- | 0.3103        | 2.78  | 1500 | 0.2918          | 0.1287 |
- | 0.2884        | 3.7   | 2000 | 0.2845          | 0.1227 |
- | 0.2615        | 4.63  | 2500 | 0.2819          | 0.1189 |
- | 0.242         | 5.56  | 3000 | 0.2915          | 0.1165 |
- | 0.2268        | 6.48  | 3500 | 0.2768          | 0.1187 |
- | 0.2188        | 7.41  | 4000 | 0.2719          | 0.1128 |
- | 0.1979        | 8.33  | 4500 | 0.2741          | 0.1134 |
- | 0.1834        | 9.26  | 5000 | 0.2827          | 0.1096 |
- | 0.1719        | 10.19 | 5500 | 0.2906          | 0.1093 |
- | 0.1723        | 11.11 | 6000 | 0.2868          | 0.1104 |
-
- ### Framework versions
-
- - Transformers 4.23.0.dev0
- - Pytorch 1.12.0+cu113
- - Datasets 2.4.0
- - Tokenizers 0.12.1
+
+
+ ## Usage
+
+ 1. To use on a local audio file without the language model
+
+ ```python
+ import torch
+ import torchaudio
+
+ from transformers import AutoModelForCTC, Wav2Vec2Processor
+
+ processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-voxpopuli-fr")
+ model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-voxpopuli-fr").cuda()
+
+ # path to your audio file
+ wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
+ waveform, sample_rate = torchaudio.load(wav_path)
+ waveform = waveform.squeeze(0)  # mono
+
+ # resample if needed
+ if sample_rate != 16_000:
+     resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
+     waveform = resampler(waveform)
+
+ # normalize and convert to model inputs
+ input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")
+
+ with torch.inference_mode():
+     logits = model(input_dict.input_values.to("cuda")).logits
+
+ # greedy CTC decoding
+ predicted_ids = torch.argmax(logits, dim=-1)
+ predicted_sentence = processor.batch_decode(predicted_ids)[0]
+ ```
+
+ 2. To use on a local audio file with the language model
+
+ ```python
+ import torch
+ import torchaudio
+
+ from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
+
+ processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-voxpopuli-fr")
+ model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-voxpopuli-fr").cuda()
+
+ model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate
+
+ # path to your audio file
+ wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
+ waveform, sample_rate = torchaudio.load(wav_path)
+ waveform = waveform.squeeze(0)  # mono
+
+ # resample if needed
+ if sample_rate != model_sampling_rate:
+     resampler = torchaudio.transforms.Resample(sample_rate, model_sampling_rate)
+     waveform = resampler(waveform)
+
+ # normalize and convert to model inputs
+ input_dict = processor_with_lm(waveform, sampling_rate=model_sampling_rate, return_tensors="pt")
+
+ with torch.inference_mode():
+     logits = model(input_dict.input_values.to("cuda")).logits
+
+ # beam-search decoding with the language model
+ predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
+ ```
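Note: decoding with `Wav2Vec2ProcessorWithLM` additionally requires the `pyctcdecode` package (and `kenlm` for n-gram language models) to be installed.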


  ## Evaluation

- 1. To evaluate on `mozilla-foundation/common_voice_9_0`
+ 1. To evaluate on `polinaeterna/voxpopuli`

  ```bash
  python eval.py \