bofenghuang
committed on
Commit
•
fe8883a
1
Parent(s):
9f3e1fa
updt README.md
Browse files
README.md
CHANGED
@@ -60,110 +60,73 @@ model-index:
|
|
60 |
# Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
|
61 |
|
62 |
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.
|
63 |
-
|
64 |
-
|
65 |
-
|
66 |
-
|
67 |
-
|
68 |
-
|
69 |
-
|
70 |
-
|
71 |
-
|
72 |
-
|
73 |
-
|
74 |
-
|
75 |
-
|
76 |
-
|
77 |
-
|
78 |
-
|
79 |
-
|
80 |
-
|
81 |
-
|
82 |
-
|
83 |
-
|
84 |
-
|
85 |
-
|
86 |
-
|
87 |
-
|
88 |
-
|
89 |
-
|
90 |
-
|
91 |
-
|
92 |
-
|
93 |
-
|
94 |
-
|
95 |
-
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
|
102 |
-
|
103 |
-
|
104 |
-
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
|
109 |
-
|
110 |
-
|
111 |
-
|
112 |
-
|
113 |
-
|
114 |
-
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
|
122 |
-
|
123 |
-
|
124 |
-
|
125 |
-
|
126 |
-
|
127 |
-
|
128 |
-
|
129 |
-
|
130 |
-
| 0.1645 | 5.96 | 21500 | 0.1744 | 0.1527 |
|
131 |
-
| 0.1551 | 6.1 | 22000 | 0.1778 | 0.1543 |
|
132 |
-
| 0.1505 | 6.24 | 22500 | 0.1754 | 0.1528 |
|
133 |
-
| 0.1499 | 6.38 | 23000 | 0.1743 | 0.1500 |
|
134 |
-
| 0.1491 | 6.52 | 23500 | 0.1684 | 0.1473 |
|
135 |
-
| 0.1477 | 6.66 | 24000 | 0.1661 | 0.1472 |
|
136 |
-
| 0.1456 | 6.79 | 24500 | 0.1654 | 0.1440 |
|
137 |
-
| 0.1415 | 6.93 | 25000 | 0.1654 | 0.1448 |
|
138 |
-
| 0.136 | 7.07 | 25500 | 0.1616 | 0.1407 |
|
139 |
-
| 0.132 | 7.21 | 26000 | 0.1625 | 0.1410 |
|
140 |
-
| 0.1323 | 7.35 | 26500 | 0.1604 | 0.1404 |
|
141 |
-
| 0.1338 | 7.49 | 27000 | 0.1574 | 0.1386 |
|
142 |
-
| 0.13 | 7.63 | 27500 | 0.1576 | 0.1384 |
|
143 |
-
| 0.1291 | 7.76 | 28000 | 0.1551 | 0.1366 |
|
144 |
-
| 0.1277 | 7.9 | 28500 | 0.1542 | 0.1356 |
|
145 |
-
| 0.1241 | 8.04 | 29000 | 0.1545 | 0.1350 |
|
146 |
-
| 0.1198 | 8.18 | 29500 | 0.1536 | 0.1322 |
|
147 |
-
| 0.1204 | 8.32 | 30000 | 0.1547 | 0.1337 |
|
148 |
-
| 0.1195 | 8.46 | 30500 | 0.1494 | 0.1309 |
|
149 |
-
| 0.1169 | 8.6 | 31000 | 0.1490 | 0.1300 |
|
150 |
-
| 0.1159 | 8.74 | 31500 | 0.1485 | 0.1305 |
|
151 |
-
| 0.1142 | 8.87 | 32000 | 0.1479 | 0.1292 |
|
152 |
-
| 0.1087 | 9.01 | 32500 | 0.1471 | 0.1284 |
|
153 |
-
| 0.1076 | 9.15 | 33000 | 0.1467 | 0.1270 |
|
154 |
-
| 0.1078 | 9.29 | 33500 | 0.1467 | 0.1270 |
|
155 |
-
| 0.1073 | 9.43 | 34000 | 0.1447 | 0.1256 |
|
156 |
-
| 0.108 | 9.57 | 34500 | 0.1447 | 0.1257 |
|
157 |
-
| 0.106 | 9.71 | 35000 | 0.1438 | 0.1255 |
|
158 |
-
| 0.1052 | 9.84 | 35500 | 0.1428 | 0.1247 |
|
159 |
-
| 0.1044 | 9.98 | 36000 | 0.1430 | 0.1245 |
|
160 |
-
|
161 |
-
### Framework versions
|
162 |
-
|
163 |
-
- Transformers 4.22.0.dev0
|
164 |
-
- Pytorch 1.12.0+cu113
|
165 |
-
- Datasets 2.4.0
|
166 |
-
- Tokenizers 0.12.1
|
167 |
|
168 |
|
169 |
## Evaluation
|
|
|
60 |
# Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
|
61 |
|
62 |
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.
|
63 |
+
|
64 |
+
|
65 |
+
## Usage
|
66 |
+
|
67 |
+
1. To use on a local audio file without the language model
|
68 |
+
|
69 |
+
```python
|
70 |
+
import torch
|
71 |
+
import torchaudio
|
72 |
+
|
73 |
+
from transformers import AutoModelForCTC, Wav2Vec2Processor
|
74 |
+
|
75 |
+
processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
|
76 |
+
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
|
77 |
+
|
78 |
+
# path to your audio file
|
79 |
+
wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
|
80 |
+
waveform, sample_rate = torchaudio.load(wav_path)
|
81 |
+
waveform = waveform.squeeze(axis=0) # mono
|
82 |
+
|
83 |
+
# resample
|
84 |
+
if sample_rate != 16_000:
|
85 |
+
resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
|
86 |
+
waveform = resampler(waveform)
|
87 |
+
|
88 |
+
# normalize
|
89 |
+
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")
|
90 |
+
|
91 |
+
with torch.inference_mode():
|
92 |
+
logits = model(input_dict.input_values.to("cuda")).logits
|
93 |
+
|
94 |
+
# decode
|
95 |
+
predicted_ids = torch.argmax(logits, dim=-1)
|
96 |
+
predicted_sentence = processor.batch_decode(predicted_ids)[0]
|
97 |
+
```
|
98 |
+
|
99 |
+
2. To use on a local audio file with the language model
|
100 |
+
|
101 |
+
```python
|
102 |
+
import torch
|
103 |
+
import torchaudio
|
104 |
+
|
105 |
+
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
|
106 |
+
|
107 |
+
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
|
108 |
+
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()
|
109 |
+
|
110 |
+
model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate
|
111 |
+
|
112 |
+
# path to your audio file
|
113 |
+
wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"
|
114 |
+
waveform, sample_rate = torchaudio.load(wav_path)
|
115 |
+
waveform = waveform.squeeze(axis=0) # mono
|
116 |
+
|
117 |
+
# resample
|
118 |
+
if sample_rate != 16_000:
|
119 |
+
resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
|
120 |
+
waveform = resampler(waveform)
|
121 |
+
|
122 |
+
# normalize
|
123 |
+
input_dict = processor_with_lm(waveform, sampling_rate=16_000, return_tensors="pt")
|
124 |
+
|
125 |
+
with torch.inference_mode():
|
126 |
+
logits = model(input_dict.input_values.to("cuda")).logits
|
127 |
+
|
128 |
+
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
|
129 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
130 |
|
131 |
|
132 |
## Evaluation
|