---
language:
- fr
license: apache-2.0
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_9_0
- generated_from_trainer
- hf-asr-leaderboard
- robust-speech-event
datasets:
- common_voice
- mozilla-foundation/common_voice_9_0
base_model: facebook/wav2vec2-xls-r-1b
model-index:
- name: Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 9
      type: mozilla-foundation/common_voice_9_0
      args: fr
    metrics:
    - type: wer
      value: 12.72
      name: Test WER
    - type: wer
      value: 10.6
      name: Test WER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - type: wer
      value: 24.28
      name: Test WER
    - type: wer
      value: 20.85
      name: Test WER (+LM)
---


# Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the French (`fr`) subset of the `mozilla-foundation/common_voice_9_0` dataset.


## Usage

1. To use on a local audio file without the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()

# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (same as squeeze for single-channel audio)

# resample
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```

2. To use on a local audio file with the language model (requires the `pyctcdecode` and `kenlm` packages)

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()

model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate

# path to your audio file
wav_path = "example.wav"
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.mean(dim=0)  # downmix to mono (same as squeeze for single-channel audio)

# resample to the rate expected by the model
if sample_rate != model_sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sampling_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sampling_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits

predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```
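
The decoding step in the first example (an `argmax` over the logits followed by `batch_decode`) is greedy CTC decoding. The sketch below illustrates the collapsing rule on a toy sequence: merge consecutive repeated ids, then drop the blank token. The vocabulary, the `blank_id=0` default, and the helper name `ctc_greedy_collapse` are assumptions made for illustration; the real tokenizer has its own vocabulary, and its blank id should be read from the tokenizer rather than assumed.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse frame-level ids the CTC way: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive ids
    return [token for token, _ in groupby(ids) if token != blank_id]


# Hypothetical toy vocabulary, not the model's real one
vocab = {0: "<pad>", 1: "s", 2: "a", 3: "l", 4: "u", 5: "t"}

# One id per audio frame, as produced by argmax over the logits
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 0, 5]

collapsed = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in collapsed)
print(text)  # salut
```

A blank between two identical ids keeps both of them, which is how a doubled letter survives the collapsing step: `[3, 0, 3]` becomes `[3, 3]`.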


## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_9_0`

```bash
python eval.py \
  --model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
  --dataset "mozilla-foundation/common_voice_9_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_9_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
  --model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 5.0 \
  --stride_length_s 1.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
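
The scores reported for this model are word error rates (WER). As a minimal sketch of the metric, assuming plain whitespace tokenization and no text normalization (the evaluation script may normalize text differently), WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "le chat dort sur le canapé"
hypothesis = "le chat dors sur canapé"

# One substitution (dort -> dors) and one deletion (le) over 6 reference words
print(f"{100 * word_error_rate(reference, hypothesis):.2f}")  # 33.33
```

Corpus-level WER is usually computed by summing the errors over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.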