---
language:
- en
license: apache-2.0
tags:
- automatic-speech-recognition
- en
- hf-asr-leaderboard
- mozilla-foundation/common_voice_8_0
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
base_model: facebook/wav2vec2-xls-r-1b
model-index:
- name: XLS-R Wav2Vec2 English by Jonatas Grosman
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 21.05
      name: Test WER
    - type: cer
      value: 8.44
      name: Test CER
    - type: wer
      value: 17.31
      name: Test WER (+LM)
    - type: cer
      value: 7.77
      name: Test CER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: en
    metrics:
    - type: wer
      value: 20.53
      name: Dev WER
    - type: cer
      value: 9.31
      name: Dev CER
    - type: wer
      value: 17.7
      name: Dev WER (+LM)
    - type: cer
      value: 8.93
      name: Dev CER (+LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: en
    metrics:
    - type: wer
      value: 17.88
      name: Test WER
---

# Fine-tuned XLS-R 1B model for speech recognition in English

This model is [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) fine-tuned on English using the train and validation splits of [Common Voice 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0), [Multilingual LibriSpeech](https://www.openslr.org/94/), [TED-LIUMv3](https://www.openslr.org/51/), and [VoxPopuli](https://github.com/facebookresearch/voxpopuli).
When using this model, make sure that your speech input is sampled at 16 kHz.
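
If your audio was recorded at a different rate, it has to be resampled before it reaches the model. The snippet below is a minimal sketch of one way to do that, assuming the waveform is already loaded as a NumPy array; the helper name `to_16k` is hypothetical, and `scipy.signal.resample_poly` is used here only for illustration (the inference script in this card relies on `librosa.load(..., sr=16_000)` instead).

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate expected by the model


def to_16k(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper, not part of this repo)."""
    if orig_sr == TARGET_SR:
        # Already at the expected rate, nothing to do
        return waveform
    # Reduce the rate ratio to the smallest integer up/down factors
    g = gcd(orig_sr, TARGET_SR)
    return resample_poly(waveform, TARGET_SR // g, orig_sr // g)


# One second of a 440 Hz tone recorded at 44.1 kHz
t = np.arange(44_100) / 44_100
tone_44k = np.sin(2 * np.pi * 440 * t)
tone_16k = to_16k(tone_44k, 44_100)
```

The resampled array can then be passed to the processor with `sampling_rate=16_000`.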

This model was fine-tuned with the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) tool, thanks to GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

## Usage

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

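# Greedy CTC decoding: take the most likely token per frame;
# batch_decode collapses repeated tokens and blanks into text
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

# Compare each reference sentence with the model's prediction
for reference, prediction in zip(test_dataset["sentence"], predicted_sentences):
    print("Reference:", reference)
    print("Prediction:", prediction)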
```

## Evaluation Commands

1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-english --dataset mozilla-foundation/common_voice_8_0 --config en --split test
```

2. To evaluate on `speech-recognition-community-v2/dev_data` with split `validation`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
```
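
The metrics listed in this card's metadata are word error rate (WER) and character error rate (CER). As a rough illustration of what those numbers mean, here is a minimal, dependency-free sketch based on Levenshtein edit distance. The function names are hypothetical, and this is not necessarily how `eval.py` computes its scores (for example, it may normalize text or aggregate over the whole corpus rather than per sentence).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of three, one substituted character out of eleven
example_wer = wer("THE CAT SAT", "THE CAT SIT")
example_cer = cer("THE CAT SAT", "THE CAT SIT")
```

Multiplying these ratios by 100 gives values on the same percentage scale as the ones reported in the metadata.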

## Citation

If you want to cite this model, you can use this:

```bibtex
@misc{grosman2021xlsr-1b-english,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-english}},
  year={2022}
}
```