---
language: fr
license: apache-2.0
library_name: transformers
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- google/fleurs
- gigant/african_accented_french
metrics:
- wer
base_model: openai/whisper-large-v2
model-index:
- name: Fine-tuned whisper-large-v2 model for ASR in French
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - type: wer
      value: 8.15
      name: WER (Greedy)
    - type: wer
      value: 7.83
      name: WER (Beam 5)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - type: wer
      value: 4.2
      name: WER (Greedy)
    - type: wer
      value: 4.03
      name: WER (Beam 5)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - type: wer
      value: 9.1
      name: WER (Greedy)
    - type: wer
      value: 8.66
      name: WER (Beam 5)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - type: wer
      value: 5.22
      name: WER (Greedy)
    - type: wer
      value: 4.98
      name: WER (Beam 5)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - type: wer
      value: 4.58
      name: WER (Greedy)
    - type: wer
      value: 4.31
      name: WER (Beam 5)
---

<style>
img {
 display: inline;
}
</style>

![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
![Model size](https://img.shields.io/badge/Params-1550M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

# Fine-tuned whisper-large-v2 model for ASR in French

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), trained on a composite dataset comprising over 2200 hours of French speech audio, drawn from the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is sampled at 16 kHz. **This model doesn't predict casing or punctuation.**

## Performance

*Below are the WERs of the pre-trained models on [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs), as reported in the original [paper](https://cdn.openai.com/papers/whisper.pdf).*

| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 22.7 | 16.2 | 15.7 | 15.0 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 16.0 | 8.9 | 12.2 | 8.7 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 14.7 | 8.9 | **11.0** | **7.7** |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | **13.9** | **7.3** | 11.4 | 8.3 |

*Below are the WERs of the fine-tuned models on [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs). Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. Results in the table are reported as `WER (greedy search) / WER (beam search with beam width 5)`.*

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [bofenghuang/whisper-small-cv11-french](https://huggingface.co/bofenghuang/whisper-small-cv11-french) | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| [bofenghuang/whisper-medium-cv11-french](https://huggingface.co/bofenghuang/whisper-medium-cv11-french) | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| [bofenghuang/whisper-medium-french](https://huggingface.co/bofenghuang/whisper-medium-french) | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| [bofenghuang/whisper-large-v2-cv11-french](https://huggingface.co/bofenghuang/whisper-large-v2-cv11-french) | **8.05** / **7.67** | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| [bofenghuang/whisper-large-v2-french](https://huggingface.co/bofenghuang/whisper-large-v2-french) | 8.15 / 7.83 | **4.20** / **4.03** | **9.10** / **8.66** | **5.22** / **4.98** |
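
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It illustrates the standard definition only. The function name `word_error_rate` and the example sentences are hypothetical, and this is not necessarily the tool used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Hypothetical example: one substituted word out of six reference words.
reference = "le chat dort sur le canapé"
hypothesis = "le chat dort sous le canapé"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}")  # WER: 16.67
```

The tables report this ratio as a percentage, so a value of 8.15 corresponds to roughly 8 word errors per 100 reference words.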

## Usage

Inference with 🤗 Pipeline

```python
import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-large-v2-french", device=device)

# NB: set forced_decoder_ids so that generation is forced to French transcription
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
```

Inference with 🤗 low-level APIs

```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-large-v2-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-large-v2-french", language="french", task="transcribe")

# NB: set forced_decoder_ids so that generation is forced to French transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()  # cast to float32 to match the resampler's default kernel dtype
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract log-mel input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
```
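
The "normalise if necessary" step above could look like the following minimal sketch. It assumes the goal is to match the evaluation setup described in the Performance section (lowercase text, French alphabet characters only, no punctuation except apostrophes). The function name `normalise_french` and the exact character set are assumptions, not the preprocessing actually used for the reported results.

```python
import re
import unicodedata

# Assumption: keep lowercase French letters, apostrophes and spaces.
# Everything else (punctuation, digits, symbols) is replaced by a space.
_DISALLOWED = re.compile(r"[^a-zàâäçéèêëîïôöùûüÿœæ' ]")


def normalise_french(text: str) -> str:
    """Lowercase, unify apostrophes, drop disallowed characters, squeeze spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("’", "'")  # map typographic apostrophes to ASCII
    text = _DISALLOWED.sub(" ", text)
    return " ".join(text.split())


print(normalise_french("Bonjour, l’été est là !"))  # bonjour l'été est là
```

Applying the same function to both references and predictions before scoring keeps the comparison consistent. Note that this sketch also drops digits and splits hyphenated words, which may or may not be desirable for a given use case.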