---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice v13
      type: mozilla-foundation/common_voice_13_0
      args: fr
    metrics:
    - name: Test PER on Common Voice FR 13.0 | Trained
      type: per
      value: 5.52
    - name: Test PER on Multilingual Librispeech FR | Trained
      type: per
      value: 4.36
    - name: Val PER on Common Voice FR 13.0 | Trained
      type: per
      value: 4.31
---

# Fine-tuned French Voxpopuli v2 wav2vec2-base model for the speech-to-phoneme task in French

Fine-tuned [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2) for **French speech-to-phoneme** transcription (without a language model), using the train and validation splits of [Common Voice v13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

## Audio samplerate for usage

When using this model, make sure that your speech input is **sampled at 16 kHz** (an inference sketch including resampling is given at the end of this card).

## Training procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4x 2080 Ti GPUs, using a DDP strategy with gradient accumulation (256 audio clips per update, corresponding roughly to 25 minutes of speech per update, i.e. about 2k updates per epoch).

- Learning rate schedule: double tri-state schedule (sketched in code after this list)
  - Warmup from 1e-5 to 1e-4 over the first 7% of total updates
  - Constant at 1e-4 for 28% of total updates
  - Linear decrease to 1e-6 for 36% of total updates
  - Second warmup boost from 1e-6 to 3e-5 for 3% of total updates
  - Constant at 3e-5 for 12% of total updates
  - Linear decrease to 1e-7 for the remaining 14% of updates
- The other training hyperparameters are the same as those detailed in Annex B and Table 6 of the [wav2vec 2.0 paper](https://arxiv.org/pdf/2006.11477.pdf).
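
As a reference, below is a minimal Python sketch of this double tri-state schedule. The phase fractions and learning-rate values come from the list above; the function name and the linear-interpolation implementation are illustrative assumptions, not the exact training code.

```python
def double_tristate_lr(step: int, total_steps: int) -> float:
    """Learning rate at a given update, following the schedule above.

    Illustrative sketch only: the phase fractions and LR values are taken
    from this card, the interpolation code itself is an assumption.
    """
    # (fraction of total updates, LR at phase start, LR at phase end)
    phases = [
        (0.07, 1e-5, 1e-4),  # warmup
        (0.28, 1e-4, 1e-4),  # constant plateau
        (0.36, 1e-4, 1e-6),  # linear decrease
        (0.03, 1e-6, 3e-5),  # second warmup boost
        (0.12, 3e-5, 3e-5),  # constant plateau
        (0.14, 3e-5, 1e-7),  # final linear decrease
    ]
    frac = min(step / total_steps, 1.0)
    start = 0.0
    for length, lr_begin, lr_end in phases:
        if frac <= start + length:
            t = (frac - start) / length  # progress within this phase
            return lr_begin + t * (lr_end - lr_begin)
        start += length
    return 1e-7  # past the last phase: keep the final learning rate
```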
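
## Usage

A minimal inference sketch (not the authors' exact pipeline): it assumes the checkpoint loads with the standard `transformers` CTC classes and that greedy CTC decoding is sufficient, since no language model is used. `model_id` and `speech.wav` are placeholders to replace with this repository's id and your own audio file.

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_id = "<this-model-repo-id>"  # placeholder: use this repository's HF id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load the audio, mix down to mono, and resample to the expected 16 kHz.
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0)
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# CTC forward pass followed by greedy decoding into a phoneme string.
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```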