---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice v13
      type: mozilla-foundation/common_voice_13_0
      args: fr
    metrics:
    - name: Test PER on Common Voice FR 13.0 | Trained
      type: per
      value: 5.52
    - name: Test PER on Multilingual Librispeech FR | Trained
      type: per
      value: 4.36
    - name: Val PER on Common Voice FR 13.0 | Trained 
      type: per
      value: 4.31
---

# Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French

Fine-tuned [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2) for **French speech-to-phoneme** (without language model) using the train and validation splits of [Common Voice v13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

## Audio sample rate for usage

When using this model, make sure that your speech input is **sampled at 16kHz**.
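
As a minimal usage sketch (the repository id below is a placeholder; substitute this model's actual Hub id), audio can be resampled to 16 kHz with `torchaudio` before being passed to the processor:

```python
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

# Placeholder id: replace with this model's actual Hugging Face Hub id.
MODEL_ID = "your-namespace/wav2vec2-base-fr-phonemizer"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load an audio file and resample it to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("speech_fr.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(
    waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding to a phoneme string (no language model).
predicted_ids = torch.argmax(logits, dim=-1)
phonemes = processor.batch_decode(predicted_ids)
print(phonemes)
```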

## Training procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4x 2080 Ti GPUs, using a DDP strategy with gradient accumulation (256 audio clips per optimizer update, corresponding roughly to 25 minutes of speech per update, i.e. about 2k updates per epoch).
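
The exact per-device batch size and accumulation split are not stated in this card; as one hypothetical configuration consistent with 256 audios per update on 4 GPUs, a `transformers` setup could look like:

```python
from transformers import TrainingArguments

# Hypothetical split of the 256-clip effective batch across 4 GPUs (DDP):
# 4 GPUs x 16 clips/GPU x 4 accumulation steps = 256 clips per optimizer update.
training_args = TrainingArguments(
    output_dir="wav2vec2-base-fr-phonemizer",
    num_train_epochs=14,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
)
```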

- Learning rate schedule: double tri-state schedule (see the code sketch after this list)
    - Warmup from 1e-5 for 7% of total updates
    - Constant at 1e-4 for 28% of total updates
    - Linear decrease to 1e-6 for 36% of total updates
    - Second warmup boost to 3e-5 for 3% of total updates
    - Constant at 3e-5 for 12% of total updates
    - Linear decrease to 1e-7 for remaining 14% of updates
 
- The remaining hyperparameters used for training are the same as those detailed in Annex B and Table 6 of the [wav2vec2 paper](https://arxiv.org/pdf/2006.11477.pdf).
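
As a hedged illustration of the schedule above (the first warmup's target of 1e-4 is inferred from the following constant phase, and linear interpolation within each phase is assumed), the learning rate can be written as a function of the update step:

```python
def double_tristate_lr(step: int, total_steps: int) -> float:
    """Sketch of the double tri-state schedule described in this card.

    Phase boundaries and endpoint values follow the list above; linear
    interpolation within each phase is an assumption.
    """
    frac = step / total_steps
    # Cumulative phase boundaries: 7%, +28%, +36%, +3%, +12%, +14% of updates.
    b = [0.07, 0.35, 0.71, 0.74, 0.86, 1.00]
    if frac < b[0]:  # warmup: 1e-5 -> 1e-4
        return 1e-5 + (1e-4 - 1e-5) * frac / 0.07
    if frac < b[1]:  # constant at 1e-4
        return 1e-4
    if frac < b[2]:  # linear decrease: 1e-4 -> 1e-6
        return 1e-4 + (1e-6 - 1e-4) * (frac - b[1]) / 0.36
    if frac < b[3]:  # second warmup boost: 1e-6 -> 3e-5
        return 1e-6 + (3e-5 - 1e-6) * (frac - b[2]) / 0.03
    if frac < b[4]:  # constant at 3e-5
        return 3e-5
    # final linear decrease: 3e-5 -> 1e-7
    return 3e-5 + (1e-7 - 3e-5) * (frac - b[4]) / 0.14
```

Such a function could be plugged into `torch.optim.lr_scheduler.LambdaLR` with a base learning rate of 1.0, so that its return value is used directly as the learning rate.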