---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
- phoneme
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice v13
      type: mozilla-foundation/common_voice_13_0
      args: fr
    metrics:
    - name: Test PER on Common Voice FR 13.0 | Trained
      type: per
      value: 5.52
    - name: Test PER on Multilingual Librispeech FR | Trained
      type: per
      value: 4.36
    - name: Val PER on Common Voice FR 13.0 | Trained 
      type: per
      value: 4.31
---

# Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French

Fine-tuned [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2) for **French speech-to-phoneme** (without language model) using the train and validation splits of [Common Voice v13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

## Audio sample rate for usage

When using this model, make sure that your speech input is **sampled at 16 kHz**.
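If your recordings use another sample rate, they need to be resampled first. The snippet below is a minimal sketch of one way to do this, assuming NumPy and SciPy are installed and the audio is mono. The helper name `to_16k` is hypothetical and not part of this model's tooling.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sample rate expected by the model


def to_16k(samples: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz using a polyphase filter."""
    if source_rate == TARGET_RATE:
        return samples
    divisor = gcd(source_rate, TARGET_RATE)
    # resample_poly upsamples by `up`, low-pass filters, then downsamples by `down`.
    return resample_poly(samples, TARGET_RATE // divisor, source_rate // divisor)


# Example: one second of a 440 Hz tone recorded at 44.1 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
resampled = to_16k(tone, 44_100)
print(resampled.shape)  # (16000,)
```

Libraries such as librosa or torchaudio offer equivalent resampling utilities if you already depend on them.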

## Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of [IPA-encoded](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) words, without punctuation.
If you don't read the phonetic alphabet fluently, you can use this excellent [IPA reader website](http://ipa-reader.xyz) to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

## Training procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4 × 2080 Ti GPUs at Cnam/LMSSC, using a DDP (distributed data parallel) strategy with gradient accumulation (256 audio clips per update, corresponding to roughly 25 minutes of speech per update, i.e. about 2k updates per epoch).

- Learning rate schedule: double tri-state schedule
    - Warmup from 1e-5 for 7% of total updates
    - Constant at 1e-4 for 28% of total updates
    - Linear decrease to 1e-6 for 36% of total updates
    - Second warmup boost to 3e-5 for 3% of total updates
    - Constant at 3e-5 for 12% of total updates
    - Linear decrease to 1e-7 for remaining 14% of updates
 
- The other training hyperparameters are the same as those detailed in Appendix B and Table 6 of the [wav2vec 2.0 paper](https://arxiv.org/pdf/2006.11477.pdf).
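The double tri-state learning rate schedule listed above can be sketched as a piecewise-linear function of training progress. This is an illustration rather than the training code itself. It assumes that the warmups are linear, that the first warmup ends at 1e-4, and that the second warmup starts from the 1e-6 reached at the end of the first decay. The function name `learning_rate` and the `BREAKPOINTS` table are hypothetical.

```python
# Breakpoints as (fraction of total updates, learning rate).
# The fractions are the cumulative sums of the percentages listed above.
BREAKPOINTS = [
    (0.00, 1e-5),  # start of first warmup
    (0.07, 1e-4),  # end of warmup (assumed to reach the constant value)
    (0.35, 1e-4),  # end of first constant phase
    (0.71, 1e-6),  # end of first linear decrease
    (0.74, 3e-5),  # end of second warmup
    (0.86, 3e-5),  # end of second constant phase
    (1.00, 1e-7),  # end of final linear decrease
]


def learning_rate(step: int, total_steps: int) -> float:
    """Linearly interpolate the learning rate between breakpoints."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if progress <= x1:
            t = (progress - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return BREAKPOINTS[-1][1]


# About 2k updates per epoch for 14 epochs gives roughly 28k updates in total.
TOTAL_STEPS = 28_000
for step in (0, 5_000, 20_000, 22_000, 28_000):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```

A function of this shape can be wrapped in a framework scheduler (for example as a multiplicative factor in a lambda-based scheduler) if you want to reproduce a similar schedule.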

## Usage (using the online Inference API)

Just record your voice using the ⚡ Inference API on this page, then click on "Compute". That's all!

## Usage (with HuggingSound library)

The model can be used directly with the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
import pandas as pd
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-french-phonemizer")
audio_paths = ["./test_relecture_texte.wav", "./10179_11051_000021.flac"]

# The audio files do not need to be sampled at 16 kHz here:
# HuggingSound resamples them automatically.

transcriptions = model.transcribe(audio_paths)

# (Optional) Display the results in a table.
# `transcriptions` is a list of dicts that also contain timestamps and probabilities.

df = pd.DataFrame(transcriptions)
df['Audio file'] = audio_paths
df.set_index('Audio file', inplace=True)
df[['transcription']]
```

**Output** : 

| **Audio file**                 | **Phonetic transcription (IPA)**                                                                                                                                           |
|:---------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------|
| ./test_relecture_texte.wav | ʃapitʁ di də abɛse pəti kɔ̃t də ʒyl ləmɛtʁ ɑ̃ʁʒistʁe puʁ libʁivɔksɔʁɡ ibis dɑ̃ la bas kuʁ dœ̃ ʃato sə tʁuva paʁmi tut sɔʁt də volaj œ̃n ibis ʁɔz             |
| ./10179_11051_000021.flac  | kɛl dɔmaʒ kə sə nə swa pa dy sykʁ supiʁa se foʁaz ɑ̃ pasɑ̃ sa lɑ̃ɡ syʁ la vitʁ fɛ̃ dy ʃapitʁ kɛ̃z ɑ̃ʁʒistʁe paʁ sonjɛ̃ sɛt ɑ̃ʁʒistʁəmɑ̃ fɛ paʁti dy domɛn pyblik |

## Inference script (if you do not want to use the HuggingSound library)

```python
import torch
import soundfile as sf  # or librosa, if you prefer
from transformers import AutoModelForCTC, Wav2Vec2Processor

MODEL_ID = "Cnam-LMSSC/wav2vec2-french-phonemizer"

model = AutoModelForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# sf.read returns a (samples, sample_rate) tuple.
# Make sure the file is mono and sampled at 16 kHz, or resample it first!
audio, sample_rate = sf.read("example.wav")

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token at each frame,
# then let the processor collapse repeats and remove blank tokens.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription : ", transcription)
```

**Output** : 

'ʒə syi tʁɛ kɔ̃tɑ̃ də vu pʁezɑ̃te notʁ solysjɔ̃ puʁ fonomize dez odjo fasilmɑ̃ sa fɔ̃ksjɔn kɑ̃ mɛm tʁɛ bjɛ̃'

## Test Results:

The table below reports the Phoneme Error Rate (PER) of the model on both Common Voice and Multilingual Librispeech (using the French configuration of each dataset), when fine-tuned on the Common Voice train set only:

| Model | Test Set  | PER |
| ------------- | ------------- | ------------- |
| Cnam-LMSSC/wav2vec2-french-phonemizer | Common Voice v13 (French) | **5.52%** |
| Cnam-LMSSC/wav2vec2-french-phonemizer | Multilingual Librispeech (French) | **4.36%** |
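For readers unfamiliar with the metric, PER is conventionally defined as the edit (Levenshtein) distance between the predicted and reference phoneme sequences, divided by the number of reference phonemes. The sketch below illustrates that definition in plain Python. It is not the evaluation script used for the table above, and the function names are hypothetical. It takes pre-tokenized phoneme lists, because how IPA strings are split into phonemes (for example, whether a nasal vowel such as ɔ̃ counts as one token) is an assumption left to the caller.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: Sequence[Sequence[str]],
                       hyps: Sequence[Sequence[str]]) -> float:
    """Total edit distance divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# One substitution (ɔ̃ -> ɔ) out of five reference phonemes.
reference = [["b", "ɔ̃", "ʒ", "u", "ʁ"]]
hypothesis = [["b", "ɔ", "ʒ", "u", "ʁ"]]
print(f"PER = {phoneme_error_rate(reference, hypothesis):.0%}")  # PER = 20%
```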


## Citation
If you use this fine-tuned model in a publication, please cite our work as follows:

```bibtex
@misc {lmssc-wav2vec2-base-phonemizer-french_2023,
	author       = { Olivier, Malo AND Hauret, Julien AND Bavu, {É}ric },
	title        = { wav2vec2-french-phonemizer (Revision e715906) },
	year         = 2023,
	url          = { https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer },
	doi          = { 10.57967/hf/1339 },
	publisher    = { Hugging Face }
}
```