Model Card

Developed by: Cnam-LMSSC
Model type: Wav2Vec2ForCTC
Language: French
License: MIT
Finetuned from model: facebook/wav2vec2-base-fr-voxpopuli-v2
Finetuning dataset: airborne.mouth_headworn.reference_microphone audio of the speech_clean subset of Cnam-LMSSC/vibravox
Sample rate for usage: 16 kHz

Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of IPA-encoded words, without punctuation. If you don't read the phonetic alphabet fluently, you can use this excellent IPA reader website to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

Training procedure

The model has been finetuned for 10 epochs with a constant learning rate of 1e-5. To reproduce the experiment, please visit jhauret/vibravox.

Inference script (if you do not want to use the huggingsound library):

import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

# Load the processor (feature extractor + tokenizer) and the finetuned CTC model
processor = AutoProcessor.from_pretrained("Cnam-LMSSC/phonemizer_airborne.mouth_headworn.reference_microphone")
model = AutoModelForCTC.from_pretrained("Cnam-LMSSC/phonemizer_airborne.mouth_headworn.reference_microphone")
# Stream the test split so the whole dataset is not downloaded
test_dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)

# Take the first test sample and resample it from 48 kHz to the 16 kHz the model expects
audio_48kHz = torch.Tensor(next(iter(test_dataset))["audio.airborne.mouth_headworn.reference_microphone"]["array"])
audio_16kHz = torchaudio.functional.resample(audio_48kHz, orig_freq=48_000, new_freq=16_000)

inputs = processor(audio_16kHz, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    logits = model(inputs.input_values).logits
# Greedy CTC decoding: most likely token per frame, then decode to a phoneme string
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription:", transcription)
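The last two steps of the script (argmax over the logits, then batch_decode) amount to greedy CTC decoding: consecutive repeated ids are merged, then the blank token is dropped. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id of 0, and the function name ctc_greedy_decode are assumptions made for illustration only; the real tokenizer uses the model's own vocabulary and also handles special tokens such as word delimiters, which this sketch ignores.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse per-frame argmax ids into a string (greedy CTC decoding)."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates repeated symbols.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; NOT the vocabulary of the actual model.
toy_vocab = {0: "<pad>", 1: "b", 2: "ɔ̃", 3: "ʒ", 4: "u", 5: "ʁ"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 4, 4, 0, 5, 5]
decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)
```

Because blanks are removed only after repeats are merged, a genuinely doubled symbol survives as long as a blank frame separates its two occurrences.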

Test Results:

In the table below, we report the Phoneme Error Rate (PER) of the model on the different microphones and subsets of Vibravox:

Test Set	PER
Vibravox/speech_clean/airborne.mouth_headworn.reference_microphone	2.874%
Vibravox/speech_clean/body_conducted.forehead.miniature_accelerometer	??%
Vibravox/speech_clean/body_conducted.in_ear.comply_foam_microphone	??%
Vibravox/speech_clean/body_conducted.in_ear.rigid_earpiece_microphone	??%
Vibravox/speech_clean/body_conducted.throat.piezoelectric_sensor	??%
Vibravox/speech_clean/body_conducted.temple.contact_microphone	??%
Vibravox/speech_noisy/airborne.mouth_headworn.reference_microphone	??%
Vibravox/speech_noisy/body_conducted.forehead.miniature_accelerometer	??%
Vibravox/speech_noisy/body_conducted.in_ear.comply_foam_microphone	??%
Vibravox/speech_noisy/body_conducted.in_ear.rigid_earpiece_microphone	??%
Vibravox/speech_noisy/body_conducted.throat.piezoelectric_sensor	??%
Vibravox/speech_noisy/body_conducted.temple.contact_microphone	??%
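For readers unfamiliar with the metric, PER is commonly defined as the edit distance (substitutions, deletions, insertions) between the predicted and reference phoneme sequences, divided by the number of reference phonemes. The following is a minimal sketch of that usual definition, not the evaluation code used to produce the table above: the function names, the example sequences, and the choice to treat each list element as one phoneme are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total number of reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phonemes = sum(len(r) for r in references)
    return total_errors / total_phonemes


# Hypothetical example: one substitution and one deletion over 5 reference phonemes.
refs = [["b", "ɔ̃", "ʒ", "u", "ʁ"]]
hyps = [["b", "ɔ", "ʒ", "u"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER: {per:.2%}")
```

Summing errors over the whole test set before dividing weights long utterances more than short ones; averaging per-utterance rates is an alternative convention that can give a different number.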
