---
library_name: transformers
license: mit
language: fr
datasets:
- Cnam-LMSSC/vibravox
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
- phoneme
model-index:
- name: Wav2Vec2-base French finetuned for Speech-to-Phoneme by LMSSC
  results:
  - task:
      name: Speech-to-Phoneme
      type: automatic-speech-recognition
    dataset:
      name: Vibravox["body_conducted.in_ear.rigid_earpiece_microphone"]
      type: Cnam-LMSSC/vibravox
      args: fr
    metrics:
    - name: >-
        Test PER on Vibravox["body_conducted.in_ear.rigid_earpiece_microphone"] | Trained
      type: per
      value: 3.998
---
# Model Card

- Developed by: Cnam-LMSSC
- Model type: Wav2Vec2ForCTC
- Language: French
- License: MIT
- Finetuned from model: facebook/wav2vec2-base-fr-voxpopuli-v2
- Finetuned dataset: body_conducted.in_ear.rigid_earpiece_microphone audio of the speech_clean subset of Cnam-LMSSC/vibravox
- Samplerate for usage: 16 kHz
## Output
As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of IPA-encoded words, without punctuation. If you do not read the phonetic alphabet fluently, you can use an IPA reader website to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.
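For illustration, a short French utterance such as "il fait beau" would come out as something like the following (hypothetical example, not actual model output):

```
il fɛ bo
```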
## Training procedure
The model has been finetuned for 10 epochs with a constant learning rate of 1e-5. To reproduce the experiment, please visit jhauret/vibravox.
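For reference, the reported hyperparameters expressed with the transformers Trainer API would look roughly like the sketch below. The actual training code lives in jhauret/vibravox and may differ; the batch size and output path are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-stp",     # hypothetical output path
    num_train_epochs=10,             # from the model card
    learning_rate=1e-5,              # from the model card
    lr_scheduler_type="constant",    # constant learning rate, as reported
    per_device_train_batch_size=8,   # illustrative assumption
)
```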
Inference script (if you do not want to use the huggingsound library):
```python
import torch, torchaudio
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("Cnam-LMSSC/body_conducted.in_ear.rigid_earpiece_microphone")
model = AutoModelForCTC.from_pretrained("Cnam-LMSSC/body_conducted.in_ear.rigid_earpiece_microphone")

# Stream one test sample recorded with the in-ear rigid earpiece microphone
test_dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)
audio_48kHz = torch.Tensor(next(iter(test_dataset))["audio.body_conducted.in_ear.rigid_earpiece_microphone"]["array"])

# The model expects 16 kHz audio; Vibravox is recorded at 48 kHz
audio_16kHz = torchaudio.functional.resample(audio_48kHz, orig_freq=48_000, new_freq=16_000)

inputs = processor(audio_16kHz, sampling_rate=16_000, return_tensors="pt")

# Greedy CTC decoding: argmax over the logits, then collapse to phonemes
with torch.inference_mode():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription:", transcription)
```
## Test Results
In the table below, we report the Phoneme Error Rate (PER) of the model on the different microphones and subsets of Vibravox:
| Test Set | PER |
|---|---|
| Vibravox/speech_clean/airborne.mouth_headworn.reference_microphone | ??% |
| Vibravox/speech_clean/body_conducted.forehead.miniature_accelerometer | ??% |
| Vibravox/speech_clean/body_conducted.in_ear.comply_foam_microphone | ??% |
| Vibravox/speech_clean/body_conducted.in_ear.rigid_earpiece_microphone | 3.998% |
| Vibravox/speech_clean/body_conducted.throat.piezoelectric_sensor | ??% |
| Vibravox/speech_clean/body_conducted.temple.contact_microphone | ??% |
| Vibravox/speech_noisy/airborne.mouth_headworn.reference_microphone | ??% |
| Vibravox/speech_noisy/body_conducted.forehead.miniature_accelerometer | ??% |
| Vibravox/speech_noisy/body_conducted.in_ear.comply_foam_microphone | ??% |
| Vibravox/speech_noisy/body_conducted.in_ear.rigid_earpiece_microphone | ??% |
| Vibravox/speech_noisy/body_conducted.throat.piezoelectric_sensor | ??% |
| Vibravox/speech_noisy/body_conducted.temple.contact_microphone | ??% |
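To compute a PER yourself, one option is a character error rate over the IPA strings, assuming one IPA symbol per phoneme (an approximation that ignores diacritics and multi-character symbols). A minimal sketch using the jiwer package, with a placeholder reference:

```python
from jiwer import cer

reference = "il fɛ bo"          # placeholder: phonemized ground truth from the dataset
hypothesis = transcription[0]   # output of the inference script above
print(f"Approximate PER: {cer(reference, hypothesis):.2%}")
```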