Model Card

Output

Because this model is trained specifically for a speech-to-phoneme task, its output is a sequence of IPA-encoded words without punctuation. If you do not read the phonetic alphabet fluently, you can use this excellent IPA reader website to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

Links to phonemizer models trained on other body-conducted sensors:

An entry point to all phonemizer models trained on the different sensor data from the Vibravox dataset is available at https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers.

Disclaimer

Each of these models has been trained for a specific non-conventional speech sensor and is intended to be used with in-domain data. The only exception is the headset microphone phonemizer, which can certainly be used for many applications using audio data captured by airborne microphones.

Please be advised that using these models on data from a sensor other than the one they were trained on may result in suboptimal performance.

Training procedure

The model was fine-tuned for 10 epochs with a constant learning rate of 1e-5. To reproduce the experiment, please visit jhauret/vibravox.

Inference script:

import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

# Load the processor (feature extractor + tokenizer) and the CTC model
processor = AutoProcessor.from_pretrained("Cnam-LMSSC/phonemizer_soft_in_ear_microphone")
model = AutoModelForCTC.from_pretrained("Cnam-LMSSC/phonemizer_soft_in_ear_microphone")
test_dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)

# Take the first test sample and resample it from 48 kHz to the 16 kHz passed to the processor
audio_48kHz = torch.Tensor(next(iter(test_dataset))["audio.soft_in_ear_microphone"]["array"])
audio_16kHz = torchaudio.functional.resample(audio_48kHz, orig_freq=48_000, new_freq=16_000)

inputs = processor(audio_16kHz, sampling_rate=16_000, return_tensors="pt")

# Inference only: no gradients are needed
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: most likely token per frame, then decode to IPA text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription:", transcription)
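
The last two steps of the script (per-frame argmax followed by batch_decode) amount to greedy CTC decoding. The sketch below illustrates the collapsing rule on a toy example in pure Python. The function name, the toy vocabulary, and the choice of 0 as the blank id are assumptions made for illustration; the actual blank id and vocabulary come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Turn per-frame argmax ids into a label sequence (greedy CTC decoding).

    Step 1: merge runs of identical consecutive ids.
    Step 2: drop the blank id (assumed to be 0 here, for illustration only).
    """
    return [label for label, _ in groupby(frame_ids) if label != blank_id]


# Hypothetical toy vocabulary, not the model's real one
toy_vocab = {0: "<blank>", 1: "s", 2: "a", 3: "l", 4: "y"}

# One id per audio frame, as produced by an argmax over the logits
frame_ids = [0, 1, 1, 0, 2, 2, 0, 3, 4, 4]
collapsed = ctc_greedy_collapse(frame_ids)
print("".join(toy_vocab[i] for i in collapsed))  # saly
```

Note that a blank between two identical ids keeps them separate: `[1, 0, 1]` decodes to two labels, which is how CTC represents a genuinely repeated phoneme.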

Model size: 94.4M params (Safetensors, tensor type F32)

Dataset used to train Cnam-LMSSC/phonemizer_soft_in_ear_microphone: Cnam-LMSSC/vibravox

Evaluation results

  • Test PER, in-domain training, on Vibravox["soft_in_ear_microphone"] (self-reported): 4.000
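
For readers unfamiliar with the metric, here is a minimal sketch of how a phoneme error rate (PER) is commonly computed: the Levenshtein distance between predicted and reference phoneme sequences, divided by the total reference length. It assumes each transcription is already split into a list of phoneme symbols. The function names are hypothetical, and this is not necessarily the exact evaluation code behind the figure above, nor does it say whether that figure is a fraction or a percentage.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance over total reference length, across a corpus."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_length = sum(len(r) for r in references)
    return total_edits / total_length


# Toy example: one substituted phoneme out of four reference phonemes
refs = [["s", "a", "l", "y"]]
hyps = [["s", "a", "l", "u"]]
print(phoneme_error_rate(refs, hyps))  # 0.25
```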