---
license: mit
datasets:
- Cnam-LMSSC/vibravox
language:
- fr
---

# Master Model Card: Vibravox Speech-to-Phonemes Models

## Overview

This master model card serves as an entry point for exploring [multiple speech-to-phoneme models](https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers#available-models) trained on different sensor data from the [Vibravox dataset](https://huggingface.co/datasets/Cnam-LMSSC/vibravox). These models convert French speech into sequences of International Phonetic Alphabet (IPA) encoded words. Each one is fine-tuned on a specific sensor to address various audio capture scenarios using **body-conducted** sound and vibration sensors.

## Disclaimer

Each of these models has been trained for a **specific non-conventional speech sensor** and is intended to be used with **in-domain data**. The only exception is the headset microphone phonemizer, which can also serve general applications relying on audio captured by airborne microphones. Using these models outside their intended sensor domain may result in suboptimal performance.

## Task Description

The primary task for these models is automatic speech recognition (ASR) in a speech-to-phoneme setting. Each model takes audio input and outputs a sequence of phonemes encoded in the IPA, enabling precise phonetic transcription of French speech. Users unfamiliar with the phonetic alphabet can use tools like the [IPA reader](http://ipa-reader.xyz) to convert a transcript back to synthetic speech and evaluate the transcription quality.

## Usage

All models are fine-tuned versions of [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2), each adapted to a different sensor input. They are intended to be used with audio sampled at 16 kHz. A minimal inference sketch is provided at the end of this card.

## Training Procedure

Each model was fine-tuned for 10 epochs with a constant learning rate of 1e-5. Detailed instructions for reproducing the experiments are available in the [jhauret/vibravox](https://github.com/jhauret/vibravox) GitHub repository and in the [Vibravox paper on arXiv](https://arxiv.org/abs/2407.11828); the key hyperparameters are also sketched at the end of this card.

## Available Models

The following models are available, **each trained on a different sensor** of the [Vibravox](https://huggingface.co/datasets/Cnam-LMSSC/vibravox) `speech_clean` subset:

| **Transducer** | **Hugging Face model link** |
|:---------------------------|:---------------------|
| Reference headset microphone | [phonemizer_headset_microphone](https://huggingface.co/Cnam-LMSSC/phonemizer_headset_microphone) |
| In-ear comply foam-embedded microphone | [phonemizer_soft_in_ear_microphone](https://huggingface.co/Cnam-LMSSC/phonemizer_soft_in_ear_microphone) |
| In-ear rigid earpiece-embedded microphone | [phonemizer_rigid_in_ear_microphone](https://huggingface.co/Cnam-LMSSC/phonemizer_rigid_in_ear_microphone) |
| Forehead miniature vibration sensor | [phonemizer_forehead_accelerometer](https://huggingface.co/Cnam-LMSSC/phonemizer_forehead_accelerometer) |
| Temple vibration pickup | [phonemizer_temple_vibration_pickup](https://huggingface.co/Cnam-LMSSC/phonemizer_temple_vibration_pickup) |
| Laryngophone | [phonemizer_throat_microphone](https://huggingface.co/Cnam-LMSSC/phonemizer_throat_microphone) |
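## Inference Example

A minimal inference sketch using the `transformers` and `datasets` libraries, shown here for the headset microphone phonemizer. The dataset audio column name (`audio.headset_microphone`) and the use of the `test` split are assumptions based on the dataset card; adapt the model id and column name to your sensor.

```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForCTC, AutoProcessor

# Pick the phonemizer matching your sensor (headset microphone shown here).
model_id = "Cnam-LMSSC/phonemizer_headset_microphone"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Stream one in-domain sample from the `speech_clean` subset, resampled to the
# 16 kHz rate the models expect. The column name "audio.headset_microphone" is
# an assumption; check the dataset card for the exact field names.
dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)
dataset = dataset.cast_column("audio.headset_microphone", Audio(sampling_rate=16_000))
sample = next(iter(dataset))

# Forward pass through the CTC model, then greedy decoding to an IPA string.
inputs = processor(
    sample["audio.headset_microphone"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```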
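For reference, the hyperparameters stated in the Training Procedure section translate to the following Hugging Face `TrainingArguments`. This is a sketch only: the actual recipe in the [jhauret/vibravox](https://github.com/jhauret/vibravox) repository ships its own training code.

```python
from transformers import TrainingArguments

# Sketch of the stated fine-tuning setup: 10 epochs at a constant learning rate of 1e-5.
training_args = TrainingArguments(
    output_dir="wav2vec2-vibravox-phonemizer",  # hypothetical output directory
    num_train_epochs=10,
    learning_rate=1e-5,
    lr_scheduler_type="constant",  # no warmup or decay
)
```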