---
library_name: transformers
license: mit
language: fr
datasets:
- Cnam-LMSSC/vibravox
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
- phoneme
model-index:
- name: Wav2Vec2-base French finetuned for Speech-to-Phoneme by LMSSC
  results:
  - task:
      name: Speech-to-Phoneme
      type: automatic-speech-recognition
    dataset:
      name: Vibravox["forehead_accelerometer"]
      type: Cnam-LMSSC/vibravox
      args: fr
    metrics:
    - name: Test PER, in-domain training
      type: per
      value: 4.5
---

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/65302a613ecbe51d6a6ddcec/zhB1fh-c0pjlj-Tr4Vpmr.png" style="object-fit:contain; width:280px; height:280px;" >
</p>

# Model Card

- **Developed by:** [Cnam-LMSSC](https://huggingface.co/Cnam-LMSSC)
- **Model type:** [Wav2Vec2ForCTC](https://huggingface.co/transformers/v4.9.2/model_doc/wav2vec2.html#transformers.Wav2Vec2ForCTC)
- **Language:** French
- **License:** MIT
- **Finetuned from model:** [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2)
- **Finetuned dataset:** `forehead_accelerometer` audio of the `speech_clean` subset of [Cnam-LMSSC/vibravox](https://huggingface.co/datasets/Cnam-LMSSC/vibravox) (see the [Vibravox paper on arXiv](https://arxiv.org/abs/2407.11828))
- **Sample rate for usage:** 16 kHz

## Output

As this model is specifically trained for a speech-to-phoneme task, its output is a sequence of [IPA-encoded](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) words, without punctuation. For example, the French phrase « bonjour tout le monde » would be transcribed along the lines of `bɔ̃ʒuʁ tu lə mɔ̃d`.

If you don't read the phonetic alphabet fluently, you can use this excellent [IPA reader website](http://ipa-reader.xyz) to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

## Link to phonemizer models trained on other body-conducted sensors

An entry point to all **phonemizer** models trained on different sensor data from the [Vibravox dataset](https://huggingface.co/datasets/Cnam-LMSSC/vibravox) is available at [https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers](https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers).

### Disclaimer

Each of these models has been trained for a **specific non-conventional speech sensor** and is intended to be used with **in-domain data**. The only exception is the headset microphone phonemizer, which can certainly be used for many applications relying on audio captured by airborne microphones.

Please be advised that using these models on out-of-domain sensor data may result in suboptimal performance.

## Training procedure

The model has been finetuned for 10 epochs with a constant learning rate of *1e-5*. To reproduce the experiment, please visit [jhauret/vibravox](https://github.com/jhauret/vibravox).
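
For orientation, here is a minimal sketch of this configuration using 🤗 Transformers `TrainingArguments`. Only the epoch count and the constant 1e-5 learning rate come from this card; every other value is an illustrative assumption (the exact setup lives in the repository linked above).

```python
from transformers import TrainingArguments

# Sketch of the finetuning configuration described above. Only the
# epoch count and the constant 1e-5 learning rate are from this card;
# the remaining values are illustrative assumptions.
training_args = TrainingArguments(
    output_dir="./phonemizer_forehead_accelerometer",  # hypothetical path
    num_train_epochs=10,            # 10 epochs, as stated above
    learning_rate=1e-5,             # constant learning rate of 1e-5 ...
    lr_scheduler_type="constant",   # ... with no warmup or decay
    per_device_train_batch_size=8,  # assumption, not from this card
    report_to="none",               # disable experiment tracking
)
```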

## Inference script

```python
import torch, torchaudio
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("Cnam-LMSSC/phonemizer_forehead_accelerometer")
model = AutoModelForCTC.from_pretrained("Cnam-LMSSC/phonemizer_forehead_accelerometer")
test_dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)

# Vibravox audio is stored at 48 kHz; the model expects 16 kHz input
audio_48kHz = torch.Tensor(next(iter(test_dataset))["audio.forehead_accelerometer"]["array"])
audio_16kHz = torchaudio.functional.resample(audio_48kHz, orig_freq=48_000, new_freq=16_000)

inputs = processor(audio_16kHz, sampling_rate=16_000, return_tensors="pt")

# Greedy CTC decoding: keep the most likely token at each frame
with torch.inference_mode():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription:", transcription)
```
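
To score a transcription against a reference, the phoneme error rate can be approximated as a character error rate over the IPA strings. Here is a minimal sketch using the 🤗 `evaluate` library, reusing `transcription` from the script above; the reference transcription is hypothetical.

```python
import evaluate

# Approximate the phoneme error rate (PER) as a character error rate
# over IPA strings; the reference below is a hypothetical ground truth.
per_metric = evaluate.load("cer")
reference = ["bɔ̃ʒuʁ tu lə mɔ̃d"]  # hypothetical reference transcription
per = per_metric.compute(predictions=transcription, references=reference)
print(f"PER (approximate): {per:.3f}")
```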