---
library_name: transformers
license: mit
language: fr
datasets:
- Cnam-LMSSC/vibravox
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
- phoneme
model-index:
- name: Wav2Vec2-base French finetuned for Speech-to-Phoneme by LMSSC
  results:
  - task:
      name: Speech-to-Phoneme
      type: automatic-speech-recognition
    dataset:
      name: Vibravox["body_conducted.in_ear.rigid_earpiece_microphone"]
      type: Cnam-LMSSC/vibravox
      args: fr
    metrics:
    - name: Test PER on Vibravox["body_conducted.in_ear.rigid_earpiece_microphone"] | Trained
      type: per
      value: 3.998
---

# Model Card 

- **Developed by:** [Cnam-LMSSC](https://huggingface.co/Cnam-LMSSC)
- **Model type:** [Wav2Vec2ForCTC](https://huggingface.co/transformers/v4.9.2/model_doc/wav2vec2.html#transformers.Wav2Vec2ForCTC)
- **Language:** French
- **License:** MIT
- **Finetuned from model:** [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2)
- **Finetuned dataset:** `body_conducted.in_ear.rigid_earpiece_microphone` audio of the `speech_clean` subset of [Cnam-LMSSC/vibravox](https://huggingface.co/datasets/Cnam-LMSSC/vibravox)
- **Samplerate for usage:** 16kHz

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6390fc80e6d656eb421bab69/KkZoQQmrn53U6BTLmr0XK.png" />
</p>

## Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of [IPA-encoded](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) words, without punctuation.
If you don't read the phonetic alphabet fluently, you can use this excellent [IPA reader website](http://ipa-reader.xyz) to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.
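
If you want to check exactly which IPA symbols the model can emit, the tokenizer vocabulary lists them. The snippet below is a minimal sketch, using the same repository id as the inference script further down this card:

```python
from transformers import AutoProcessor

# Load the processor attached to this checkpoint (repository id as used in the inference script below)
processor = AutoProcessor.from_pretrained("Cnam-LMSSC/body_conducted.in_ear.rigid_earpiece_microphone")

# The CTC tokenizer vocabulary maps IPA symbols (plus special tokens) to ids
vocab = processor.tokenizer.get_vocab()
print(sorted(vocab, key=vocab.get))
```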

## Links to other phonemizer models trained on other body-conducted sensors

An entry point to all **phonemizer** (speech-to-phoneme ASR) models trained on different sensor data from the [Vibravox dataset](https://huggingface.co/datasets/Cnam-LMSSC/vibravox) is available at [https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers](https://huggingface.co/Cnam-LMSSC/vibravox_phonemizers).

### Disclaimer
Each of these models has been trained for a **specific non-conventional speech sensor** and is intended to be used with **in-domain data**. The only exception is the headset microphone phonemizer, which can also be used for many applications relying on audio captured by conventional airborne microphones.

Please be advised that using these models outside their intended sensor data may result in suboptimal performance.

## Training procedure

The model has been finetuned for 10 epochs with a constant learning rate of *1e-5*. To reproduce the experiment, please visit [jhauret/vibravox](https://github.com/jhauret/vibravox).
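
The exact training code lives in the repository linked above. As a rough illustration of the stated hyperparameters, a `transformers` configuration could look like the sketch below; only the epoch count and learning rate come from this card, everything else is an assumption:

```python
from transformers import TrainingArguments

# Hypothetical sketch reflecting the hyperparameters stated above;
# the actual training setup is in the jhauret/vibravox repository.
training_args = TrainingArguments(
    output_dir="wav2vec2-base-fr-phonemizer",  # illustrative name
    num_train_epochs=10,                       # 10 epochs (from this card)
    learning_rate=1e-5,                        # constant learning rate of 1e-5 (from this card)
    lr_scheduler_type="constant",
    per_device_train_batch_size=8,             # assumption, not documented here
)
```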

## Inference script

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset

processor = AutoProcessor.from_pretrained("Cnam-LMSSC/body_conducted.in_ear.rigid_earpiece_microphone")
model = AutoModelForCTC.from_pretrained("Cnam-LMSSC/body_conducted.in_ear.rigid_earpiece_microphone")
test_dataset = load_dataset("Cnam-LMSSC/vibravox", "speech_clean", split="test", streaming=True)

# Vibravox audio is stored at 48 kHz; the model expects 16 kHz input
audio_48kHz = torch.Tensor(next(iter(test_dataset))["audio.body_conducted.in_ear.rigid_earpiece_microphone"]["array"])
audio_16kHz = torchaudio.functional.resample(audio_48kHz, orig_freq=48_000, new_freq=16_000)

# Run the CTC model and decode the most likely phoneme at each frame
inputs = processor(audio_16kHz, sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription: ", transcription)
```

## Test results

In the table below, we report the Phoneme Error Rate (PER) of the model on the different microphones and subsets of Vibravox:

| Test Set  | PER |
| ------------- | ------------- |
| Vibravox/speech_clean/airborne.mouth_headworn.reference_microphone | **??%** |
| Vibravox/speech_clean/body_conducted.forehead.miniature_accelerometer | **??%** |
| Vibravox/speech_clean/body_conducted.in_ear.comply_foam_microphone | **??%** |
| Vibravox/speech_clean/body_conducted.in_ear.rigid_earpiece_microphone | **3.998%** |
| Vibravox/speech_clean/body_conducted.throat.piezoelectric_sensor | **??%** |
| Vibravox/speech_clean/body_conducted.temple.contact_microphone | **??%** |
| Vibravox/speech_noisy/airborne.mouth_headworn.reference_microphone | **??%** |
| Vibravox/speech_noisy/body_conducted.forehead.miniature_accelerometer | **??%** |
| Vibravox/speech_noisy/body_conducted.in_ear.comply_foam_microphone | **??%** |
| Vibravox/speech_noisy/body_conducted.in_ear.rigid_earpiece_microphone | **??%** |
| Vibravox/speech_noisy/body_conducted.throat.piezoelectric_sensor | **??%** |
| Vibravox/speech_noisy/body_conducted.temple.contact_microphone | **??%** |
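
For reference, the PER values above are edit-distance rates between predicted and reference phoneme sequences. The sketch below illustrates one way to compute such a score with the `jiwer` package, treating each IPA symbol as a character; the transcriptions are hypothetical, and this is not necessarily the exact metric implementation used for the table:

```python
import jiwer

# Hypothetical reference and predicted IPA strings, for illustration only
reference = "bɔ̃ʒuʁ"
prediction = "bɔ̃ʒu"

# Character error rate over IPA strings as a PER-style score
per = jiwer.cer(reference, prediction)
print(f"PER ≈ {per:.3f}")
```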