File size: 2,472 Bytes
4932285
 
1cdb121
 
 
 
 
4932285
1cdb121
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
tags:
- automatic-speech-recognition
---

# Wav2vec2-CTC-based French Phonemizer

## Usage

*Infer audio*

```python
import soundfile as sf
import torch
from transformers import AutoModelForCTC, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/phonemizer-wav2vec2-ctc-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model_sample_rate = processor.feature_extractor.sampling_rate
model = AutoModelForCTC.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
)

# Example audio
audio_file_path = "/path/to/example/wav/file"

# Infer with pipeline
result = pipe(audio_file_path)
print(result["text"])

# Infer w/ lower-level api
waveform, sample_rate = sf.read(audio_file_path, start=0, frames=-1, dtype="float32", always_2d=False)

input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    input_values = input_dict.input_values.to(device, dtype=torch_dtype)
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_text = processor.batch_decode(predicted_ids)[0]
print(predicted_text)
```



*Phonemes were generated using the following code snippet:*

```python
# !pip install phonemizer
from phonemizer.backend import EspeakBackend
from phonemizer.separator import Separator

# initialize the espeak backend for French
backend = EspeakBackend("fr-fr", language_switch="remove-flags")

# separate phones by a space and ignoring words boundaries
separator = Separator(phone=None, word=" ", syllable="")

def phonemize_text_phonemizer(s):
    return backend.phonemize([s], separator=separator, strip=True, njobs=1)[0]

input_str = "ce modèle est utilisé pour identifier les phonèmes dans l'audio entrant"
print(phonemize_text_phonemizer(input_str))
# 'sə modɛl ɛt ytilize puʁ idɑ̃tifje le fonɛm dɑ̃ lodjo ɑ̃tʁɑ̃'
```

## Acknowledgement

Inspired by [Cnam-LMSSC/wav2vec2-french-phonemizer](https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer)