---
tags:
- audio
- text-to-speech
- onnx
inference: false
language: en
license: apache-2.0
library_name: txtai
---
# SpeechT5 Text-to-Speech (TTS) Model for ONNX
Fine-tuned version of [SpeechT5 TTS](https://huggingface.co/microsoft/speecht5_tts), exported to ONNX using the [Optimum](https://github.com/huggingface/optimum) library.
## Usage with txtai
[txtai](https://github.com/neuml/txtai) has a built-in Text to Speech (TTS) pipeline that makes using this model easy.
_Note: the following example requires txtai >= 7.5._
```python
import numpy as np
import soundfile as sf
from txtai.pipeline import TextToSpeech
# Build pipeline
tts = TextToSpeech("NeuML/txtai-speecht5-onnx")
# Generate speech
speech, rate = tts("Say something here")
# Write to file
sf.write("out.wav", speech, rate)
# Generate speech with custom speaker
speech, rate = tts("Say something here", speaker=np.array(...))
```
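As a quick sanity check on the output, the clip length follows from the number of samples and the sampling rate returned by the pipeline. The helper below is a minimal sketch: `duration` is a hypothetical name, not part of txtai, and it assumes `speech` is a one-dimensional sequence of samples.

```python
def duration(speech, rate):
    """
    Returns the length of a clip in seconds.

    Hypothetical helper, not part of txtai. Assumes speech is a
    one-dimensional sequence of audio samples and rate is in Hz.
    """

    if rate <= 0:
        raise ValueError("rate must be positive")

    return len(speech) / rate
```

Calling `duration(speech, rate)` on the values from the example above should give the length of `out.wav` in seconds.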
## Model training
This model was fine-tuned using the code in this [Hugging Face article](https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning) and a custom set of WAV files.
The ONNX export uses the following code, which requires installing `optimum`.
```python
import os
from optimum.exporters.onnx import main_export
from optimum.onnx import merge_decoders
# Params
model = "txtai-speecht5-tts"
output = "txtai-speecht5-onnx"
# ONNX Export
main_export(
    task="text-to-audio",
    model_name_or_path=model,
    model_kwargs={
        "vocoder": "microsoft/speecht5_hifigan"
    },
    output=output
)
# Merge into single decoder model
merge_decoders(
    f"{output}/decoder_model.onnx",
    f"{output}/decoder_with_past_model.onnx",
    save_path=f"{output}/decoder_model_merged.onnx",
    strict=False
)
# Remove unnecessary files
os.remove(f"{output}/decoder_model.onnx")
os.remove(f"{output}/decoder_with_past_model.onnx")
```
## Custom speaker embeddings
When no speaker argument is passed in, the default speaker embeddings are used. The default speaker is David Mezzetti, the primary developer of txtai.
It's possible to build custom speaker embeddings as shown below. Fine-tuning the model with a new voice leads to the best results, but zero-shot speaker embeddings can be adequate in some cases.
The following code requires installing `torchaudio` and `speechbrain`.
```python
import os
import numpy as np
import torchaudio
from speechbrain.inference import EncoderClassifier
def speaker(path):
    """
    Extracts a speaker embedding from an audio file.

    Args:
        path: file path

    Returns:
        speaker embeddings
    """

    model = "speechbrain/spkrec-xvect-voxceleb"
    encoder = EncoderClassifier.from_hparams(
        model,
        savedir=os.path.join("/tmp", model),
        run_opts={"device": "cuda"}
    )

    samples, sr = torchaudio.load(path)
    samples = encoder.audio_normalizer(samples[0], sr)
    embedding = encoder.encode_batch(samples.unsqueeze(0))

    return embedding[0,0].to("cuda").unsqueeze(0)

embedding = speaker("reference.wav")
np.save("speaker.npy", embedding.cpu().numpy(), allow_pickle=False)
```
Then load the saved embeddings and pass them to the pipeline as shown below.
```python
speech, rate = tts("Say something here", speaker=np.load("speaker.npy"))
```
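Before passing a saved array to the pipeline, it can help to check its shape. The sketch below assumes the embeddings are 512-dimensional x-vectors with a leading batch dimension, i.e. shape `(1, 512)`, which is what the `speaker` function above should produce with `speechbrain/spkrec-xvect-voxceleb`. `validate` is a hypothetical helper, not part of txtai, and the cast to `float32` is likewise an assumption.

```python
import numpy as np

def validate(embedding, dim=512):
    """
    Hypothetical helper: checks a speaker embedding array before use.
    Assumes a (1, dim) float array, with dim=512 for x-vectors.
    """

    embedding = np.asarray(embedding)

    if embedding.shape != (1, dim):
        raise ValueError(f"expected shape (1, {dim}), got {embedding.shape}")

    if not np.isfinite(embedding).all():
        raise ValueError("embedding contains NaN or infinite values")

    # float32 is an assumption here, matching typical ONNX float inputs
    return embedding.astype(np.float32)
```

It could then be used as `speaker=validate(np.load("speaker.npy"))`.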
Speaker embeddings from the original SpeechT5 TTS training set are supported. See the [README](https://huggingface.co/microsoft/speecht5_tts#%F0%9F%A4%97-transformers-usage) for more.