---
license: mit
language:
- ar
pipeline_tag: text-to-speech
---

# ArTST: SpeechT5 for Arabic (TTS task)

Here we use the pretrained weights from ArTST, fine-tuned with the Hugging Face implementation of SpeechT5 on the Classical Arabic ClArTTS corpus for speech synthesis (text-to-speech).

ArTST was first released in [this repository](https://github.com/mbzuai-nlp/ArTST), with [pretrained weights](https://huggingface.co/MBZUAI/ArTST/blob/main/pretrain_checkpoint.pt) available on the Hub.

# Uses
## 🤗 Transformers Usage

You can run ArTST TTS locally with the 🤗 Transformers library.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile, and (optionally) datasets:

```
pip install --upgrade pip
pip install --upgrade transformers sentencepiece soundfile datasets[audio]
```
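Before running the examples, it can help to confirm the environment is ready. A minimal sketch (the package list mirrors the pip install above; adjust it to what you actually installed):

```python
import importlib.util

# Check that each required package is importable before running the examples.
for pkg in ["transformers", "sentencepiece", "soundfile", "datasets"]:
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'MISSING'}")
```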
2. Run inference via the `"text-to-speech"` (TTS) pipeline. You can access the Arabic SpeechT5 model via the TTS pipeline in just a few lines of code!

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", "MBZUAI/speecht5_tts_clartts_ar")

# Load an x-vector speaker embedding; you can replace this with your own.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# ArTST is trained without diacritics.
speech = synthesiser("لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", forward_params={"speaker_embeddings": speaker_embedding})

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```
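The pipeline returns a dict with `"audio"` (a float waveform array) and `"sampling_rate"`. If soundfile is unavailable, the waveform can also be written with the standard-library `wave` module after converting to 16-bit PCM. A minimal sketch, with a synthetic tone standing in for `speech["audio"]`:

```python
import math
import struct
import wave

def write_wav_pcm16(path, samples, sampling_rate):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sampling_rate)
        f.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))

# A synthetic 440 Hz tone stands in for speech["audio"] here.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
write_wav_pcm16("speech.wav", tone, sr)
```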
3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
model = SpeechT5ForTextToSpeech.from_pretrained("MBZUAI/speecht5_tts_clartts_ar")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="لأنه لا يرى أنه على السفه ثم من بعد ذلك حديث منتشر", return_tensors="pt")

# load an x-vector containing the speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

sf.write("speech.wav", speech.numpy(), samplerate=16000)
```
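Both examples pass undiacritized text, since ArTST is trained without diacritics. If your input carries tashkeel, one way to strip it is a regex over the Arabic combining-mark range. A minimal sketch (exactly which marks to remove is your choice):

```python
import re

# Arabic harakat/tashkeel combining marks (U+064B-U+0652) plus the
# dagger alif (U+0670); remove them before synthesis.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

print(strip_diacritics("مُحَمَّدٌ"))  # -> محمد
```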

# Citation

**BibTeX:**

```bibtex
@inproceedings{toyin-etal-2023-artst,
    title = "{A}r{TST}: {A}rabic Text and Speech Transformer",
    author = "Toyin, Hawau and
      Djanibekov, Amirbek and
      Kulkarni, Ajinkya and
      Aldarmaki, Hanan",
    editor = "Sawaf, Hassan and
      El-Beltagy, Samhaa and
      Zaghouani, Wajdi and
      Magdy, Walid and
      Abdelali, Ahmed and
      Tomeh, Nadi and
      Abu Farha, Ibrahim and
      Habash, Nizar and
      Khalifa, Salam and
      Keleg, Amr and
      Haddad, Hatem and
      Zitouni, Imed and
      Mrini, Khalil and
      Almatham, Rawan",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.5",
    pages = "41--51"
}
@inproceedings{ao-etal-2022-speecht5,
    title = "{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing",
    author = "Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    pages = "5723--5738"
}
```