|
--- |
|
library_name: transformers |
|
language: |
|
- yo |
|
- ig |
|
- ha |
|
base_model: |
|
- HuggingFaceTB/SmolLM2-360M |
|
- saheedniyi/YarnGPT |
|
pipeline_tag: text-to-speech |
|
license: cc-by-nc-sa-4.0 |
|
--- |
|
|
|
# YarnGPT-local |
|
 |
|
|
|
## Table of Contents |
|
|
|
1. [Model Summary](#model-summary) |
|
2. [Model Description](#model-description) |
|
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations) |
|
   - [Recommendations](#recommendations)
|
4. [Speech Samples](#speech-samples) |
|
5. [Training](#training) |
|
6. [Future Improvements](#future-improvements) |
|
7. [Citation](#citation) |
|
8. [Credits & References](#credits--references) |
|
|
|
## Model Summary |
|
|
|
YarnGPT-local is a text-to-speech (TTS) model that synthesizes Yoruba, Igbo, and Hausa speech using pure language modelling, without external adapters or complex architectures, offering high-quality, natural, and culturally relevant speech synthesis for diverse applications.
|
|
|
<video controls width="600"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/YarnGPT-Local.mp4" type="video/mp4"> |
|
Your browser does not support the video tag. |
|
</video> |
|
|
|
#### How to use (on Google Colab) |
|
The model can generate audio on its own, but it is better to prompt it with one of the voices it supports. Ten voices are available by default (a helper for picking one programmatically is sketched after the list):
|
- hausa_female1 |
|
- hausa_female2 |
|
- hausa_male1 |
|
- hausa_male2 |
|
- igbo_female1 |
|
- igbo_female2 |
|
- igbo_male2 |
|
- yoruba_female1 |
|
- yoruba_female2 |
|
- yoruba_male2 |
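
If you want to pick a voice programmatically, a minimal helper along the lines below works; this is a sketch rather than part of the `yarngpt` package, and it only uses the default voice names listed above.

```python
import random

# Default voices grouped by language (names as listed above).
VOICES = {
    "yoruba": ["yoruba_female1", "yoruba_female2", "yoruba_male2"],
    "igbo": ["igbo_female1", "igbo_female2", "igbo_male2"],
    "hausa": ["hausa_female1", "hausa_female2", "hausa_male1", "hausa_male2"],
}

def pick_voice(lang: str) -> str:
    """Return a random default voice for the given language."""
    return random.choice(VOICES[lang])
```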
|
|
|
### Prompt YarnGPT-local |
|
```python |
|
# clone the YarnGPT repo to get access to the `audiotokenizer` |
|
!git clone https://github.com/saheedniyi02/yarngpt.git |
|
|
|
|
|
# install some necessary libraries |
|
!pip install outetts==0.2.3 uroman |
|
|
|
# import some important packages
|
import os |
|
import re |
|
import json |
|
import torch |
|
import inflect |
|
import random |
|
import uroman as ur |
|
import numpy as np |
|
import torchaudio |
|
import IPython |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
from outetts.wav_tokenizer.decoder import WavTokenizer |
|
from yarngpt.audiotokenizer import AudioTokenizerForLocal |
|
|
|
|
|
# download the wavtokenizer weights and config (to encode and decode the audio) |
|
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml |
|
!wget https://huggingface.co/novateur/WavTokenizer-large-speech-75token/resolve/main/wavtokenizer_large_speech_320_24k.ckpt |
|
|
|
# model path and WavTokenizer weight paths (these paths assume Google Colab; a different environment may save the weights elsewhere)
|
hf_path="saheedniyi/YarnGPT-local" |
|
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml" |
|
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt" |
|
|
|
# create the AudioTokenizer object |
|
audio_tokenizer=AudioTokenizerForLocal( |
|
hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path |
|
) |
|
|
|
# load the model weights
|
|
|
model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device) |
|
|
|
# your input text |
|
text="Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un." |
|
|
|
# create a prompt; the `speaker_name` parameter is optional
|
prompt=audio_tokenizer.create_prompt(text,"yoruba","yoruba_male2") |
|
|
|
# tokenize the prompt |
|
input_ids=audio_tokenizer.tokenize_prompt(prompt) |
|
|
|
# generate output from the model; tune the `.generate` parameters as you wish
|
output = model.generate( |
|
input_ids=input_ids, |
|
temperature=0.1, |
|
repetition_penalty=1.1, |
|
num_beams=4, |
|
max_length=4000, |
|
) |
|
|
|
# convert the output to "audio codes" |
|
codes=audio_tokenizer.get_codes(output) |
|
|
|
# convert the codes to audio
|
audio=audio_tokenizer.get_audio(codes) |
|
|
|
# play the audio |
|
IPython.display.Audio(audio,rate=24000) |
|
|
|
# save the audio |
|
torchaudio.save(f"audio.wav", audio, sample_rate=24000) |
|
``` |
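
The same pipeline works for the other two languages; the sketch below reuses the `model` and `audio_tokenizer` objects from the block above, changing only the text, language, and voice (the Hausa sentence is taken from the speech samples further down this page):

```python
# Reuse the loaded model and audio_tokenizer; only the language and
# speaker_name passed to create_prompt change.
hausa_text = (
    "A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa "
    "babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin "
    "babban jarumi."
)
prompt = audio_tokenizer.create_prompt(hausa_text, "hausa", "hausa_female1")
input_ids = audio_tokenizer.tokenize_prompt(prompt)
output = model.generate(
    input_ids=input_ids,
    temperature=0.1,
    repetition_penalty=1.1,
    num_beams=4,
    max_length=4000,
)
audio = audio_tokenizer.get_audio(audio_tokenizer.get_codes(output))
torchaudio.save("hausa_sample.wav", audio, sample_rate=24000)
```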
|
|
|
### Simple News-Reader for Local languages |
|
```python |
|
# clone the YarnGPT repo to get access to the `audiotokenizer` |
|
!git clone https://github.com/saheedniyi02/yarngpt.git |
|
|
|
|
|
# install some necessary libraries |
|
!pip install outetts uroman trafilatura pydub |
|
|
|
|
|
# import important packages
|
import os |
|
import re |
|
import json |
|
import torch |
|
import inflect |
|
import random |
|
import requests |
|
import trafilatura |
|
|
import uroman as ur |
|
import numpy as np |
|
import torchaudio |
|
import IPython |
|
from pydub import AudioSegment |
|
from pydub.effects import normalize |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
from outetts.wav_tokenizer.decoder import WavTokenizer |
|
from yarngpt.audiotokenizer import AudioTokenizer,AudioTokenizerForLocal |
|
|
|
# download the `WavTokenizer` files |
|
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml |
|
!wget https://huggingface.co/novateur/WavTokenizer-large-speech-75token/resolve/main/wavtokenizer_large_speech_320_24k.ckpt |
|
|
|
tokenizer_path="saheedniyi/YarnGPT-local" |
|
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml" |
|
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt" |
|
|
|
|
|
audio_tokenizer=AudioTokenizerForLocal( |
|
tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path |
|
) |
|
|
|
model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device) |
|
|
|
# Split text into chunks of at most `word_limit` words; a lone "." chunk
# marks each sentence boundary so the generation loop below can insert silence
def split_text_into_chunks(text, word_limit=25):
    sentences = [sentence.strip() for sentence in text.split(".") if sentence.strip()]
    chunks = []
    for sentence in sentences:
        chunks.append(".")
        words = sentence.split(" ")
        num_words = len(words)
        if num_words > word_limit:
            start_index = 0
            while start_index < num_words:
                end_index = min(num_words, start_index + word_limit)
                chunks.append(" ".join(words[start_index:end_index]))
                start_index = end_index
        else:
            chunks.append(sentence)
    return chunks
|
|
|
# reduce the speed of the audio; output for the local languages is often fast
|
def speed_change(sound, speed=0.9): |
|
# Manually override the frame_rate. This tells the computer how many |
|
# samples to play per second |
|
sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={ |
|
"frame_rate": int(sound.frame_rate * speed) |
|
}) |
|
# convert the sound with altered frame rate to a standard frame rate |
|
# so that regular playback programs will work right. They often only |
|
# know how to play audio at standard frame rate (like 44.1k) |
|
return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate) |
|
|
|
|
|
page=requests.get("https://alaroye.org/a-maa-too-fo-ipinle-ogun-mo-omo-egbe-okunkun-meje-lowo-ti-te-bayii-omolola/") |
|
content=trafilatura.extract(page.text) |
|
chunks=split_text_into_chunks(content) |
|
|
|
|
|
all_codes = []
for i, chunk in enumerate(chunks):
    print(i)
    print("\n")
    print(chunk)
    if chunk == ".":
        # insert roughly 0.5 seconds of silence at a sentence boundary
        # (38 codes at WavTokenizer's 75 tokens per second)
        all_codes.extend([453] * 38)
    else:
        prompt = audio_tokenizer.create_prompt(chunk, lang="yoruba", speaker_name="yoruba_female2")
        input_ids = audio_tokenizer.tokenize_prompt(prompt)
        output = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
            num_beams=5,
        )
        codes = audio_tokenizer.get_codes(output)
        all_codes.extend(codes)
|
|
|
|
|
audio=audio_tokenizer.get_audio(all_codes) |
|
|
|
# display the output
|
IPython.display.Audio(audio,rate=24000) |
|
|
|
# save the audio
|
torchaudio.save(f"news1.wav", audio, sample_rate=24000) |
|
|
|
# convert the file to an `AudioSegment` object for further processing
|
audio_dub=AudioSegment.from_file("news1.wav") |
|
|
|
# reduce the audio speed; note that this also reduces quality
slowed_audio = speed_change(audio_dub, 0.9)
|
``` |
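
Note that `speed_change` returns a new `AudioSegment` rather than modifying the input in place; to keep the slowed-down result, export it with pydub (this snippet assumes the `slowed_audio` variable from the block above):

```python
# Write the slowed-down audio to disk with pydub.
slowed_audio.export("news1_slow.wav", format="wav")
```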
|
|
|
|
|
## Model Description |
|
|
|
- **Developed by:** [Saheedniyi](https://linkedin.com/in/azeez-saheed) |
|
- **Model type:** Text-to-Speech |
|
- **Language(s):** Yoruba, Igbo, and Hausa (text → speech)
|
- **Finetuned from:** [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
|
- **Repository:** [YarnGPT Github Repository](https://github.com/saheedniyi02/yarngpt) |
|
- **Paper:** In progress.
|
- **Demo:**
  1. [Prompt YarnGPT-local notebook](https://colab.research.google.com/drive/1UWeirECQbjFGib1SqpiDdkzS1Bi_vi9i?usp=sharing)
  2. [Simple news reader: YarnGPT-local](https://colab.research.google.com/drive/1CMsLVsDaX2u4YUtV01fOvnDCtCC59bNe?usp=sharing)
|
|
|
#### Uses |
|
|
|
Generate Yoruba, Igbo, and Hausa speech for experimental purposes.
|
|
|
|
|
#### Out-of-Scope Use |
|
|
|
The model is not suitable for generating speech in languages other than Yoruba, Igbo and Hausa. |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.
- The audio generated by the model is sometimes very fast and may need post-processing (see the `speed_change` helper above).
- The model does not take tone and intonation into account, which sometimes leads to mispronunciation of some words.
- The model does not respond to some prompts.
|
|
|
|
|
|
|
#### Recommendations |
|
|
|
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Feedback and diverse training data contributions are encouraged. |
|
## Speech Samples |
|
|
|
Listen to samples generated by YarnGPT-local:
|
|
|
<div style="margin-top: 20px;"> |
|
<table style="width: 100%; border-collapse: collapse;"> |
|
<thead> |
|
<tr> |
|
<th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 40%;">Input</th> |
|
<th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 40%;">Audio</th> |
|
<th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 10%;">Notes</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td style="border: 1px solid #ddd; padding: 8px;">Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> |
|
<audio controls style="width: 100%;"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/Sample1_yor.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: yoruba_male2</td> |
|
</tr> |
|
<tr> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> Iwadii fihan pe ọkan lara awọn eeyan meji yii lo ṣee si ja sinu tanki epo disu naa lasiko to n ṣiṣẹ lọwọ.</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> |
|
<audio controls style="width: 100%;"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/Sample2_yor.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: yoruba_female1</td> |
|
</tr> |
|
<tr> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> Shirun da gwamnati mai ci yanzu ta yi wajen kin bayani a akan halin da ake ciki a game da batun kidayar shi ne ya janyo wannan zargi da jam'iyyar ta Labour ta yi.</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> |
|
<audio controls style="width: 100%;"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/Sample1_hau.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: hausa_male2</td> |
|
</tr> |
|
<tr> |
|
<td style="border: 1px solid #ddd; padding: 8px;">A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin babban jarumi.</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> |
|
<audio controls style="width: 100%;"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/Sample2_hau.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: hausa_female1</td> |
|
</tr> |
|
<tr> |
|
<td style="border: 1px solid #ddd; padding: 8px;">Amụma ndị ọzọ o buru gụnyere inweta ihe zuru oke, ịmụta ụmụaka nye ndị na-achọ nwa</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;"> |
|
<audio controls style="width: 100%;"> |
|
<source src="https://huggingface.co/saheedniyi/YarnGPT-local/resolve/main/audio/Sample1_igb.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: igbo_female1</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
|
|
|
|
## Training |
|
|
|
#### Data |
|
Trained on open-source Yoruba, Igbo, and Hausa speech datasets.
|
|
|
#### Preprocessing |
|
|
|
Audio files were preprocessed, resampled to 24 kHz, and tokenized using [WavTokenizer](https://huggingface.co/novateur/WavTokenizer).
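
As a rough illustration of that step, resampling with torchaudio might look like the sketch below; the file name and original sample rate are placeholders, and the actual preprocessing script is not published.

```python
import torchaudio

# Load a training clip and resample it to the 24 kHz rate expected by
# WavTokenizer; "clip.wav" is a placeholder file name.
waveform, orig_sr = torchaudio.load("clip.wav")
resample = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=24000)
waveform_24k = resample(waveform)
```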
|
|
|
#### Training Hyperparameters |
|
- **Number of epochs:** 5 |
|
- **batch_size:** 4 |
|
- **Scheduler:** linear warmup for the first 4 epochs, then linear decay to zero over the last epoch
|
- **Optimizer:** AdamW (betas=(0.9, 0.95),weight_decay=0.01) |
|
- **Learning rate:** 1e-3
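
Put together, the optimizer and schedule above could be reproduced roughly as follows; this is a sketch, `steps_per_epoch` is a placeholder (the dataset size is not stated), and the actual training script is not published.

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

steps_per_epoch = 1000  # placeholder; depends on the dataset size

optimizer = AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=0.01)
# Linear warmup over the first 4 epochs, then linear decay to zero
# over the final epoch, matching the schedule described above.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4 * steps_per_epoch,
    num_training_steps=5 * steps_per_epoch,
)
```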
|
|
|
#### Hardware |
|
|
|
- **GPUs:** 1× A100 (Google Colab, ~30 hours)
|
|
|
#### Software |
|
|
|
- **Training Framework:** PyTorch
|
|
|
## Future Improvements

- Scaling up the model size and training data
- Wrapping the model in an API endpoint
- Voice cloning
- Potential expansion into speech-to-speech assistant models
|
|
|
## Citation
|
|
|
#### BibTeX: |
|
|
|
```bibtex
@misc{yarngptlocal2025,
  author    = {Saheed Azeez},
  title     = {YarnGPT-local: Nigerian Languages Text-to-Speech Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/saheedniyi/YarnGPT-local}
}
```
|
|
|
#### APA: |
|
|
|
```text
Saheed Azeez. (2025). YarnGPT-local: Nigerian Languages Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT-local
```
|
|
|
|
|
## Credits & References |
|
- [OuteAI/OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M/) |
|
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) |
|
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html) |
|
- [Voicera](https://huggingface.co/Lwasinam/voicera) |