|
--- |
|
library_name: transformers |
|
tags: |
|
- unsloth |
|
- text-to-audio |
|
- s2s |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- KandirResearch/Speech2Speech |
|
language: |
|
- en |
|
base_model: |
|
- OuteAI/OuteTTS-0.3-500M |
|
pipeline_tag: text-to-audio |
|
--- |
|
# CiSiMi: A Text-to-Audio Model
|
|
|
[Support on Ko-fi](https://ko-fi.com/lyte)

[Training Dataset](https://huggingface.co/datasets/KandirResearch/Speech2Speech)

[Model Weights](https://huggingface.co/KandirResearch/CiSiMi-v0.1)

[Demo Space](https://huggingface.co/spaces/KandirResearch/CiSiMi-At-Home)
|
|
|
## Overview |
|
|
|
CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs. |
|
|
|
*"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"* |
|
|
|
This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities. |
|
|
|
## Technical Details |
|
|
|
### Model Specifications |
|
- **Architecture**: Based on OuteTTS-0.3-500M |
|
- **Languages**: English |
|
- **Pipeline**: Text-to-audio |
|
- **Parameters**: 500M |
|
- **Training Dataset Size**: ~15k samples |
|
- **Future Goals**: Scale the dataset to 200k-500k samples with multi-turn conversations, train both 500M and 1B parameter model variants, and add streaming for real-time responses.
|
|
|
### Training Methodology |
|
|
|
1. **Dataset Preparation**: |
|
- Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) |
|
- Cleaned by removing code, mathematical expressions, and non-English content |
|
- Filtered to keep only entries whose input + output text is 256 tokens or fewer (a filtering and verification sketch follows this list)
|
|
|
2. **Audio Generation**: |
|
- Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) |
|
- Verified each audio generation using [OpenAI Whisper](https://github.com/openai/whisper) |
|
- Published the resulting dataset as [KandirResearch/Speech2Speech](https://huggingface.co/datasets/KandirResearch/Speech2Speech) |
|
|
|
3. **Model Training**: |
|
- Preprocessed dataset using modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md)) |
|
- Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT |
|
- Trained for 6 epochs reaching a loss of 2.27 as a proof of concept |
|
- ~~Trained for 3 epochs reaching a loss of 2.42 as a proof of concept~~
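
The dataset preparation and verification steps above can be sketched roughly as follows. This is a minimal illustration: the column names (`input`, `output`, `audio_path`) and the exact-match comparison are assumptions for demonstration, not the pipeline actually used to build the dataset.

```python
# Illustrative sketch of the filtering and verification steps described above.
# Column names and the exact-match check are assumptions, not the real pipeline.
import whisper
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")
dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example):
    # Keep only entries whose input + output text fits the 256-token budget
    n_tokens = len(tokenizer.encode(example["input"] + " " + example["output"]))
    return n_tokens <= 256

dataset = dataset.filter(short_enough)

# Verify a synthesized clip by transcribing it with Whisper
# and comparing the transcript against the target text
asr = whisper.load_model("base")

def transcription_matches(wav_path, target_text):
    transcript = asr.transcribe(wav_path)["text"]
    # A real pipeline would use a fuzzier comparison (e.g. WER) rather than exact match
    return transcript.strip().lower() == target_text.strip().lower()
```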
|
|
|
## Usage Guide |
|
|
|
### Sample |
|
|
|
``` |
|
Explain to me how gravity works! |
|
``` |
|
|
|
<audio controls><source src="https://huggingface.co/KandirResearch/CiSiMi-v0.1/resolve/main/sample.wav" type="audio/wav"></audio> |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install outetts llama-cpp-python --upgrade |
|
pip install huggingface_hub sounddevice |
|
``` |
|
|
|
### Implementation |
|
|
|
```python |
|
import sys

import torch
|
import outetts |
|
import numpy as np |
|
from huggingface_hub import hf_hub_download |
|
from outetts.wav_tokenizer.audio_codec import AudioCodec |
|
from outetts.version.v2.prompt_processor import PromptProcessor |
|
from outetts.version.playback import ModelOutput |
|
|
|
# Download the model |
|
model_path = hf_hub_download( |
|
repo_id="KandirResearch/CiSiMi-v0.1", |
|
filename="unsloth.Q8_0.gguf", |
|
) |
|
|
|
# Configure the model |
|
model_config = outetts.GGUFModelConfig_v2( |
|
model_path=model_path, |
|
tokenizer_path="KandirResearch/CiSiMi-v0.1", |
|
) |
|
|
|
# Initialize components |
|
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config) |
|
audio_codec = AudioCodec() |
|
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1") |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
gguf_model = interface.get_model() |
|
|
|
# Helper function to extract audio from tokens |
|
def get_audio(tokens): |
|
outputs = prompt_processor.extract_audio_from_tokens(tokens) |
|
if not outputs: |
|
return None |
|
audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device)) |
|
return ModelOutput(audio_tensor, audio_codec.sr) |
|
|
|
# Helper function to clean text output |
|
def extract_text_from_tts_output(tts_output): |
|
text = "" |
|
for line in tts_output.strip().split('\n'): |
|
if '<|audio_end|>' in line or '<|im_end|>' in line: |
|
continue |
|
if '<|' in line: |
|
word = line.split('<|')[0].strip() |
|
if word: |
|
text += word + " " |
|
else: |
|
text += line.strip() + " " |
|
return text.strip() |
|
|
|
# Generate response function |
|
def generate_response(instruction): |
|
prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n" |
|
gen_cfg = outetts.GenerationConfig( |
|
text=prompt, |
|
temperature=0.6, |
|
repetition_penalty=1.1, |
|
max_length=4096, |
|
speaker=None |
|
) |
|
|
|
input_ids = prompt_processor.tokenizer.encode(prompt) |
|
tokens = gguf_model.generate(input_ids, gen_cfg) |
|
|
|
output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False) |
|
|
|
if "<|audio_end|>" in output_text: |
|
first_part, _, _ = output_text.partition("<|audio_end|>") |
|
|
|
if "<|audio_end|>\n<|im_end|>\n" not in first_part: |
|
first_part += "<|audio_end|>\n<|im_end|>\n" |
|
|
|
extracted_text = extract_text_from_tts_output(first_part) |
|
|
|
audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n") |
|
audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n") |
|
|
|
if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos: |
|
audio_tokens_text = first_part[audio_start_pos:audio_end_pos] |
|
audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text) |
|
audio_output = get_audio(audio_tokens) |
|
|
|
if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None: |
|
audio_numpy = audio_output.audio.cpu().numpy() |
|
if audio_numpy.ndim > 1: |
|
audio_numpy = audio_numpy.squeeze() |
|
|
|
return extracted_text, (audio_output.sr, audio_numpy) |
|
|
|
return output_text, None |
|
|
|
# Example usage |
|
question = "What is the meaning of life?" |
|
response_text, response_audio = generate_response(question) |
|
print(response_text) |
|
|
|
# Play audio if available |
|
if response_audio is not None: |
|
if "ipykernel" in sys.modules: |
|
from IPython.display import display, Audio |
|
display(Audio(response_audio[1], rate=response_audio[0], autoplay=True)) |
|
else: |
|
import sounddevice as sd |
|
sd.play(response_audio[1], samplerate=response_audio[0]) |
|
sd.wait() |
|
``` |
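
If you want to keep the generated reply, the small helper below writes the returned float audio to a 16-bit WAV file. It is a convenience sketch using only the standard library (plus NumPy, already imported above) and assumes the audio is mono and in the [-1, 1] range.

```python
import wave

import numpy as np

def save_wav(path, audio, sample_rate):
    """Write mono float audio (assumed in [-1, 1]) to a 16-bit PCM WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

if response_audio is not None:
    # response_audio is (sample_rate, audio_numpy) as returned by generate_response
    save_wav("cisimi_reply.wav", response_audio[1], response_audio[0])
```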
|
|
|
## Limitations & Future Work |
|
|
|
This early prototype has several areas for improvement: |
|
- Limited training data (~15k samples) |
|
- Basic prompt/chat template structure |
|
- Opportunity to optimize training hyperparameters |
|
- Potential for multi-turn conversation capabilities |
|
|
|
**Potential Limitation**: This type of model fills up the context window quickly, which generally makes smaller models more practical to deploy.
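
As a rough, back-of-the-envelope illustration of why the window fills up (the ~75 audio tokens per second figure is an assumed WavTokenizer-style rate, not a measured value for this model):

```python
# Back-of-the-envelope context budget for a single response.
context_length = 4096          # max_length used in the generation config above
audio_tokens_per_second = 75   # assumption: approximate WavTokenizer-style rate
text_and_special_tokens = 300  # rough allowance for prompt, answer text, and markers

audio_budget = context_length - text_and_special_tokens
print(f"~{audio_budget / audio_tokens_per_second:.0f} seconds of audio fit in one response")
# -> roughly 50 seconds of audio before the window is exhausted
```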
|
|
|
## Acknowledgments & Citations |
|
|
|
This model builds on the following open-source projects: |
|
|
|
1. [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model |
|
2. [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset |
|
3. [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation |
|
4. [OpenAI Whisper](https://github.com/openai/whisper) - Speech verification |
|
5. [Unsloth](https://github.com/unslothai/unsloth) - Training optimization |