---
library_name: transformers
tags:
- unsloth
- text-to-audio
- s2s
license: cc-by-sa-4.0
datasets:
- KandirResearch/Speech2Speech
language:
- en
base_model:
- OuteAI/OuteTTS-0.3-500M
pipeline_tag: text-to-audio
---
# CiSiMi: A Text-to-Speech (TTS) Model
[![Buy Me A Coffee](https://img.shields.io/badge/Ko--fi-Support%20My%20Work-FF5E5B?style=for-the-badge&logo=ko-fi&logoColor=white)](https://ko-fi.com/lyte)
[![Dataset](https://img.shields.io/badge/Dataset-KandirResearch/Speech2Speech-blue)](https://huggingface.co/datasets/KandirResearch/Speech2Speech)
[![Model](https://img.shields.io/badge/Model-KandirResearch/CiSiMi--v0.1-green)](https://huggingface.co/KandirResearch/CiSiMi-v0.1)
[![Demo](https://img.shields.io/badge/Demo-KandirResearch/CiSiMi--At--Home-orange)](https://huggingface.co/spaces/KandirResearch/CiSiMi-At-Home)
## Overview
CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.
*"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"*
This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.
## Technical Details
### Model Specifications
- **Architecture**: Based on OuteTTS-0.3-500M
- **Languages**: English
- **Pipeline**: Text-to-audio
- **Parameters**: 500M
- **Training Dataset Size**: ~15k samples
- **Future Goals**: Scale the dataset to 200k-500k samples with multi-turn conversations, train both 500M and 1B parameter model variants, and add streaming for real-time use
### Training Methodology
1. **Dataset Preparation**:
- Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct)
- Cleaned by removing code, mathematical expressions, and non-English content
- Filtered to keep only entries with input+output texts of 256 tokens or less
2. **Audio Generation**:
- Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
- Verified each audio generation using [OpenAI Whisper](https://github.com/openai/whisper) (see the data-preparation sketch after this list)
- Published the resulting dataset as [KandirResearch/Speech2Speech](https://huggingface.co/datasets/KandirResearch/Speech2Speech)
3. **Model Training**:
- Preprocessed dataset using modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md))
- Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT (see the training sketch after this list)
- Trained for 6 epochs, reaching a loss of 2.27 as a proof of concept
- ~~Trained for 3 epochs, reaching a loss of 2.42 as a proof of concept~~
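
The original preprocessing scripts are not published here, so the data-preparation sketch below is only a rough illustration of the filtering and verification steps above. The split name, dataset column names, Whisper model size, and word-overlap threshold are assumptions, and the removal of code/math/non-English content as well as the Kokoro-82M synthesis step are omitted.
```python
# Rough sketch of the dataset preparation described above; NOT the original scripts.
# Split name, column names, and thresholds are assumptions for illustration.
import whisper
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")
asr = whisper.load_model("base")  # used only to sanity-check the generated audio

dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example, limit=256):
    # Keep only entries whose input and output both fit the 256-token budget.
    return (len(tokenizer.encode(example["input"])) <= limit
            and len(tokenizer.encode(example["output"])) <= limit)

filtered = dataset.filter(short_enough)

def passes_whisper_check(target_text, wav_path, min_overlap=0.8):
    # Transcribe the clip produced by the TTS step (synthesis not shown here)
    # and require rough word overlap with the target text before keeping it.
    transcript = set(asr.transcribe(wav_path)["text"].lower().split())
    target = set(target_text.lower().split())
    return len(transcript & target) / max(len(target), 1) >= min_overlap
```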
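
The exact training configuration is likewise not published. The training sketch below shows a minimal Unsloth SFT setup under the assumption that the Speech2Speech dataset has already been rendered into OuteTTS-style prompt strings in a `text` column; the LoRA settings and hyperparameters are illustrative guesses rather than the values used for this checkpoint, and the `unsloth`/`trl` APIs may differ between versions.
```python
# Illustrative Unsloth SFT setup; hyperparameters, LoRA use, and column names
# are assumptions, not the exact configuration behind CiSiMi-v0.1.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="OuteAI/OuteTTS-0.3-500M",
    max_seq_length=4096,
)

# Whether the release used LoRA or full fine-tuning is not stated; LoRA shown here.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Assumes prompts were pre-rendered into a "text" column per the OuteTTS training guide.
train_dataset = load_dataset("KandirResearch/Speech2Speech", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=6,
        learning_rate=2e-4,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```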
## Usage Guide
### Sample
```
Explain to me how gravity works!
```
<audio controls><source src="https://huggingface.co/KandirResearch/CiSiMi-v0.1/resolve/main/sample.wav" type="audio/wav"></audio>
### Installation
```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```
### Implementation
```python
import sys

import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the quantized model weights
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper: decode generated audio tokens into a waveform
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper: strip special tokens and audio codes, keeping only the text answer
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split("\n"):
        if "<|audio_end|>" in line or "<|im_end|>" in line:
            continue
        if "<|" in line:
            word = line.split("<|")[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate a text + audio response for an instruction
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt,
        temperature=0.6,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=None,
    )
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)

    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"
        extracted_text = extract_text_from_tts_output(first_part)
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")
        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            if audio_output is not None and hasattr(audio_output, "audio") and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                return extracted_text, (audio_output.sr, audio_numpy)

    # Fall back to the raw decoded text if no audio could be extracted
    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:  # notebook: render an inline audio player
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:  # script: play through the default audio device
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```
## Limitations & Future Work
This early prototype has several areas for improvement:
- Limited training data (~15k samples)
- Basic prompt/chat template structure
- Opportunity to optimize training hyperparameters
- Potential for multi-turn conversation capabilities
**Potential Limitation**: Because every second of audio costs many tokens, this type of model fills up the context window quickly, which makes smaller models generally more practical to deploy.
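
For a sense of scale, assume, purely for illustration, that the audio codec spends roughly 75 tokens per second of speech; a 4096-token window then holds well under a minute of audio once the instruction and the text answer are accounted for:
```python
# Back-of-envelope context budget; the tokens-per-second figure is an assumption.
CONTEXT_WINDOW = 4096          # max_length used in the generation example above
AUDIO_TOKENS_PER_SECOND = 75   # assumed codec rate, for illustration only
PROMPT_AND_TEXT_TOKENS = 600   # rough allowance for the instruction and text answer

audio_budget = CONTEXT_WINDOW - PROMPT_AND_TEXT_TOKENS
print(f"~{audio_budget / AUDIO_TOKENS_PER_SECOND:.0f} s of audio fits in one context window")
# prints "~47 s ...", which is why multi-turn exchanges would fill the window quickly
```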
## Acknowledgments & Citations
This model builds on the following open-source projects:
1. [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
2. [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
3. [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
4. [OpenAI Whisper](https://github.com/openai/whisper) - Speech verification
5. [Unsloth](https://github.com/unslothai/unsloth) - Training optimization