# CSM-1B-HF
Sesame CSM 1B model weights for my Hugging Face implementation.
## Overview
CSM-HF is a Hugging Face implementation of Sesame's Conversational Speech Model (CSM). It is a complete rewrite of the PyTorch code provided by Sesame, designed to be fully compatible with Hugging Face Transformers, from inference to training.
## Changes from Sesame's implementation
- Created a `CSMModel` class
- Replaced the torchtune backbone and decoder models with Hugging Face Transformers `LlamaModel`
- Added a processor class to prepare inputs for the model
- Added labels support and decoder training amortization
- Added `generate_frame` and `generate` methods to the model class for generating audio
- Full support for the Hugging Face `Trainer`
## Generation
You can use the model to generate audio from text input. Here's an example for voice cloning:
```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from moshi.models import loaders
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

from modeling_csm import CSMModel
from processor import CSMProcessor

device = 'cuda'

def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )
    return tokenizer

text_tokenizer = load_llama3_tokenizer()

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)

def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)
    audio = audio.squeeze(0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio

model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)

inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "<AUDIO_CLIP_TRANSCRIPT>"},
                # This placeholder is required for audio tokenization; it maps to
                # the first element of the `audios` list passed to the processor
                {"type": "audio"}
            ]
        },
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
                # no audio content here: the model will generate it
            ]
        }
    ],
    audios=[load_audio('AUDIO_CLIP_FOR_VOICE_CLONING.wav', audio_tokenizer.sample_rate)],
    return_tensors="pt"
)

with torch.inference_mode():
    # Generate up to 50 new frames
    gen_frames = model.generate(
        input_ids=inputs['input_ids'].cuda(),
        attention_mask=inputs['attention_mask'].cuda(),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)
audio_array = (decoded_audio * 32768).to(torch.int16).cpu().numpy()

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
```
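To write the result to disk instead of playing it inline, the decoded waveform can be saved with the standard torchaudio API (the output filename here is arbitrary):

```python
# Save the generated audio as a WAV file; `decoded_audio` is the float tensor
# produced above, converted to float32 and given a channel dimension
import torchaudio

torchaudio.save(
    "generated.wav",
    decoded_audio.unsqueeze(0).float().cpu(),  # shape: (channels, samples)
    sample_rate=audio_tokenizer.sample_rate,
)
```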
## Architecture

Model architecture is discussed in `ARCHITECTURE.md` (written by O1).
## Training

### Data Format
CSM-HF expects training data in JSONL format, where each line is a JSON object containing a conversation. Each conversation consists of:

- `messages`: an array of message objects, each with:
  - `role`: speaker identifier (e.g., `"speaker_0"`, `"speaker_1"`)
  - `content`: an array of content objects, which can be:
    - Text: `{"type": "text", "text": "The message text"}`
    - Audio: `{"type": "audio", "url": "path/to/audio/file.wav"}`
- `training_mask`: a boolean array indicating which messages should be used for training (`true`) or context only (`false`)
Example data format:
```json
{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
```
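Since the format is one JSON object per line, a training file can be assembled with only the Python standard library. A minimal sketch (the file name and conversation contents below are illustrative):

```python
# Minimal sketch: append one conversation per line to a JSONL training file
import json

conversation = {
    "messages": [
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "We have a chance for a new life here."},
                {"type": "audio", "url": "clips/example_audio.wav"},
            ],
        },
    ],
    "training_mask": [True],  # train on this message (serialized as JSON `true`)
}

with open("training_data.jsonl", "a") as f:
    f.write(json.dumps(conversation) + "\n")
```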
### Training Process
The model uses a two-stage autoregressive architecture (see the sketch after this list):

1. **Backbone (inter-frame processing)**
   - Processes the entire sequence of frames
   - Each frame represents a combined embedding of all codebooks
   - Handles long-range dependencies between utterances
2. **Decoder (intra-frame processing)**
   - Processes a single frame at a time
   - Generates 32 codebooks sequentially (1 semantic + 31 acoustic)
   - Each codebook is treated as a token in the sequence
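To illustrate the control flow only, here is a toy sketch of one generation step. The GRU modules are stand-ins for the real Llama backbone and decoder, greedy picking stands in for top-k sampling, and none of the names below come from the actual `CSMModel` code:

```python
# Toy sketch of the two-stage pass: the "backbone" mixes information across
# frames, then the "decoder" emits the 32 codebook tokens of the next frame.
import torch
import torch.nn as nn

dim, vocab, num_codebooks = 64, 2051, 32
backbone = nn.GRU(dim, dim, batch_first=True)  # stand-in for the Llama backbone
decoder = nn.GRUCell(dim, dim)                 # stand-in for the Llama decoder
head = nn.Linear(dim, vocab)                   # codebook prediction head
embed = nn.Embedding(vocab, dim)               # codebook token embedding

frames = torch.randn(1, 10, dim)               # 10 frames of combined codebook embeddings

# Stage 1 (backbone, inter-frame): process the whole frame sequence
hidden, _ = backbone(frames)
state = hidden[:, -1]                          # context vector for the next frame

# Stage 2 (decoder, intra-frame): emit the 32 codebooks of one frame sequentially
codes, inp = [], state
for _ in range(num_codebooks):                 # 1 semantic + 31 acoustic
    state = decoder(inp, state)
    next_code = head(state).argmax(-1)         # greedy pick, for illustration only
    codes.append(next_code)
    inp = embed(next_code)                     # feed the token back into the decoder
frame = torch.stack(codes, dim=-1)             # (batch, 32) codebook ids for one frame
```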
Training leverages compute amortization:

- The zeroth (semantic) codebook is trained on all frames
- The remaining codebooks (1-31) are trained on only `amortization_ratio` of the frames
- This significantly reduces memory usage while maintaining quality (sketched below)
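A hedged sketch of the frame selection this implies; `amortization_ratio` is assumed here to be a fraction in (0, 1], and the frame count is made up:

```python
import torch

num_frames = 100
amortization_ratio = 1 / 16  # assumed example value

# Codebook 0 (semantic): loss is computed on every frame
semantic_frames = torch.arange(num_frames)

# Codebooks 1-31 (acoustic): loss is computed on a random subset of frames,
# so the decoder forward/backward only runs on those positions
num_selected = max(1, int(num_frames * amortization_ratio))
acoustic_frames = torch.randperm(num_frames)[:num_selected]

print(f"decoder loss computed on {len(acoustic_frames)}/{num_frames} frames")
```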
To train the model:
```bash
python train.py \
    --train_file path/to/training_data.jsonl \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-6
```
## TODO
- Two-stage autoregressive architecture implementation
- Multi-codebook audio tokenization
- Compute amortization for efficient training
- Dataset preparation with interleaved text/audio
- Custom training loop with separate backbone/decoder losses
- Proper handling of epoch repetition for decoder amortization
- Memory optimization techniques (mixed precision, gradient accumulation)
- LoRA support for efficient fine-tuning
- Faster inference with `torch.compile`
- Voice cloning with prompt tuning / prefix optimization
- Support for DPO
- Support for RL (GRPO, RLOO, etc.)
## Acknowledgements
Special thanks to:
- Sesame Labs for the original architecture design and implementation
- Hugging Face for the Transformers library and training infrastructure
- Claude and ChatGPT for assistance with documentation and code development
This project builds upon research and tools from the open-source community. I am grateful for the collaborative spirit that makes projects like this possible.