---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
tags:
- text-to-speech
---

## CSM 1B

**2025/05/20** - CSM is available natively in [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/csm) 🤗 as of version `4.52.1`

**2025/03/13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https://huggingface.co/sesame/csm_1b).

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face Space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Usage

### Generate a sentence

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]Hello from Sesame."  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Hello from Sesame."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
```

### CSM sounds best when provided with context

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```

---

### Batched Inference 📦

CSM supports batched inference:
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# here, a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```
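`processor.save_audio` writes the results straight to disk; if you want to inspect the waveforms first, the returned `audio` can be iterated per batch item (the filename list above already relies on `len(audio)`). Below is a minimal sketch, assuming each item is a 1-D waveform tensor at the processor's sampling rate and that `soundfile` is installed; neither assumption is a statement from this card.

```python
import soundfile as sf

# `audio` is the value returned by model.generate(..., output_audio=True) above.
sampling_rate = processor.feature_extractor.sampling_rate  # 24 kHz for Mimi codes

for i, waveform in enumerate(audio):
    # move each generated waveform to CPU as a float numpy array
    wav = waveform.detach().float().cpu().numpy()
    print(f"batch item {i}: {wav.shape[-1] / sampling_rate:.2f} s of audio")
    sf.write(f"speech_batch_idx_{i}_copy.wav", wav, samplerate=sampling_rate)
```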
### Making The Model Go Brrr 🏎️

CSM supports full-graph compilation with CUDA graphs!
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "sesame/csm-1b"
device = "cuda"

# set logs to ensure there are no recompilations or graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use a static cache, automatically enabling torch.compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250  # big enough to avoid recompilation
model.generation_config.max_new_tokens = None  # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}

# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

print("\n" + "="*50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]

padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("="*50)
```
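Loading precision is not covered by this card; a common Transformers pattern for cutting memory use and latency is to load the weights in `bfloat16`. The snippet below is a minimal sketch under that assumption; the dtype choice and its impact on CSM's audio quality are not claims from this card and are worth verifying by listening.

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"

processor = AutoProcessor.from_pretrained(model_id)
# Assumption: bfloat16 roughly halves weight memory versus float32;
# quality impact on CSM is untested here.
model = CsmForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```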
### Fine-tuning & training 📉

CSM can be fine-tuned using [Transformers' Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer).
```python
from datasets import load_dataset, Audio
from transformers import (
    CsmForConditionalGeneration,
    TrainingArguments,
    CsmProcessor,
    Trainer,
)

processor = CsmProcessor.from_pretrained("sesame/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
model.train()
model.codec_model.eval()

ds = load_dataset("eustlb/dailytalk-conversations-grouped", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

def data_collator(samples):
    conversations = []

    for sample in samples:
        # split the concatenated audio array back into per-turn segments
        concatenated_audio_array = sample["audio"]["array"]
        audio_segments = [concatenated_audio_array[s:e] for s, e in sample["audio_cut_idxs"]]

        conversation = []
        for speaker_id, text, audio in zip(sample["speaker_ids"], sample["texts"], audio_segments):
            conversation.append({
                "role": f"{speaker_id}",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "audio", "audio": audio},
                ],
            })

        conversations.append(conversation)

    inputs = processor.apply_chat_template(
        conversations,
        tokenize=True,
        return_dict=True,
        output_labels=True,
    )
    return inputs

training_args = TrainingArguments(
    "test-trainer",
    remove_unused_columns=False,
    gradient_checkpointing=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=ds,
    data_collator=data_collator,
)

trainer.train()
```
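The card does not show what to do with the fine-tuned weights afterwards; one straightforward option, sketched below with a hypothetical output directory name, is to save them with the trainer and reload them exactly like the base checkpoint.

```python
# Save the fine-tuned weights and the processor (the directory name is hypothetical).
trainer.save_model("csm-1b-finetuned")
processor.save_pretrained("csm-1b-finetuned")

# Reload for generation, just like the base "sesame/csm-1b" checkpoint above.
finetuned_model = CsmForConditionalGeneration.from_pretrained("csm-1b-finetuned", device_map="cuda")
finetuned_processor = CsmProcessor.from_pretrained("csm-1b-finetuned")
```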
---

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

**Authors**

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.