Miso TTS 8B BF16 for ComfyUI

BF16 conversion of Miso TTS 8B prepared specifically for use with the MisoTTS-ComfyUI custom node:

https://github.com/Saganaki22/MisoTTS-ComfyUI

This repository contains converted BF16 weights only. No architectural changes, finetuning, retraining, or modifications to the original model behavior have been made.

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context using a large Llama-style backbone and an autoregressive audio decoder.

This BF16 release is intended for ComfyUI users who want reduced memory usage while maintaining output quality comparable to the original release.

Quickstart

1. Install ComfyUI.
2. Install the MisoTTS-ComfyUI custom node: https://github.com/Saganaki22/MisoTTS-ComfyUI
3. Place this BF16 checkpoint in your MisoTTS model directory.
4. Load the model using the MisoTTS-ComfyUI loader node.
5. Generate speech from text or reference-audio workflows.
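
Before loading the checkpoint, it can be useful to confirm that the downloaded file really stores BF16 tensors. The sketch below is not part of the custom node; it only relies on the public safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and the Python standard library. The function names are illustrative.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the tensor entries from the JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_counts(header):
    """Count tensors per dtype string (BF16 tensors are tagged "BF16")."""
    return Counter(entry["dtype"] for entry in header.values())
```

Calling `dtype_counts(read_safetensors_header(path))` on the checkpoint file should report only, or almost only, `BF16` entries. The path depends on where your installation keeps MisoTTS models and is left to the reader.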

Model Summary

Item	Value
Model	Miso TTS 8B
Variant	BF16 Conversion
Intended Platform	ComfyUI
Custom Node	MisoTTS-ComfyUI
Task	Text-to-Speech
Architecture	Sesame-style CSM
Backbone	llama-8B
Audio Decoder	llama-300M
Audio Tokenizer	Mimi
Text Vocabulary	128,256
Audio Vocabulary	2,051
Audio Codebooks	32
Max Sequence Length	2,048
Precision	BF16
Format	Safetensors
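
For a rough sense of the weight footprint, the nominal component sizes in the table can be multiplied by the bytes per parameter. This is a back-of-the-envelope sketch: the parameter counts are taken from the names "llama-8B" and "llama-300M" and are assumptions, not exact figures, and the estimate ignores the Mimi tokenizer, activations, and the KV cache.

```python
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2}

# Nominal sizes read off the component names; real parameter counts may differ.
NOMINAL_PARAMS = {"backbone": 8e9, "audio_decoder": 300e6}


def weight_gib(precision, params=NOMINAL_PARAMS):
    """Approximate size of the weights alone, in GiB, at the given precision."""
    total_params = sum(params.values())
    return total_params * BYTES_PER_PARAM[precision] / 2**30


for precision in ("FP32", "BF16"):
    print(f"{precision}: ~{weight_gib(precision):.1f} GiB of weights")
```

Under these assumptions the BF16 weights take roughly half the space of an FP32 copy, which is the main practical reason to use this conversion.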

Architecture

Miso TTS 8B uses two transformer components:

A large backbone transformer that consumes text and audio-frame embeddings.
A smaller autoregressive decoder transformer that predicts higher-order audio codebooks.

Codebook 0 is predicted directly from the backbone hidden state, while codebooks 1 through 31 are generated autoregressively by the decoder.
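
The split between the two components can be sketched as a per-frame loop. This is a toy illustration of the control flow only, not the actual implementation: `codebook0_head` and `decoder_step` are hypothetical stand-ins for the real networks, and random integers replace real predictions so the example runs without weights.

```python
import random

NUM_CODEBOOKS = 32   # "Audio Codebooks" in the model summary
AUDIO_VOCAB = 2051   # "Audio Vocabulary" in the model summary


def generate_frame(backbone_hidden, codebook0_head, decoder_step):
    """Produce one audio frame of NUM_CODEBOOKS codes.

    codebook0_head(hidden) -> code for codebook 0
    decoder_step(hidden, codes_so_far) -> code for the next codebook
    Both callables are hypothetical stand-ins for the real networks.
    """
    # Codebook 0 comes straight from the backbone hidden state.
    codes = [codebook0_head(backbone_hidden)]
    # Codebooks 1..31 are generated one at a time, each conditioned
    # on the codes already produced for this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(decoder_step(backbone_hidden, codes))
    return codes


# Toy stand-ins so the sketch runs without model weights.
rng = random.Random(0)
toy_head = lambda hidden: rng.randrange(AUDIO_VOCAB)
toy_decoder = lambda hidden, prev: rng.randrange(AUDIO_VOCAB)

frame = generate_frame([0.0] * 8, toy_head, toy_decoder)
print(len(frame))  # one code per codebook
```

In the real model the resulting frame would be embedded and fed back to the backbone to predict the next frame; that outer loop is omitted here.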

This release preserves the original architecture and only changes weight precision.

BF16 Conversion Notes

Converted from the original Miso TTS weights.
No retraining performed.
No finetuning performed.
No quantization applied.
Intended to reduce memory usage relative to FP32 checkpoints: BF16 stores each weight in 2 bytes instead of 4, roughly halving the weight footprint.
Output quality should remain effectively identical to the original model aside from minor numerical differences inherent to BF16 inference.
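
To make "minor numerical differences" concrete, the sketch below rounds a float32 value to the nearest BF16 value using only the standard library. BF16 keeps the float32 sign and 8-bit exponent but only 7 explicit mantissa bits, so each value carries a relative rounding error of at most about 0.4% (2^-8). The helper name `to_bf16` is illustrative, and NaN and infinity handling is omitted.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    # Reinterpret the float32 encoding of x as a 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the top 16 bits of float32; round away the low 16 bits,
    # breaking ties toward an even low bit in the kept half.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 3.14159):
    rounded = to_bf16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value} -> {rounded} (relative error {rel_err:.2e})")
```

Per-weight errors of this size are small compared with typical weight magnitudes, which is consistent with the expectation that output quality stays close to the original.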

Intended Use

This model is intended for:

Text-to-speech generation
Conversational speech synthesis
Voice continuation workflows
Reference-audio conditioned speech generation
ComfyUI audio generation pipelines

Limitations

Voice similarity from reference audio is not guaranteed.
Long generations may require workflow chunking.
Output quality remains dependent on prompting and generation settings.
Hardware with native BF16 support is recommended for best performance.
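
One simple way to chunk long inputs is to split the text at sentence boundaries and synthesize each chunk separately. The helper below is a minimal sketch, not a feature of the custom node; the `max_chars` default is an arbitrary illustrative value, and how character counts relate to the 2,048 maximum sequence length is not specified here.

```python
import re


def chunk_text(text, max_chars=300):
    """Split text into chunks of whole sentences.

    Each chunk is at most max_chars long where possible; a single
    sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed through the generation workflow in turn and the resulting audio clips concatenated.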

Attribution

Original model:

MisoLabs / Miso TTS 8B

ComfyUI integration:

https://github.com/Saganaki22/MisoTTS-ComfyUI

All credit for the original architecture, training, datasets, and research belongs to the original Miso Labs team.

License

This BF16 conversion inherits the licensing and usage restrictions of the original Miso TTS release. Please review the upstream license before use.
