Miso TTS 8B BF16 for ComfyUI

BF16 conversion of Miso TTS 8B prepared specifically for use with the MisoTTS-ComfyUI custom node:

https://github.com/Saganaki22/MisoTTS-ComfyUI

This repository contains converted BF16 weights only. No architectural changes, finetuning, retraining, or modifications to the original model behavior have been made.

Model Introduction

Miso TTS 8B is a text-to-speech model based on the Sesame CSM architecture. It generates Mimi audio codes from text and optional audio context using a large Llama-style backbone and an autoregressive audio decoder.

This BF16 release is intended for ComfyUI users who want lower memory usage than an FP32 checkpoint while keeping output quality comparable to the original release.

Quickstart

  1. Install ComfyUI.
  2. Install the MisoTTS-ComfyUI custom node: https://github.com/Saganaki22/MisoTTS-ComfyUI
  3. Place this BF16 checkpoint in your MisoTTS model directory.
  4. Load the model using the MisoTTS-ComfyUI loader node.
  5. Generate speech from text or reference-audio workflows.
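After step 3, it can be useful to confirm that the file you placed really stores its tensors in BF16. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensors) with the Python standard library. The function name and the example path are hypothetical, and the location of the model directory depends on the custom node's own documentation.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path):
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


# Hypothetical usage (replace the path with your own checkpoint file):
# print(safetensors_dtype_counts("path/to/checkpoint.safetensors"))
```

For a pure BF16 checkpoint, every tensor would be expected to report the dtype string `BF16`.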

Model Summary

Item                 Value
Model                Miso TTS 8B
Variant              BF16 conversion
Intended Platform    ComfyUI
Custom Node          MisoTTS-ComfyUI
Task                 Text-to-Speech
Architecture         Sesame-style CSM
Backbone             llama-8B
Audio Decoder        llama-300M
Audio Tokenizer      Mimi
Text Vocabulary      128,256
Audio Vocabulary     2,051
Audio Codebooks      32
Max Sequence Length  2,048
Precision            BF16
Format               Safetensors
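
As a rough illustration of what the precision change means for memory, the sketch below estimates weight storage from nominal parameter counts. The counts (8 billion and 300 million) are assumptions read off the component names in the summary, not measured values, and the estimate ignores the Mimi tokenizer, activations, and any caches.

```python
# Nominal parameter counts inferred from the component names
# (llama-8B backbone, llama-300M decoder); real counts will differ somewhat.
NOMINAL_PARAMS = {"backbone": 8_000_000_000, "decoder": 300_000_000}
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2}


def weight_gib(precision):
    """Approximate weight storage in GiB for the given precision."""
    total_params = sum(NOMINAL_PARAMS.values())
    return total_params * BYTES_PER_PARAM[precision] / 2**30


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_gib(precision):.1f} GiB")
```

Under these assumptions BF16 weights take half the space of FP32 weights, since each parameter is stored in 2 bytes instead of 4.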

Architecture

Miso TTS 8B uses two transformer components:

  • A large backbone transformer that consumes text and audio-frame embeddings.
  • A smaller autoregressive decoder transformer that predicts higher-order audio codebooks.

Codebook 0 is predicted directly from the backbone hidden state, while codebooks 1 through 31 are generated autoregressively by the decoder.

This release preserves the original architecture and only changes weight precision.
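
The division of labor between the two transformers can be sketched as a toy loop. This is only an illustration of the control flow described above: `backbone_step` and `decoder_step` are hypothetical stand-ins that return random values, not the real model, and the constants are taken from the Model Summary.

```python
import random

NUM_CODEBOOKS = 32   # from the Model Summary
AUDIO_VOCAB = 2051   # from the Model Summary


def backbone_step(context, rng):
    """Stand-in for the backbone: returns a fake hidden state and codebook 0."""
    hidden = rng.random()
    code0 = rng.randrange(AUDIO_VOCAB)
    return hidden, code0


def decoder_step(hidden, previous_codes, rng):
    """Stand-in for the decoder: picks the next code given the earlier ones."""
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(context, rng):
    """Produce one audio frame made of NUM_CODEBOOKS codes."""
    # Codebook 0 comes from the backbone hidden state.
    hidden, code0 = backbone_step(context, rng)
    codes = [code0]
    # Codebooks 1..31 are produced one at a time, each seeing the codes so far.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(decoder_step(hidden, codes, rng))
    return codes


rng = random.Random(0)
frame = generate_frame(context=[], rng=rng)
print(len(frame))  # 32
```

In the real model, each completed frame would be fed back to the backbone as context for the next frame, and the resulting codes would be decoded to audio by Mimi.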

BF16 Conversion Notes

  • Converted from the original Miso TTS weights.
  • No retraining performed.
  • No finetuning performed.
  • No quantization applied.
  • Intended for lower memory usage compared to FP32 checkpoints.
  • Output quality should remain effectively identical to the original model aside from minor numerical differences inherent to BF16 inference.
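
To make "minor numerical differences" concrete, the sketch below rounds a float32 value to the nearest bfloat16 value using only the standard library. It is a generic illustration of the BF16 number format, not the script used to produce this checkpoint, and it does not treat NaN specially.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    # Reinterpret the value as its 32-bit float32 bit pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # bfloat16 keeps the top 16 bits: sign, 8 exponent bits, 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 3.14159):
    rounded = to_bf16(value)
    print(value, rounded, abs(rounded - value) / value)
```

Because BF16 keeps the full float32 exponent range and drops only mantissa bits, the relative rounding error for normal values stays below 2^-8 (about 0.4%).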

Intended Use

This model is intended for:

  • Text-to-speech generation
  • Conversational speech synthesis
  • Voice continuation workflows
  • Reference-audio conditioned speech generation
  • ComfyUI audio generation pipelines

Limitations

  • Voice similarity from reference audio is not guaranteed.
  • Long generations may need to be split into chunks within the workflow, given the 2,048 maximum sequence length.
  • Output quality depends on prompting and generation settings.
  • Hardware with native BF16 support is recommended for best performance.
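
One simple way to chunk a long script before generation is to group whole sentences up to a character budget, as sketched below. The function name and the default `max_chars=200` are hypothetical. A suitable budget depends on your workflow, and the custom node may offer its own chunking.

```python
import re


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters where possible."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

A single sentence longer than the budget is kept whole rather than cut mid-sentence, so chunk length is a soft limit.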

Attribution

Original model:

  • MisoLabs / Miso TTS 8B

ComfyUI integration:

  • MisoTTS-ComfyUI (https://github.com/Saganaki22/MisoTTS-ComfyUI)

All credit for the original architecture, training, datasets, and research belongs to the original Miso Labs team.

License

This BF16 conversion inherits the licensing and usage restrictions of the original Miso TTS release. Please review the upstream license before use.
