MisoTTS int4 (BigBlueCeiling)

A weight-only int4 quantization of BigBlueCeiling/MisoTTS-bf16, produced with torchao (int4_weight_only). Only the backbone/decoder Linear layers are quantized; the embeddings, output heads, and projection stay bf16.

EXPERIMENTAL and lower quality. Use int8 or bf16 if your card has the VRAM for them. The weights in this file were packed in the tinygemm layout on an Ampere (sm_86) GPU; if the packed checkpoint does not load on your GPU architecture, the serving core falls back to quantizing the bf16 weights at load time.
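The fallback described above can be pictured as a simple try-then-fall-back pattern. The sketch below is illustrative only and is not the serving core's code: `load_with_fallback` and both loader callables are hypothetical names, and it assumes a failed load of the packed checkpoint surfaces as a RuntimeError or ValueError.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    load_prequantized: Callable[[], T],
    quantize_from_bf16: Callable[[], T],
) -> Tuple[T, str]:
    """Try the prepacked int4 checkpoint first; otherwise quantize bf16 at load.

    Both callables are hypothetical stand-ins for whatever the serving
    core actually does. Returns the model plus a label naming the path taken.
    """
    try:
        return load_prequantized(), "prequantized"
    except (RuntimeError, ValueError) as err:
        # Assumption: an incompatible packed layout raises one of these.
        print(f"prequantized load failed ({err}); quantizing bf16 weights instead")
        return quantize_from_bf16(), "quantized-at-load"


def _unsupported_layout() -> str:
    # Stub that imitates a packed checkpoint failing on another architecture.
    raise RuntimeError("packed layout not supported on this GPU")


model, path = load_with_fallback(_unsupported_layout, lambda: "model-from-bf16")
print(path)
```

Passing the two load paths in as callables keeps the pattern independent of any particular loader signature.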

What it is for

Lowering the hardware floor. Quantization here is a memory lever, not a speed lever: MisoTTS decodes one frame at a time, and those tiny per-step matmuls cannot feed the GPU's low-precision tensor cores, so the int4 weights are dequantized to bf16 for the matmul. You get the VRAM saving, not a throughput win.
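To make the memory argument concrete, here is a back-of-the-envelope estimate of weight storage when only part of the model is quantized. The numbers are placeholders, not measurements: the 8e9 parameter count is only suggested by the loader name `load_miso_8b`, the 0.9 quantized fraction is invented for illustration, and the estimate ignores quantization metadata (group scales and zero points), activations, and any caches.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Storage for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


def mixed_weight_gib(
    n_params: float,
    quantized_fraction: float,
    quant_bits: float = 4.0,
    base_bits: float = 16.0,
) -> float:
    """Weights split between a quantized part and a part left in bf16."""
    quantized = weight_gib(n_params * quantized_fraction, quant_bits)
    unquantized = weight_gib(n_params * (1.0 - quantized_fraction), base_bits)
    return quantized + unquantized


N_PARAMS = 8e9            # assumption, not a published figure
QUANTIZED_FRACTION = 0.9  # assumption: share of weights in quantized Linear layers

print(f"bf16 weights: {weight_gib(N_PARAMS, 16):.1f} GiB")
print(f"int4 + bf16:  {mixed_weight_gib(N_PARAMS, QUANTIZED_FRACTION):.1f} GiB")
```

The layers that stay in bf16 set a floor on the saving, so the real footprint depends on how much of the model the quantized Linear layers account for.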

  • Fits: ~12 GB VRAM cards (RTX 3060 12G, 4070, ...)
  • Quality: noticeably degraded. Mean CER 0.18, WER 0.26, UTMOS 2.93 (vs. 3.94 for bf16). Worst on long utterances (long-clip CER up to ~0.5). Acceptable only as a last-resort "runs at all" tier.
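For readers unfamiliar with the metrics, CER and WER are conventionally the edit distance between a reference transcript and a recognized transcript, divided by the reference length in characters or words. The sketch below shows that textbook definition. It is not the evaluation script behind the numbers above, whose text normalization and speech recognizer are not stated here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("hello", "hallo"))                             # one substitution in five characters
print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
```

Read this way, a mean CER of 0.18 corresponds to roughly one character error for every five or six reference characters.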

Use

This checkpoint is a torch.save'd torchao state_dict (model.pt). The serving core in the MisoTTS repo pulls it automatically when GPU-sense detects a matching VRAM tier. To load it directly:

from generator import load_miso_8b  # from the MisoTTS repo
gen = load_miso_8b("cuda", model_path_or_repo_id="BigBlueCeiling/MisoTTS-int4",
                   prequantized=True)

Requires torch>=2.7 and a matching torchao. Loading unpickles a torchao tensor subclass, so weights_only=False is used; load only checkpoints you trust.
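A quick pre-flight check of the torch requirement can be done without importing torch at all. This is a minimal sketch using only the standard library; the helper names are hypothetical, it compares only the major and minor version numbers, and it does not verify that the installed torchao matches.

```python
import re
from importlib import metadata
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.7.0+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: Tuple[int, int] = (2, 7)) -> bool:
    """True if the version is at least the minimum (major, minor)."""
    return parse_major_minor(version) >= minimum


def installed_torch_ok() -> bool:
    """Check the installed torch distribution; False if torch is missing."""
    try:
        return meets_minimum(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return False


print("torch>=2.7 available:", installed_torch_ok())
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 2.10 as older than 2.7.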

Model and original inference code are MisoLabs' work; see the upstream license.
