MisoTTS int4 (BigBlueCeiling)
A weight-only int4 quantization of
BigBlueCeiling/MisoTTS-bf16,
produced with torchao (int4_weight_only). Only the backbone/decoder Linear
layers are quantized; the embeddings, output heads, and projection stay bf16.
EXPERIMENTAL and lower quality. Use int8 or bf16 if either fits on your card. This file was tinygemm-packed on an Ampere (sm_86) GPU; if it does not load on your architecture, the serving core falls back to quantizing the bf16 weights at load time.
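To make "weight-only int4" concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization of a single weight row and its dequantization back to floats. It is an illustration of the general idea only: it does not reproduce torchao's int4_weight_only implementation or the tinygemm packed layout, and `GROUP_SIZE`, the helper names, and the sample values are assumptions made for readability.

```python
# Illustrative only: asymmetric group-wise 4-bit quantization of one weight row.
# Not torchao's kernel or packed layout; GROUP_SIZE and all names are assumed.
GROUP_SIZE = 4  # tiny group so the example is easy to follow


def quantize_group(values):
    """Map floats to integer codes in [0, 15] with one scale and minimum per group."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize a row group by group; returns a list of (codes, scale, lo)."""
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Concatenate the dequantized groups back into one row."""
    out = []
    for codes, scale, lo in groups:
        out.extend(dequantize_group(codes, scale, lo))
    return out


row = [0.12, -0.40, 0.33, 0.05, 1.50, -1.20, 0.00, 0.75]
groups = quantize_row(row)
restored = dequantize_row(groups)
# Rounding error per element is bounded by half the group's scale.
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

The second group spans a wider range than the first, so its scale is coarser and its rounding error larger. That is the basic mechanism by which 4-bit storage loses precision relative to bf16.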
What it is for
Lowering the hardware floor. Quantization here is a memory lever, not a speed lever: MisoTTS decodes one frame at a time, and those tiny per-step matmuls cannot feed the GPU's low-precision tensor cores, so the int4 weights are dequantized to bf16 for each matmul. You get the VRAM saving, not a throughput win.
- Fits: ~12 GB VRAM cards (RTX 3060 12G, 4070, ...)
- Quality: noticeably degraded. Mean CER 0.18, WER 0.26, UTMOS 2.93 (vs bf16 UTMOS 3.94). Worst on long utterances (long-clip CER up to ~0.5). Acceptable only as a last-resort 'runs at all' tier.
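The memory argument can be sketched as back-of-envelope arithmetic. The numbers below rest on assumptions that are not stated in this card: a parameter count of 8 billion (suggested only by the loader name load_miso_8b), a quantization group size of 128, and 32 bits of scale/zero-point metadata per group. The estimate also treats every weight as quantized, whereas here the embeddings, output heads, and projection stay bf16, and it ignores activations and caches, so real usage will be higher.

```python
def weight_gib(n_params, bits, group_size=None, meta_bits=32):
    """Approximate weight storage in GiB.

    group_size: if set, add meta_bits of scale/zero-point metadata per group.
    """
    total_bits = n_params * bits
    if group_size:
        total_bits += (n_params / group_size) * meta_bits
    return total_bits / 8 / 2**30


N_PARAMS = 8e9  # ASSUMPTION: inferred from the name load_miso_8b, not confirmed

bf16_gib = weight_gib(N_PARAMS, 16)
int8_gib = weight_gib(N_PARAMS, 8)
int4_gib = weight_gib(N_PARAMS, 4, group_size=128)  # ASSUMED group size

for name, gib in [("bf16", bf16_gib), ("int8", int8_gib), ("int4", int4_gib)]:
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

Under these assumptions the weights shrink to roughly a quarter of their bf16 size, which is the whole benefit of this tier: the model fits on a smaller card, with no decoding speedup expected.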
Use
This checkpoint is a torch.save'd torchao state_dict (model.pt). The serving
core in the MisoTTS repo pulls it
automatically when GPU-sense detects a matching VRAM tier. To load it directly:
    from generator import load_miso_8b  # from the MisoTTS repo

    gen = load_miso_8b(
        "cuda",
        model_path_or_repo_id="BigBlueCeiling/MisoTTS-int4",
        prequantized=True,
    )
Requires torch>=2.7 and a matching torchao release. Loading unpickles a torchao
tensor subclass, so weights_only=False is used; load only checkpoints you trust.
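Before loading, it can help to confirm the torch requirement up front rather than hit an unpickling error later. The following is a minimal sketch using only the standard library; the helper names are hypothetical and not part of the MisoTTS repo, and it checks only the torch version, not whether the installed torchao matches it.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(ver):
    """Return (major, minor) from strings such as '2.7.1+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", ver)
    if match is None:
        raise ValueError(f"unrecognized version string: {ver!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(ver, minimum=(2, 7)):
    """Compare numerically, so that 2.10 is correctly newer than 2.7."""
    return parse_major_minor(ver) >= minimum


def check_torch(minimum=(2, 7)):
    """Return True if an installed torch satisfies the minimum, else False."""
    try:
        return meets_minimum(version("torch"), minimum)
    except PackageNotFoundError:
        return False  # torch is not installed in this environment


if not check_torch():
    print("torch>=2.7 not found; this checkpoint is not expected to load.")
```

Comparing integer tuples instead of raw strings avoids the common mistake of ranking "2.10" below "2.7".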
Model and original inference code are MisoLabs' work; see the upstream license.