Calliope SNAC 4B Base (4K)

Stage-1 multilingual SNAC prior for the Calliope text-to-speech project: a continued pretrain of nvidia/Nemotron-H-4B-Base-8K with the vocabulary augmented by 12,288 SNAC codec tokens and a slot router that enforces the codec's C·M·F·F·M·F·F frame pattern at audio-mode positions.

This is the HuggingFace safetensors version, loadable via AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True). The Megatron-Bridge FSDP DCP format is at the sibling repo zeroae/calliope-snac-4b-base-4k.megatron (private) for Bridge-based continued training.

What this is and isn't. This is a pretrained prior, not a finished TTS system. Training mixed text-only and SNAC-audio-only documents — the cross-modal text→SNAC bridge is a separate stage-2 finetune objective and was not learned here. Use this checkpoint as the starting point for a TTS finetune, not as an end-to-end speech model.

Quick start: text generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "zeroae/calliope-snac-4b-base-4k"

tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="cuda",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Text-only generation (the text path is preserved at near-base quality).
# The slot router masks all SNAC tokens to -inf in text mode, so text
# generation is unaffected by the augmented vocab.
ids = tokenizer("In multilingual TTS, prosody", return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0]))

The trust_remote_code=True flag pulls modeling_nemotron_h_augmented.py from this repo, which wraps the base NemotronH-4B with the slot-router logits mask (text mode masks all SNAC tokens, audio mode masks all text tokens; the mask is enforced per position).
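The routing rule can be illustrated with a small state machine. This is a minimal sketch, not the code in modeling_nemotron_h_augmented.py: the function names are hypothetical, the id ranges are copied from the vocab layout quoted elsewhere in this card, and permitting [/SNAC] only at a frame boundary is an assumption.

```python
# Hypothetical sketch of the slot-router rule; NOT the repo's implementation.
SNAC_OPEN, SNAC_CLOSE = 100, 101             # marker ids quoted in this card
TEXT_VOCAB = 131072                          # base vocab size; SNAC ids start here
RANGE_BASE = {"C": 131072, "M": 135168, "F": 139264}
RANGE_SIZE = 4096
SLOT_PATTERN = ["C", "M", "F", "F", "M", "F", "F"]


def allowed_ranges(in_audio, slot):
    """Half-open [lo, hi) id ranges left unmasked at the next position."""
    if not in_audio:
        return [(0, TEXT_VOCAB)]             # text mode: every SNAC id is masked
    lo = RANGE_BASE[SLOT_PATTERN[slot]]
    ranges = [(lo, lo + RANGE_SIZE)]         # audio mode: only the slot's codebook
    if slot == 0:                            # ASSUMPTION: close only between frames
        ranges.append((SNAC_CLOSE, SNAC_CLOSE + 1))
    return ranges


def is_allowed(token, in_audio, slot):
    return any(lo <= token < hi for lo, hi in allowed_ranges(in_audio, slot))


def advance(in_audio, slot, token):
    """Update (in_audio, slot) after `token` has been emitted."""
    if not in_audio:
        return (True, 0) if token == SNAC_OPEN else (False, 0)
    if token == SNAC_CLOSE:
        return (False, 0)
    return (True, (slot + 1) % len(SLOT_PATTERN))


# Walk a short sequence: one text token, the open marker, then C, M, F, F.
state = (False, 0)
for tok in [5, SNAC_OPEN, 131072 + 7, 135168 + 1, 139264, 139264 + 9]:
    assert is_allowed(tok, *state)
    state = advance(*state, tok)
```

Because the state is just a (mode, slot) pair, it can be carried from one forward call to the next, which is what lets incremental decoding work without recomputing the routing from the start of the sequence.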

End-to-end: generate SNAC frames → decode to audio

This pretrained prior has no text→SNAC bridge (see disclaimer above); the example below shows the unconditional end-to-end pipeline that the slot router makes work: prompt with the [SNAC] marker, generate tokens (which the slot router constrains to the C·M·F·F·M·F·F frame pattern), parse them back into the three SNAC codebooks, and decode to a waveform via the upstream hubertsiuzdak/snac_24khz codec.

Expect the audio to be babble or noise: the model is sampling unconditionally from its learned audio distribution, and no text guides the content. The point is to demonstrate the mechanics; quality requires a stage-2 finetune that learns the text→SNAC bridge.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC          # pip install snac
import torchaudio

REPO = "zeroae/calliope-snac-4b-base-4k"

# --- 1. Load LM ---------------------------------------------------------
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda",
    low_cpu_mem_usage=True, trust_remote_code=True,
).eval()

# Vocab layout (from augmented.yaml, also visible in the repo)
SNAC_OPEN, SNAC_CLOSE = 100, 101
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264   # start of each codebook range
N_FRAMES = 50                                      # ~4 s at SNAC-24kHz's coarse rate
N_TOKENS = N_FRAMES * 7                            # 7 tokens / frame (C,M,F,F,M,F,F)

# --- 2. Generate inside an [SNAC] ... span ------------------------------
# The slot router (modeling_nemotron_h_augmented.py) carries its
# (in_slot_mode, slot_counter) state across forward calls via
# self._slot_router_state, so KV caching just works: prefill computes
# routing from initial state, subsequent forwards advance from the
# cached final state. No special flags needed.
prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
with torch.no_grad():
    out = model.generate(
        prompt,
        max_new_tokens=N_TOKENS,
        do_sample=True, temperature=0.8, top_p=0.95,
    )

# --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
gen = out[0, prompt.shape[1]:].tolist()
gen = gen[: (len(gen) // 7) * 7]                    # truncate to whole frames
c_codes, m_codes, f_codes = [], [], []
for i in range(0, len(gen), 7):
    frame = gen[i:i + 7]
    c_codes.append(frame[0] - C_BASE)               # slot 0: C
    m_codes.append(frame[1] - M_BASE)               # slot 1: M
    f_codes.append(frame[2] - F_BASE)               # slot 2: F
    f_codes.append(frame[3] - F_BASE)               # slot 3: F
    m_codes.append(frame[4] - M_BASE)               # slot 4: M
    f_codes.append(frame[5] - F_BASE)               # slot 5: F
    f_codes.append(frame[6] - F_BASE)               # slot 6: F

# Sanity-check the slot router did its job (codes within [0, 4096))
assert all(0 <= c < 4096 for c in c_codes), "C codes out of range - slot router off?"
assert all(0 <= m < 4096 for m in m_codes), "M codes out of range"
assert all(0 <= f < 4096 for f in f_codes), "F codes out of range"

# --- 4. Decode the three codebooks to a 24 kHz waveform -----------------
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
codes = [
    torch.tensor([c_codes], dtype=torch.long, device="cuda"),   # [B=1, N_FRAMES]
    torch.tensor([m_codes], dtype=torch.long, device="cuda"),   # [B=1, 2*N_FRAMES]
    torch.tensor([f_codes], dtype=torch.long, device="cuda"),   # [B=1, 4*N_FRAMES]
]
with torch.no_grad():
    audio = codec.decode(codes)                                 # [1, 1, num_samples]

# --- 5. Save ------------------------------------------------------------
torchaudio.save("calliope_unconditional.wav", audio.squeeze(0).cpu(), sample_rate=24000)
print(f"saved {audio.shape[-1] / 24000:.2f} s of audio  "
      f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")

Dependencies: pip install snac torchaudio in addition to transformers torch. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).
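The parsing step in the example assumes every generated token is a SNAC code. If sampling emits the [/SNAC] close marker (or any other non-SNAC id) before max_new_tokens is reached, the range asserts will fail. A minimal sketch of a guard, assuming the id layout used in the example; the helper name is hypothetical:

```python
# SNAC id span from the vocab layout in the example (three ranges of 4096 ids).
SNAC_LO, SNAC_HI = 131072, 131072 + 3 * 4096


def complete_frames(gen, frame_len=7):
    """Return the longest prefix of `gen` made of whole SNAC frames.

    Scanning stops at the first id outside the SNAC span, for example
    the [/SNAC] marker (id 101) or an end-of-sequence token.
    """
    n = 0
    for token in gen:
        if not (SNAC_LO <= token < SNAC_HI):
            break
        n += 1
    return gen[: (n // frame_len) * frame_len]


# Intended use in step 3 of the example:
#   gen = complete_frames(out[0, prompt.shape[1]:].tolist())
```

This replaces the plain truncation to a multiple of 7 and leaves the per-slot range asserts in place as a check on the router itself.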

Token-budget rule of thumb: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate max_new_tokens for a given duration:

N_TOKENS = int(seconds * 12) * 7      # 7 tokens per frame
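The same rule can be wrapped in two small helpers, one for the budget and one for the longest span that fits in the 4096-token trained context listed under Architecture. This is a sketch with hypothetical function names; the 12 Hz figure is the approximation quoted above, and prompt_len=2 assumes the BOS + [SNAC] prompt from the example.

```python
FRAME_RATE_HZ = 12        # approximate SNAC-24kHz coarse rate quoted above
TOKENS_PER_FRAME = 7      # C, M, F, F, M, F, F
TRAINED_CONTEXT = 4096    # trained sequence length of this checkpoint


def tokens_for(seconds):
    """max_new_tokens for roughly `seconds` of audio (whole frames only)."""
    return int(seconds * FRAME_RATE_HZ) * TOKENS_PER_FRAME


def max_seconds(prompt_len=2, context=TRAINED_CONTEXT):
    """Longest audio span that fits in `context` after the prompt tokens."""
    frames = (context - prompt_len) // TOKENS_PER_FRAME
    return frames / FRAME_RATE_HZ


print(tokens_for(4))             # 336 tokens, i.e. 48 frames
print(round(max_seconds(), 1))   # 48.7 seconds (584 frames)
```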

Why this demo's audio sounds bad (and that's expected)

The model has never seen text + [SNAC]…[/SNAC] parallel sequences, only text-only documents and SNAC-only documents, mixed at the batch level. Unconditional sampling from the SNAC distribution produces something codec-plausible (the slot router guarantees the bit-stream is structurally valid, and the codec can always decode), but it has no semantic content. It's the analogue of letting a language model generate without a prompt: you get gibberish that has the shape of the training distribution. A stage-2 TTS finetune on text → SNAC parallel data is what makes this conditional and intelligible.

Architecture

| Field | Value |
|---|---|
| Base model | nvidia/Nemotron-H-4B-Base-8K (hybrid Mamba + attention, 52 layers) |
| Parameters | ~4.56 B (4 B base + augmented embedding/lm_head rows) |
| Vocabulary size | 143,360 (131,072 base + 12,288 SNAC; within the base range, 2 reserved-special slots become the markers and the other 254 stay unchanged) |
| New tokens | SNAC_C_* (4096), SNAC_M_* (4096), SNAC_F_* (4096), [SNAC] (id 100), [/SNAC] (id 101) |
| Vocab init | mean_resizing (multivariate-normal-matched to existing embedding distribution; Hewitt 2021) |
| Slot router | slot_pattern: [C, M, F, F, M, F, F]; masks logits to the relevant range at each frame position; [SNAC]/[/SNAC] markers flip into/out of audio mode |
| Context length (trained) | 4096 (architectural cap is 8192, inherited from base; 8K inference is unverified for this checkpoint: the slot-router state machine should extend, but no measurement exists) |
| Precision | bfloat16 weights |
| Tokenizer | NemotronH base tokenizer with the 12,290 new tokens added (12,288 SNAC codes + 2 markers) |

The SNAC frame layout is [C, M, F, F, M, F, F]: 7 tokens per coarse frame, one coarse (C) → two mid (M) → four fine (F), matching SNAC-24kHz's 1:2:4 residual-quantizer hierarchy.
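For preparing audio prompts or finetuning data, the layout can also be applied in the forward direction. The sketch below interleaves three codebook index lists into the flat token stream. It assumes the range starts used in the end-to-end example (131072 / 135168 / 139264) and that codes are in time order within each codebook; the function name is hypothetical.

```python
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264   # range starts from the example


def flatten_frames(c_codes, m_codes, f_codes):
    """Interleave SNAC codebook indices into C,M,F,F,M,F,F token ids."""
    n = len(c_codes)
    if len(m_codes) != 2 * n or len(f_codes) != 4 * n:
        raise ValueError("expected a 1:2:4 ratio of coarse:mid:fine codes")
    tokens = []
    for i in range(n):
        m0, m1 = m_codes[2 * i : 2 * i + 2]           # two mid codes per frame
        f0, f1, f2, f3 = f_codes[4 * i : 4 * i + 4]   # four fine codes per frame
        tokens += [
            C_BASE + c_codes[i],
            M_BASE + m0, F_BASE + f0, F_BASE + f1,
            M_BASE + m1, F_BASE + f2, F_BASE + f3,
        ]
    return tokens


print(flatten_frames([1], [2, 3], [4, 5, 6, 7]))
```

This is the inverse of the parsing loop in the end-to-end example, so a round trip through both should return the original codebooks.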

Training summary

| Field | Value |
|---|---|
| Wall-clock | 12 days (2026-05-08 → 2026-05-21) |
| Iterations | 75,000 (warmup 457 linear → cosine decay → min_lr) |
| Global batch size | 8 (mbs=1 × 8-step gradient accumulation, dp=1) |
| Sequence length | 4096 |
| Tokens consumed | ~2.46 B (75k × 8 × 4096) |
| Single-pass | Yes: 600,000 of the bin's 601,910 unique samples; no epoch wrap; no overfitting by construction |
| Optimizer | Adam (β₁=0.9, β₂=0.95, ε=1e-8), weight_decay=0.1, grad-clip 1.0 |
| Peak LR | 5e-5 (backbone) / 5e-4 (decoupled: embedding + lm_head) |
| Min LR | 5e-7 / 5e-6 |
| Hardware | 2× NVIDIA DGX Spark (GB10, sm_121, 128 GB unified each), FSDP ZeRO-3 sharded across both nodes, RoCE interconnect |
| Framework | Megatron-Bridge 0.3.1 + Megatron-Core 0.16.1; resumed across 3 platform-level interruptions |
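The learning-rate schedule listed above can be sketched as a function of the iteration. This is an illustration, not the Megatron-Bridge scheduler: it assumes the warmup ramps linearly from 0 and that the cosine decay spans all remaining iterations, neither of which is stated in the summary.

```python
import math

WARMUP, TOTAL = 457, 75_000   # iteration counts from the training summary


def lr_at(it, peak=5e-5, floor=5e-7):
    """Linear warmup to `peak`, then cosine decay to `floor` at TOTAL."""
    if it < WARMUP:
        return peak * it / WARMUP             # ASSUMPTION: warmup starts at 0
    progress = min(1.0, (it - WARMUP) / (TOTAL - WARMUP))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for it in (0, WARMUP, TOTAL // 2, TOTAL):
    # Backbone group, then the decoupled embedding + lm_head group.
    print(it, lr_at(it), lr_at(it, peak=5e-4, floor=5e-6))
```

The decoupled group follows the same shape at ten times the backbone rate throughout, since both its peak and its floor are scaled by the same factor.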

Final validation losses (iter 75,000)

Held-out per-source dev splits, blended at the same per-phase weights as training:

| Metric | Loss | Perplexity |
|---|---|---|
| lm (weighted across all positions) | 3.944 | 51.6 |
| loss/range_C (SNAC coarse codes) | 4.190 | 66.0 |
| loss/range_M (SNAC mid codes) | 4.509 | 90.9 |
| loss/range_F (SNAC fine codes) | 4.840 | 126.5 |
| loss/text (FineWeb rehearsal) | 2.305 | 10.03 |
| loss/[SNAC] (open marker) | 0.440 | 1.55 |
| loss/[/SNAC] (close marker) | 0.020 | 1.02 |

The coarse-to-fine ordering C < M < F is preserved at every eval over the run, consistent with SNAC's residual hierarchy. Text PPL ~10 is approximately base NemotronH-Base quality (the 30% text rehearsal anchor held throughout), so augmenting the vocabulary did not cause catastrophic forgetting of the base model's text capability.

For context, the random-baseline PPL over each 4096-token SNAC range is 4096; the iter-0 augmented-baseline PPL was ~7200 (slightly above random because the newly-added rows perturbed the softmax). Final range_C PPL 66 is ≈ 62× better than random on coarse codes.
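The perplexities are the exponentials of the losses (in nats), and the random baseline follows from a uniform guess over 4096 codes. A short check of that arithmetic, using only the figures reported above:

```python
import math

RANGE_SIZE = 4096
reported = {            # metric: (loss in nats, reported perplexity)
    "lm": (3.944, 51.6),
    "range_C": (4.190, 66.0),
    "range_M": (4.509, 90.9),
    "range_F": (4.840, 126.5),
    "text": (2.305, 10.03),
}

for name, (loss, ppl) in reported.items():
    # Perplexity is exp(loss); small gaps come from the loss being rounded.
    print(f"{name:8s} exp({loss}) = {math.exp(loss):7.2f}  (reported {ppl})")

# A uniform guess over one 4096-code range has perplexity 4096.
gain_c = RANGE_SIZE / math.exp(4.190)
print(f"range_C vs. random: {gain_c:.0f}x")
```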

Data composition

Bin total: 601,910 samples × 4096 tokens = 2.466 B tokens. 44 sources pooled across 10 languages, blended via per-source disjoint phase windows + without-replacement sampling (provably no document seen by more than one phase or twice within a phase).
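One way to obtain that no-repeat guarantee is to give every phase its own slice of each source's shuffled document list and to consume each slice in order. The sketch below illustrates the idea only; the function name, the fractions, and the seeding are hypothetical and are not taken from the project's pipeline.

```python
import random


def phase_windows(n_docs, fractions, seed=0):
    """Split one source's shuffled doc ids into disjoint per-phase windows."""
    if sum(fractions) > 1.0 + 1e-9:
        raise ValueError("phase fractions must not exceed the source size")
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)       # fixed seed keeps windows reproducible
    windows, start = [], 0
    for frac in fractions:
        end = start + int(n_docs * frac)
        windows.append(order[start:end])     # contiguous slices cannot overlap
        start = end
    return windows


# Hypothetical source with 1000 documents spread over four phases.
windows = phase_windows(1000, [0.4, 0.3, 0.2, 0.1])
seen = [doc for window in windows for doc in window]
print(len(seen) == len(set(seen)))           # every document appears at most once
```

Since each window is a slice of a single permutation, reading a window front to back is sampling without replacement, and no id can land in two windows.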

Realized per-phase quality mix (% of phase, train split):

| Phase | Iters | text | bulk-noisy audio | clean-read | studio |
|---|---|---|---|---|---|
| 1: Broad foundation | 0 – 34,041 | 25.5% | 61.9% | 8.7% | 3.9% |
| 2: Diversity → quality | 34,041 – 58,271 | 22.7% | 52.1% | 16.9% | 8.3% |
| 3: Studio + speaker balance | 58,271 – 70,752 | 16.5% | 30.8% | 30.2% | 22.5% |
| 4: Anneal | 70,752 – 75,000 | 10.2% | 14.3% | 43.8% | 31.7% |
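To relate a logged iteration to the mix in effect at that point, the phase boundaries can be turned into a lookup. This sketch treats each phase as a half-open interval [start, end), which is an assumption because the table lists shared endpoints; the function name is hypothetical.

```python
from bisect import bisect_right

PHASE_STARTS = [0, 34_041, 58_271, 70_752]   # first iteration of each phase
PHASE_MIX = [                                # % text, bulk-noisy, clean-read, studio
    (25.5, 61.9, 8.7, 3.9),
    (22.7, 52.1, 16.9, 8.3),
    (16.5, 30.8, 30.2, 22.5),
    (10.2, 14.3, 43.8, 31.7),
]


def phase_at(iteration):
    """Return (phase number, mix) for a training iteration in [0, 75000)."""
    idx = bisect_right(PHASE_STARTS, iteration) - 1
    return idx + 1, PHASE_MIX[idx]


print(phase_at(20_000))   # phase 1 and its mix
print(phase_at(72_000))   # phase 4 and its mix
```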

Source corpora (all encoded into Megatron .bin/.idx format via the project's format-snac → format-phases pipeline):

Intended use

  • Starting point for a stage-2 TTS finetune on parallel text → SNAC data. The cross-modal bridge is not present in this checkpoint; supervised finetuning on text-aligned SNAC sequences is what lights it up.
  • Multilingual SNAC perplexity benchmarking across the 10 NemotronH-supported languages.
  • Acoustic embedding extraction: pool residual stream activations over a SNAC sequence for downstream classification (language ID, speaker family, audio quality scoring).
  • Audio-only continuation / infilling: given partial SNAC, generate a plausible continuation. Distribution-in, distribution-out.

Out of scope and limitations

  • Not a TTS system. Stage-1 mixed text-only and SNAC-only documents. There is no learned bridge between text and audio in this checkpoint. Prompting it with text and expecting SNAC output (or vice versa) will not work cleanly without a stage-2 finetune.
  • No speaker conditioning. Speaker tokens / voice control are deferred to the downstream TTS finetune by design.
  • 4K context, not 8K. Architecturally the base supports 8K; the augmented model's slot router was never exercised on sequences > 4K. Use 4K and below for now.
  • Languages outside NemotronH's supported 10 were dropped during data design; do not expect quality on e.g. Polish, Indonesian, Vietnamese, Thai, Arabic.
  • HiFi-TTS was in the training mix. If your downstream evaluation uses HiFi-TTS speakers as "held-out studio voices," this prior has already seen them, so the strict version of the held-out-voice gate cannot be measured on this checkpoint. (Future Calliope stage-1 versions hold HiFi-TTS out entirely; see the v3 plan link below.)
  • Convergence ceiling. The loss plateaued well above the original optimistic target (lm 3.0-3.5 hoped, lm 3.94 reached). The conservative forecast fit at iter-20k predicted the final values to within 0.02 nats, so the model is on its forecast trajectory, just lower-quality than initial intuition suggested. The gap was traced to the LR/optimizer regime (a high decoupled_lr perturbing already-converged text rows, and a body-learning bottleneck), not to data quantity. A planned v3 design addresses these (WSD LR schedule, lower decoupled_lr, mean-init confirmed already in use, marginal data blending).

Format details

This repository ships the model in standard HuggingFace safetensors format. Files:

| File | Purpose |
|---|---|
| model.safetensors (~17 GB) | All weights, bfloat16 |
| config.json | NemotronH config + vocab_size: 143360 |
| configuration_nemotron_h.py | Config class (base NemotronH) |
| modeling_nemotron_h.py | Base NemotronH modeling (vendored to avoid transformers version drift) |
| modeling_nemotron_h_augmented.py | NemotronHAugmentedForCausalLM: wrapper that reads augmented.yaml at __init__ and applies the slot-router logits mask in forward |
| augmented.yaml | Slot router + range definitions; read by the augmented modeling class at load time |
| tokenizer.json, tokenizer_config.json | Augmented tokenizer |
| generation_config.json | Default decoding params |
| __init__.py | Empty, so the dir is a valid Python package for trust_remote_code |

The Megatron-Bridge FSDP DCP format of the same trained weights is at the sibling private repo zeroae/calliope-snac-4b-base-4k.megatron; use that one if you want to continue pretraining via Bridge.

Provenance & references

License

This checkpoint is a derivative of nvidia/Nemotron-H-4B-Base-8K. The base model's license terms apply to redistribution and use of these weights. Refer to the linked base model for the authoritative license; this repo does not extend or restrict those terms.

The augmented modeling code (modeling_nemotron_h_augmented.py, augmented.yaml, augmentation specs) is © Zero A.E., LLC and licensed for research use under terms TBD; please contact the org before commercial use.
