nano4M-Audio — trained checkpoint

Extending the 4M masked-multimodal framework to audio as a 5th modality — a controlled study at small academic scale (COM-304, EPFL, Spring 2026).

Overview

nano4M-Audio adds audio to the 4M encoder–decoder transformer without any architectural change; the only training-side modification is contiguous span masking on the audio stream. It is trained jointly on five tokenized modalities (RGB, audio, depth, surface normals, caption) over a self-built set of animal-vocalization clips. The structural modalities (depth and surface normals) are learned well, and iterative generation works in the canonical 4M directions (RGB → depth, RGB → normal). Audio acquires conditional structure at the token level, but this does not translate into usable cross-modal audio↔vision generation. The contribution is a precise diagnosis of why, not a working audio generator; see the report.

Model details

| Property | Value |
|---|---|
| Architecture | encoder–decoder transformer (nanofm.models.fourm.FourM), d6-6w512 |
| Parameters | 95.84 M |
| Width / heads | dim=512, head_dim=64, enc_depth=6, dec_depth=6 |
| Precision | fp32 (bf16 NaN'd in the unified-vocab softmax) |
| Vocabulary | unified, max(vocab_sizes) = 50,304; modality + position embeddings disambiguate streams |
| Loss | per-modality, length-normalized cross-entropy, averaged |
| Base framework | apple/ml-4m + the nano4M course re-implementation |
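
The loss row can be read as follows: cross-entropy is first averaged over the target tokens of each modality, and the per-modality means are then averaged with equal weight. The sketch below illustrates that reading in plain Python. The function name and the input layout (a dict of per-token negative log-likelihoods) are hypothetical, and the repository's implementation may differ, for example in how it normalizes across a batch.

```python
from typing import Dict, List


def per_modality_loss(token_nll: Dict[str, List[float]]) -> float:
    """Average per-token NLL within each modality, then across modalities.

    token_nll maps a modality name to the negative log-likelihoods (in nats)
    of its target tokens for one sample. Modalities without target tokens
    are skipped so they do not pull the average toward zero.
    """
    per_modality = [
        sum(nll) / len(nll)  # length normalization within one modality
        for nll in token_nll.values()
        if len(nll) > 0
    ]
    if not per_modality:
        raise ValueError("no target tokens in any modality")
    return sum(per_modality) / len(per_modality)


example = {
    "tok_audio@512": [5.0, 6.0, 5.5, 5.5],  # 4 targets, mean 5.5
    "tok_depth@196": [4.0, 5.0],            # 2 targets, mean 4.5
}
loss = per_modality_loss(example)  # (5.5 + 4.5) / 2
```

Under this reading, a modality with many target tokens (such as the 512-token audio stream) carries the same weight in the total as a modality with few.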

Modalities

| Modality | Tokenizer | Seq len | Vocab |
|---|---|---|---|
| tok_rgb@196 | 4M-16k DiVAE | 196 | 16,384 |
| tok_audio@512 | EnCodec 24 kHz, K=2 RVQ @ 1.5 kbps (delay/flatten, cb2 +1024) | 512 | 2,048 |
| tok_depth@196 | Depth-Anything-V2 → 4M-8k DiVAE | 196 | 8,192 |
| tok_normal@196 | DSINE → 4M-8k DiVAE | 196 | 8,192 |
| scene_desc | GPT-2 BPE ("a photo of a <class>") | ≤64 | 50,304 |
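
The audio row packs two EnCodec codebooks into one stream: each codebook has 1,024 entries, and shifting the second codebook's indices by 1,024 yields the 2,048-entry vocabulary. Below is a minimal sketch of that flattening. It assumes a simple frame-by-frame interleave (cb1, cb2, cb1, cb2, ...) and does not model the delay pattern mentioned in the table, whose exact form is not specified here. The function names are hypothetical.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 1024  # entries per EnCodec codebook


def flatten_codes(cb1: List[int], cb2: List[int]) -> List[int]:
    """Interleave two codebook streams into one token sequence.

    Second-codebook indices are shifted by CODEBOOK_SIZE so that both
    codebooks share one vocabulary of 2 * CODEBOOK_SIZE without collisions.
    """
    if len(cb1) != len(cb2):
        raise ValueError("codebook streams must have the same number of frames")
    tokens: List[int] = []
    for a, b in zip(cb1, cb2):
        tokens.append(a)
        tokens.append(b + CODEBOOK_SIZE)
    return tokens


def unflatten_codes(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Invert flatten_codes: split the stream and undo the offset."""
    cb1 = tokens[0::2]
    cb2 = [t - CODEBOOK_SIZE for t in tokens[1::2]]
    return cb1, cb2


# Synthetic codes for 256 frames (illustrative values, not real audio).
frames = 256
cb1 = [i % CODEBOOK_SIZE for i in range(frames)]
cb2 = [(7 * i) % CODEBOOK_SIZE for i in range(frames)]
tokens = flatten_codes(cb1, cb2)
```

Under this layout, 512 tokens correspond to 256 EnCodec frames; at the 24 kHz model's 75 frames per second, that is about 3.4 s of audio.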

Training

  • 18,311 steps, batch size 64, ~600 M tokens, ~1 h 10 min on 1× NVIDIA H100, fp32.
  • Optimizer AdamW (β₁ = 0.9, β₂ = 0.95), weight decay 0.05, gradient clip 1.0.
  • Cosine LR 1e-4 → 1e-6, 916 warmup steps. Fixed seed; deterministic clip-level split released.
  • Masking: standard 4M Dirichlet (random) for RGB/depth/normal/caption with per-sample token budgets in [16, 256]; contiguous span masking (stride 2) for audio so the decoder cannot copy an adjacent EnCodec frame.
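
The audio masking in the last bullet can be illustrated with a small sketch. What "stride 2" means is not spelled out here; the code assumes span boundaries fall on even token positions, so that both codebook tokens of a frame are hidden or shown together. The function name, `span_len`, and `num_masked` are hypothetical parameters chosen for illustration, and the per-sample token budgets of the 4M masking are not modeled.

```python
import random
from typing import List, Optional


def span_mask(seq_len: int, num_masked: int, span_len: int,
              stride: int = 2,
              rng: Optional[random.Random] = None) -> List[bool]:
    """Return a mask (True = hidden) built from contiguous spans.

    Span starts are multiples of `stride` and every span covers `span_len`
    tokens, so hidden regions begin and end on stride boundaries. Spans may
    overlap or touch; at least `num_masked` tokens end up hidden when that
    is achievable, with an overshoot of less than one span.
    """
    if span_len % stride != 0:
        raise ValueError("span_len must be a multiple of stride")
    rng = rng or random.Random()
    mask = [False] * seq_len
    starts = list(range(0, seq_len - span_len + 1, stride))
    rng.shuffle(starts)
    for start in starts:
        if sum(mask) >= num_masked:
            break
        for i in range(start, start + span_len):
            mask[i] = True
    return mask


# Hide roughly three quarters of a 512-token audio stream in 8-token spans.
mask = span_mask(512, num_masked=384, span_len=8, rng=random.Random(0))
```

Because every hidden token sits inside a fully hidden span, a hidden frame never has all of its immediate neighbours visible, which is the property the bullet attributes to span masking.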

Dataset

zed-m97/nano4m-audio-tokenized — 9,192 clips over 11 animal classes (cat, chicken, cow, coyote, dog, duck, horse, lion, pig, sheep, pigeon), sourced from AudioSet + VGGSound and cleaned by a 3-stage PANNs / CLIP / Silero-VAD filter; clip-level stratified split (seed 42): 7,347 / 907 / 938 train / val / test. Depth and normal targets are pseudo-labeled (Depth-Anything-V2, DSINE).
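
A clip-level stratified split of this kind can be sketched as follows. This is an illustration only: the 0.8 / 0.1 / 0.1 fractions are an assumption (the released counts correspond to roughly 80 / 10 / 10), the function name is hypothetical, and the released split should be used to reproduce the reported numbers.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(labels: Dict[str, str],
                     fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                     seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Split clip ids into train / val / test separately within each class.

    labels maps a clip id to its class name. Each clip lands in exactly one
    split, and sorting before shuffling makes the result depend only on the
    seed, not on dict ordering.
    """
    by_class = defaultdict(list)
    for clip_id, cls in labels.items():
        by_class[cls].append(clip_id)

    rng = random.Random(seed)
    train: List[str] = []
    val: List[str] = []
    test: List[str] = []
    for cls in sorted(by_class):
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy example: 3 classes with 50 clips each.
labels = {f"{cls}_{i:03d}": cls for cls in ("cat", "dog", "cow") for i in range(50)}
train, val, test = stratified_split(labels)
```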

How to use

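# Clone the repo first for the model code and config:
#   git clone https://github.com/ziyad-m97/nano4M-Audio && cd nano4M-Audio && pip install -e .
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from omegaconf import OmegaConf
from hydra.utils import instantiate

# Build the model from the training config (path is relative to the repo root).
cfg = OmegaConf.load("cfgs/nano4M/animal_full_5mod_v5.yaml")
model = instantiate(cfg.model_config)

# Download the weights from the Hub and read them into a state dict.
ckpt_path = hf_hub_download("zed-m97/nano4m-audio", "checkpoint-final.safetensors")
sd = load_file(ckpt_path)

# strict=False tolerates key mismatches, so report them instead of ignoring them.
result = model.load_state_dict(sd, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
model.eval()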
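
One quick check after building the model is to count its parameters and compare the result with the 95.84 M listed under Model details. This confirms that the config builds the described architecture; it does not by itself show that the weights were loaded. A minimal sketch, with a hypothetical helper name, relying only on the standard `parameters()` / `numel()` interface of PyTorch modules:

```python
def count_parameters_millions(model) -> float:
    """Total number of parameters of a PyTorch module, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

# After the loading snippet above, for example:
#   print(f"{count_parameters_millions(model):.2f} M parameters")
```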

The full evaluation harness is in notebooks/final_evaluation.ipynb; the actual outputs are committed under eval_results/.

Evaluation results (held-out 938-clip test set)

| Probe | Model | Random baseline |
|---|---|---|
| Audio eval CE | 5.28 nats | 7.62 (log 2048); ~6.2 empirical marginal |
| Depth / Normal eval CE | 5.11 / 3.45 | 9.01 |
| RGB eval CE | 9.14 | 9.70 |
| Audio → class, top-1 / top-5 | 10.4% / 48.4% | 9.1% / 45.5% |
| Best cross-modal retrieval R@5 (depth→audio) | 4.5% | 2.5% |
| RGB → depth / RGB → normal token acc | 11.1% / 18.0% | ~0.012% |
| Audio → RGB ImageNet ResNet-50 top-5 hit | 0% | ~5% |
| Memorization probe (train / test acc) | 2.95% / 4.13% | |
| RGB tokenizer fidelity | PSNR 19.1 dB, SSIM 0.80 | |
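
Several entries in the random-baseline column follow directly from the vocabulary sizes and the class count. The short check below reproduces them. The R@5 and ResNet-50 baselines depend on evaluation details not given here and are left out.

```python
import math

VOCAB = {"audio": 2048, "depth/normal": 8192, "rgb": 16384}
NUM_CLASSES = 11

# Cross-entropy of a uniform guess over V tokens is ln(V) nats.
uniform_ce = {name: math.log(v) for name, v in VOCAB.items()}

# Chance of a hit when guessing k distinct classes uniformly out of C is k / C.
top1_chance = 1 / NUM_CLASSES
top5_chance = 5 / NUM_CLASSES

# Chance per-token accuracy for a uniform guess over the 8,192-entry vocabulary.
token_chance = 1 / VOCAB["depth/normal"]

for name, ce in uniform_ce.items():
    print(f"{name}: {ce:.2f} nats")
print(f"top-1 {top1_chance:.1%}, top-5 {top5_chance:.1%}, token {token_chance:.3%}")
# audio: 7.62 nats
# depth/normal: 9.01 nats
# rgb: 9.70 nats
# top-1 9.1%, top-5 45.5%, token 0.012%
```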

The asymmetry. Audio captures ~1 nat of conditional structure per token (CE 5.28 < the ~6.2-nat marginal) and is weakly class-discriminative, yet cross-modal generation mode-collapses. We trace this to three causes: (1) a train/inference masking mismatch that makes single-source decoding out-of-distribution; (2) an acoustic-only EnCodec tokenizer with no class semantics; (3) operating at ~10⁴ clips, below the cross-modal emergence threshold of the contrastive audio-visual literature.
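
The ~6.2-nat empirical marginal is a stricter reference than the uniform 7.62 nats: it is the cross-entropy a context-free predictor reaches by always outputting the overall token frequencies. Below is a minimal sketch of how such a figure can be computed, assuming it is the plain unigram entropy of the audio tokens (the report may compute it differently). The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def marginal_entropy_nats(tokens: Iterable[int]) -> float:
    """Entropy (in nats) of the empirical unigram distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Every token equally frequent: entropy equals ln(2048), the uniform baseline.
uniform = marginal_entropy_nats(range(2048))

# A skewed distribution has lower entropy than the uniform one over its support.
skewed = marginal_entropy_nats([0] * 6 + [1] * 2)
```

The gap between this marginal and the model's eval CE (6.2 − 5.28 ≈ 0.9 nats) is what the paragraph above rounds to ~1 nat of conditional structure per token.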

License

MIT for the model weights and this repository's contributions. The underlying 4M code is licensed under Apache-2.0.
