Cosmos3-Nano — MLX 4-bit (Apple Silicon)

A 4-bit MLX build of nvidia/Cosmos3-Nano that runs on Apple Silicon. It is not just a set of quantized weights but a working text2image model: the custom Cosmos3 omni-MoT diffusion transformer was ported to MLX from scratch (mlx-vlm does not support this architecture), and every block was validated against the reference torch implementation.

Derivative of nvidia/Cosmos3-Nano. © NVIDIA. Distributed under OpenMDW-1.1 (license + NVIDIA copyright/origin notices retained). Not affiliated with, nor endorsed by, NVIDIA.

Highlights

Transformer: 30.3 GB bf16 → 12.1 GB MLX-4bit (468 attn+MLP linears quantized, group-64; embeddings/norms/lm_head kept bf16).
Peak memory is ~11 GB, so it fits on a 16 GB Mac. A 256×256 image takes ~12 s on an M2 Ultra; higher resolutions take longer.
Validated: every module matches the torch reference. Primitives agree to ~1e-6, a full decoder layer to ~1e-3 (bf16), and patchify is bit-exact.
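
The size reduction above can be sanity-checked with a little arithmetic. The sketch below is an estimate, not a measurement: it assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias (the usual layout for affine group quantization), and it back-solves what share of the bf16 checkpoint must sit in the quantized linears to land on the reported 12.1 GB.

```python
# Rough size estimate for group-wise 4-bit quantization.
BF16_BITS = 16
GROUP_SIZE = 64      # from the card: group-64
WEIGHT_BITS = 4      # from the card: 4-bit
# Assumption: one 16-bit scale and one 16-bit bias are stored per group.
OVERHEAD_BITS = 2 * 16

bits_per_weight = WEIGHT_BITS + OVERHEAD_BITS / GROUP_SIZE
ratio = bits_per_weight / BF16_BITS   # size of a quantized linear relative to bf16

bf16_gb = 30.3   # from the card
mlx_gb = 12.1    # from the card

# Solve bf16_gb * (1 - q) + bf16_gb * q * ratio = mlx_gb for q,
# the fraction of bf16 bytes that live in the quantized linears.
q = (bf16_gb - mlx_gb) / (bf16_gb * (1 - ratio))
kept_bf16_gb = bf16_gb * (1 - q)

print(f"bits per quantized weight: {bits_per_weight}")
print(f"implied quantized fraction: {q:.1%}")
print(f"implied bf16 remainder: {kept_bf16_gb:.1f} GB")
```

Under these assumptions a quantized weight costs 4.5 bits, and the rest of the 12.1 GB is the embeddings, norms, and lm_head that stay in bf16.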

Usage

import torch
from huggingface_hub import snapshot_download
from mlx_pipeline import MLXCosmos3Transformer        # included in this repo
from diffusers import Cosmos3OmniPipeline, AutoencoderKLWan, UniPCMultistepScheduler
from diffusers.models.autoencoders.autoencoder_cosmos3_audio import Cosmos3AVAEAudioTokenizer
from transformers import AutoTokenizer

# Download the repo snapshot (weights, configs, tokenizers) to the local cache.
repo = snapshot_download("Reza2kn/Cosmos3-Nano-MLX-4bit")

# The small components stay in torch.
vae = AutoencoderKLWan.from_pretrained(repo, subfolder="vae", torch_dtype=torch.float32).eval()
sched = UniPCMultistepScheduler.from_pretrained(repo, subfolder="scheduler")
tok = AutoTokenizer.from_pretrained(repo, subfolder="text_tokenizer")
st = Cosmos3AVAEAudioTokenizer.from_pretrained(repo, subfolder="sound_tokenizer", torch_dtype=torch.float32).eval()

# The transformer is the MLX 4-bit port; everything else is the stock pipeline.
pipe = Cosmos3OmniPipeline(transformer=MLXCosmos3Transformer(repo + "/transformer"),
        text_tokenizer=tok, vae=vae, scheduler=sched, sound_tokenizer=st, enable_safety_checker=False)

# num_frames=1 produces a single image: take the first frame of the first result.
img = pipe("A red panda astronaut floating in a nebula", num_frames=1,
           height=384, width=384, num_inference_steps=24).video[0][0]
img.save("out.png")

Requires: mlx, huggingface_hub, diffusers (git main / ≥0.39 for Cosmos3), transformers, torch (VAE/scheduler only). The heavy 16B transformer runs in MLX on the GPU; the small VAE/scheduler/tokenizer run in torch.
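
Before loading several gigabytes of weights, it can help to confirm the packages are importable. This is a minimal sketch using only the standard library; the `check_requirements` helper is hypothetical (it is not part of this repo), and the package list is taken from the requirements line plus the `huggingface_hub` import in the usage snippet. It reports versions but does not enforce the diffusers minimum.

```python
from importlib import metadata, util

# Package names from the requirements line and the usage imports.
REQUIRED = ["mlx", "diffusers", "transformers", "torch", "huggingface_hub"]

def check_requirements(names=REQUIRED):
    """Map each name to its installed version, 'unknown', or None if missing."""
    report = {}
    for name in names:
        if util.find_spec(name) is None:
            report[name] = None          # not importable at all
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "unknown"     # importable, but no distribution metadata
    return report

for name, version in check_requirements().items():
    print(f"{name:16s} {version or 'MISSING'}")
```

Anything printed as MISSING needs installing first; for diffusers, compare the reported version against the Cosmos3 requirement by hand.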

Quality (honest)

Same profile as any 4-bit build: clean on typical content (portraits, scenes, objects, food; see samples/), but 4-bit defects appear on hard anatomy, e.g. fused or mangled hands (samples/barista.png) and broken limbs in complex poses (samples/anime.png). PickScore (mean 21.42, versus ~21.8 for the CUDA builds) does not reliably catch these defects, so inspect the hard cases by eye. Use FP8/BF16 if you need hands and complex anatomy to hold up.

Status / honesty

text2image: working (samples/*.png), with the 4-bit anatomy caveats above.
text2video: working (samples/t2v_waves.mp4, num_frames>1).
image2video / audio: not implemented yet (image-conditioning + sound paths).
Quantization is 4-bit weight-only — near-original on typical content, with the usual 4-bit wobble on the hardest cases (dense hands, on-image text), same as any 4-bit build.

How it was built

Two files do the work: mlx_cosmos3.py holds the validated MLX modules, and mlx_pipeline.py is a torch wrapper that routes the transformer forward pass to MLX while reusing the torch tokenizer, UniPC scheduler, VAE, and CFG logic. Weights were quantized with mx.quantize (group-64, 4-bit), streamed shard by shard.
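
To make "group-64, 4-bit" concrete, here is a plain-Python illustration of group-wise affine quantization, the general scheme that mx.quantize implements. It is not the MLX kernel or this repo's conversion script: the helper names are made up for the example, the input is random data, and real implementations pack the 4-bit codes into integers instead of keeping a Python list.

```python
import random

def quantize_group(values, bits=4):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1               # 15 steps between min and max for 4-bit
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + bias for c in codes]

def quantize_row(row, group_size=64, bits=4):
    """Split a weight row into fixed-size groups and quantize each one."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]   # stand-in for one weight row
groups = quantize_row(row)
restored = [x for g in groups for x in dequantize_group(*g)]

max_err = max(abs(a - b) for a, b in zip(row, restored))
worst_scale = max(scale for _, scale, _ in groups)
print(f"groups: {len(groups)}, max abs error: {max_err:.2e}, bound: {worst_scale / 2:.2e}")
```

Each group gets its own scale and bias, so the rounding error is bounded by half of that group's step size. Smaller groups track local weight ranges more closely at the cost of more per-group metadata.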
