HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style)
HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a masked-diffusion language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in [MASK] tokens over a few iterative denoising passes, so it can decode several tokens in parallel. This checkpoint is instruction-tuned: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt).
It's part of the HobbyLM family: a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
(hobby-rs) to run it on a laptop CPU.
Intended use
Experimental conversational generation via iterative denoising. It's a research artifact, not a reliable assistant. Prompt it with the trained USER: / ASSISTANT: turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. Decode knobs trade quality against speed; good defaults: temperature 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5.
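As a starting point, the prompt format and the default knobs above can be bundled into two small helpers. This is a minimal sketch, not part of the repo: `build_prompt` and `default_decode_settings` are hypothetical names, and the exact whitespace between turns (a newline before `ASSISTANT:`) is an assumption, since the card only names the `USER:` / `ASSISTANT:` tags. The dictionary keys mirror the keyword arguments of the reference `generate` call shown under Usage.

```python
def build_prompt(user_message: str) -> str:
    """Format a single-turn prompt with the USER: / ASSISTANT: tags."""
    # Assumption: one turn per line, with the assistant turn left open.
    return f"USER: {user_message.strip()}\nASSISTANT:"


def default_decode_settings(gen_len: int) -> dict:
    """Sampler settings picked from inside the ranges quoted above."""
    return {
        "gen_len": gen_len,
        "steps": 2 * gen_len,   # steps ~ 2x the generation length
        "temperature": 0.2,     # inside the 0-0.3 range
        "rep_penalty": 1.5,     # inside the 1.4-1.5 range
    }


prompt = build_prompt("What is a mixture-of-experts model?")
settings = default_decode_settings(gen_len=96)
print(prompt)
print(settings)
```

Lowering `steps` below `gen_len` trades quality for speed, since more tokens must then be committed per pass.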
Architecture
Every HobbyLM variant shares one core: a sparse Mixture-of-Experts (MoE) decoder in the modern small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather than by guesswork.
| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |
The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss; ≥32 experts and top-6 help; embedding scaling hurt) lives in the project's architecture notes.
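To make the Router row concrete, here is a minimal per-token sketch of sigmoid gating with a DeepSeek-V3-style balancing bias and no top-k renormalization. It is an illustration written in plain Python, not the repo's implementation: the function names, the bias step size, and the load-counting scheme are all assumptions. The always-on shared expert is not modeled; its output would simply be added for every token.

```python
import math
import random

NUM_EXPERTS = 36  # routed experts, from the table above
TOP_K = 6         # active routed experts per token


def route(logits, balance_bias):
    """Pick TOP_K experts for one token and return (indices, gate weights)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid gating
    # The balancing bias only affects WHICH experts are selected...
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + balance_bias[e],
                    reverse=True)
    chosen = ranked[:TOP_K]
    # ...while the gate weights are the raw sigmoid scores (no renormalization).
    gates = [scores[e] for e in chosen]
    return chosen, gates


def update_bias(balance_bias, load, step=1e-3):
    """Aux-loss-free balancing: push overloaded experts down, underloaded up."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(balance_bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
load = [0] * NUM_EXPERTS
for _ in range(256):  # route a toy batch of 256 tokens
    chosen, gates = route([rng.gauss(0, 1) for _ in range(NUM_EXPERTS)], bias)
    for e in chosen:
        load[e] += 1
bias = update_bias(bias, load)
print(chosen, [round(g, 3) for g in gates])
```

Because the bias never enters the gate weights, balancing changes which experts are picked without adding a gradient term to the loss, which is the point of the aux-loss-free scheme.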
Decoding
Generation is iterative bidirectional denoising of [MASK] tokens, not left-to-right AR. The GGUF carries diffusion.* metadata (mask-token id, block size) for a diffusion-aware runtime; hobby-rs implements the cached semi-autoregressive denoiser.
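The loop below is a toy sketch of block-wise (semi-autoregressive) denoising, to show the control flow only. It is not the hobby-rs or reference sampler: `toy_predict` is a random stand-in for the model, `MASK = -1` is a placeholder (the real mask-token id and block size come from the `diffusion.*` metadata), and the "commit the most confident positions each pass" schedule is one common choice, assumed here. KV caching and remasking are not modeled.

```python
import random

MASK = -1  # placeholder id; the real one is stored in the GGUF metadata


def toy_predict(canvas, rng):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(100), rng.random()) for _ in canvas]


def denoise(prompt, gen_len, block_size, steps_per_block, seed=0):
    """Fill gen_len masked slots block by block; return (canvas, passes)."""
    rng = random.Random(seed)
    canvas = list(prompt) + [MASK] * gen_len
    passes = 0
    for start in range(len(prompt), len(canvas), block_size):
        block = range(start, min(start + block_size, len(canvas)))
        for step in range(steps_per_block):
            masked = [i for i in block if canvas[i] == MASK]
            if not masked:
                break
            preds = toy_predict(canvas, rng)  # one bidirectional pass
            passes += 1
            # Spread the remaining masked slots evenly over the remaining steps.
            remaining_steps = steps_per_block - step
            n_commit = -(-len(masked) // remaining_steps)  # ceiling division
            # Commit the most confident predictions; the rest stay masked.
            best = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in best[:n_commit]:
                canvas[i] = preds[i][0]
    return canvas, passes


out, passes = denoise(prompt=[1, 2, 3], gen_len=8, block_size=4, steps_per_block=2)
print(out, passes)
```

With 8 tokens, blocks of 4, and 2 steps per block, the canvas is filled in 4 model passes rather than 8, which is where the parallel-decoding speedup comes from.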
Benchmarks
A masked-diffusion model can't be scored by the standard log-likelihood lm-eval harness, so the meaningful numbers are training loss and decoding throughput, which is where the diffusion paradigm actually shows up:
| Metric | Value |
|---|---|
| Validation loss (≈21B tokens) | 3.52 |
| Throughput (H100, 128 tok, 32 steps) | 117.7 tok/s (~2.7× the AR model) |
| Throughput (H100, AR baseline) | ~44 tok/s |
| Throughput (laptop CPU, q8, cached) | ~6.5 tok/s |
The throughput result reproduces the Fast-dLLM literature's 2–3× GPU range from a from-scratch implementation: on memory-bound hardware (GPU) batching the whole canvas is nearly free, so using fewer denoising passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is steps per token (quality ↔ speed).
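The arithmetic behind the H100 rows can be checked in a few lines. This is a back-of-the-envelope sketch using only the table's figures; the AR number is quoted as approximate (~44 tok/s), so the derived per-pass costs are rough estimates rather than measurements.

```python
diffusion_tok_s = 117.7  # H100, 128 tokens in 32 steps (from the table)
ar_tok_s = 44.0          # H100 AR baseline, approximate
gen_len, steps = 128, 32

speedup = diffusion_tok_s / ar_tok_s  # about 2.7x
tokens_per_pass = gen_len / steps     # tokens committed per denoising pass

# Implied cost of one full-canvas denoising pass vs. one AR decode step.
pass_ms = 1000 * (gen_len / diffusion_tok_s) / steps
ar_step_ms = 1000 / ar_tok_s
pass_cost_ratio = pass_ms / ar_step_ms

print(f"speedup          : {speedup:.2f}x")
print(f"tokens per pass  : {tokens_per_pass:.0f}")
print(f"pass / AR step   : {pass_cost_ratio:.2f}x the cost")
```

Each denoising pass costs more than a single AR step, but it commits four tokens at once; the speedup is the ratio of those two effects. On compute-bound hardware the per-pass cost grows with canvas size, which is consistent with the laptop result.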
A masked-diffusion LM at 500M trails an equal-scale autoregressive model on raw coherence. The method is validated end-to-end here; the limit is capacity and tokens, not the recipe.
Usage
Python (PyTorch reference implementation)
HobbyLM is a custom sparse-MoE architecture, so there's no transformers AutoModel for it; load it with
the small reference implementation from the GitHub repo:
# HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising,
# NOT autoregressive, so it uses the reference diffusion sampler (not transformers.generate).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM
import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.diffusion import generate
repo = "rootxhacker/HobbyLM-Diffusion"
cfg = ModelConfig(**{k: v for k, v in json.load(open(hf_hub_download(repo, "config.json"))).items() if k != "preset"})
cfg.expert_backend = "bmm" # "grouped" on CUDA
model = MoETransformer(cfg).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))
enc = tiktoken.get_encoding("gpt2")
ids = torch.tensor([enc.encode_ordinary("The meaning of life is")])
# iterative denoising: gen_len tokens over `steps` bidirectional passes (more steps + lower temp = better)
out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2)
print(enc.decode(out[0].tolist()))
GGUF + hobby-rs (CPU)
GGUF builds (architecture hobbylm) live in rootxhacker/HobbyLM-gguf. They load
directly in the from-scratch hobby-rs CPU engine; stock llama.cpp won't load them without registering
the hobbylm architecture first.
hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64
Training
Two stages. Base: converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask (a DiffuGPT/DiffuLLaMA-style conversion, val loss 3.52). Instruction tuning: chat-SFT on SmolTalk trajectories, where each assistant response is masked and denoised conditioned on the clean prompt.
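The SFT objective described above can be sketched as follows. This is an illustrative toy, not the training code: the function names are hypothetical, `MASK = -1` is a placeholder id, sampling the mask ratio uniformly from (0, 1] follows the LLaDA recipe and is an assumption here, and so is normalizing by the response length. The per-token log-probabilities would come from the model run on the clean prompt plus the noised response.

```python
import math
import random

MASK = -1  # placeholder mask id


def mask_response(response_ids, rng, eps=1e-3):
    """Mask each response token independently with a sampled ratio p_mask."""
    p_mask = rng.uniform(eps, 1.0)  # assumption: uniform mask ratio, as in LLaDA
    noisy = [MASK if rng.random() < p_mask else t for t in response_ids]
    return noisy, p_mask


def weighted_masked_loss(true_token_logprobs, noisy_response, p_mask):
    """Cross-entropy on masked positions only, reweighted by 1 / p_mask."""
    masked_nll = [-lp for lp, t in zip(true_token_logprobs, noisy_response)
                  if t == MASK]
    # Assumption: normalize by response length, so the 1/p_mask weight makes
    # the expected loss comparable across mask ratios.
    return sum(masked_nll) / p_mask / len(noisy_response)


rng = random.Random(0)
prompt_ids = [11, 12, 13]        # stays clean: the model conditions on it
response_ids = [21, 22, 23, 24]  # only this part is ever masked
noisy, p_mask = mask_response(response_ids, rng)
model_input = prompt_ids + noisy

# Toy log-probs: pretend the model gives each true token probability 0.5.
logprobs = [math.log(0.5)] * len(response_ids)
print(model_input, round(p_mask, 3),
      round(weighted_masked_loss(logprobs, noisy, p_mask), 4))
```

Pre-training uses the same weighting but masks the whole sequence; in SFT the prompt is left untouched so the model learns to denoise a response given a clean instruction.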
Limitations
- Hallucinates and follows instructions loosely: the SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M pure-diffusion model; the limit is capacity, not the recipe.
- Decoding quality is very sensitive to the sampler settings (see above).
- The diffusion throughput win over AR only materializes on memory-bound hardware (GPU); on a compute-bound, thermally limited laptop CPU the AR model is faster.
License
Apache-2.0. Weights aren't a substitute for judgement: this is a research / hobby model at the 500M scale, not a production system.