HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style)

HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a masked-diffusion language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in [MASK] tokens over a few iterative denoising passes — so it can decode in parallel. This checkpoint is instruction-tuned: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt).

It's part of the HobbyLM family — a 500M sparse-MoE model (and its variants) built from scratch on a hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine (hobby-rs) to run it on a laptop CPU.

Intended use

Experimental conversational generation via iterative denoising. It's a research artifact, not a reliable assistant. Prompt it with the trained USER: / ASSISTANT: turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. Decode knobs trade quality against speed. Quality-oriented defaults: temperature 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5. Using fewer steps than tokens (the throughput benchmark runs 32 steps for 128 tokens) decodes faster at some cost in quality.

Architecture

Every HobbyLM variant shares one core: a sparse Mixture-of-Experts (MoE) decoder in the modern small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather than by guesswork.

Component	Value
Total parameters	~500M (only a fraction is active per token)
Hidden size / layers	768 / 16 (first FFN dense, the rest MoE)
Routed experts / active	36 / top-6 (+ 1 always-on shared expert)
Attention	GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm
Router	sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm
Positional	RoPE (θ up to 1e6 for the 8k-context checkpoints)
Tokenizer	GPT-2 byte-level BPE (50,304 vocab, sentinel-padded)
Optimizer	Muon on the 2-D + per-expert matrices, AdamW on everything else
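
The attention and tokenizer rows imply some shape arithmetic that is easy to check by hand. The sketch below derives it from the table values. The variable names are illustrative, 50,257 is the standard GPT-2 BPE vocabulary size, and the round-up multiple is an assumption (the table only states the padded size of 50,304).

```python
# Shape arithmetic implied by the architecture table (names are illustrative).
hidden = 768
n_q_heads, n_kv_heads, head_dim = 12, 3, 128

# "Decoupled" head-dim: 128 is not hidden / n_q_heads, which would be 64.
coupled_head_dim = hidden // n_q_heads
q_width = n_q_heads * head_dim         # output width of the query projection
kv_width = n_kv_heads * head_dim       # output width of each of the K and V projections
group_size = n_q_heads // n_kv_heads   # query heads that share one KV head


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


# GPT-2 BPE has 50,257 tokens. Rounding up to a multiple of 128 gives 50,304.
# The multiple is an assumption: 64 gives the same result.
padded_vocab = round_up(50257, 128)

print(coupled_head_dim, q_width, kv_width, group_size, padded_vocab)
```

So the query projection is wider than the hidden size (1536 vs 768), while K and V stay narrow because four query heads share each KV head.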

The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss; ≥32 experts and top-6 help; embedding-scaling hurt) lives in the project's architecture notes.
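
The router row can be made concrete with a small sketch. This is not the HobbyLM code: it is a toy, pure-Python illustration of sigmoid gating with a DeepSeek-V3-style balancing bias, where the bias influences which experts are selected but the gate weights stay the raw sigmoid scores (no top-k renormalization). The function names, the bias step size of 0.001, and the random logits are all assumptions for illustration, and the always-on shared expert and the expert FFNs are omitted.

```python
import math
import random

N_EXPERTS, TOP_K = 36, 6


def route(logits, bias):
    """Pick TOP_K experts for one token; return (expert ids, gate weights)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid gating
    # The balancing bias affects which experts are selected ...
    order = sorted(range(N_EXPERTS), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = order[:TOP_K]
    # ... but not the gate weights, which stay the raw scores (no top-k renorm).
    return chosen, [scores[i] for i in chosen]


def update_bias(bias, load, step=0.001):
    """Aux-loss-free balancing: raise bias for underloaded experts, lower it for overloaded ones."""
    mean_load = sum(load) / len(load)
    return [b + step if n < mean_load else b - step if n > mean_load else b
            for b, n in zip(bias, load)]


random.seed(0)
bias = [0.0] * N_EXPERTS
load = [0] * N_EXPERTS
for _ in range(256):  # one toy batch of 256 tokens
    logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
    chosen, gates = route(logits, bias)
    for i in chosen:
        load[i] += 1
bias = update_bias(bias, load)
print(sum(load), len(chosen))
```

Because balancing happens through the selection bias rather than an auxiliary loss term, the language-modeling loss is left untouched by the balancing mechanism.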

Decoding

Generation is iterative bidirectional denoising of [MASK] tokens, not left-to-right AR. The GGUF carries diffusion.* metadata (mask-token id, block size) for a diffusion-aware runtime; hobby-rs implements the cached semi-autoregressive denoiser.
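
To illustrate the general shape of iterative denoising, here is a minimal sketch with a random stand-in for the network. It is a plain confidence-ordered unmasking schedule, not the cached semi-autoregressive denoiser that hobby-rs implements: there is no block structure, no remasking, and no temperature. `MASK`, `toy_model`, and `denoise` are hypothetical names.

```python
import math
import random

MASK = -1  # placeholder mask id; the real id is stored in the GGUF diffusion.* metadata


def toy_model(canvas, rng):
    """Stand-in for the bidirectional network: a (token, confidence) guess per position."""
    return [(rng.randrange(100), rng.random()) for _ in canvas]


def denoise(prompt, gen_len, steps, seed=0):
    rng = random.Random(seed)
    canvas = list(prompt) + [MASK] * gen_len
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        guesses = toy_model(canvas, rng)  # one pass scores every position at once
        # Commit an even share of what is left so the canvas is clean after `steps` passes.
        n_commit = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]
        for i in best:
            canvas[i] = guesses[i][0]
    return canvas


out = denoise(prompt=[1, 2, 3], gen_len=16, steps=4)
print(out)
```

With 16 masked positions and 4 steps, each pass commits 4 tokens, which is the sense in which the decoder works in parallel.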

Benchmarks

A masked-diffusion model can't be scored by the standard log-likelihood lm-eval harness, so the meaningful numbers are validation loss and decoding throughput, which is where the diffusion paradigm actually shows up:

Metric	Value
Validation loss (≈21B tokens)	3.52
Throughput — H100, 128 tok, 32 steps	117.7 tok/s (~2.7× the AR model)
Throughput — H100, AR baseline	~44 tok/s
Throughput — laptop CPU (q8, cached)	~6.5 tok/s

The throughput result reproduces the Fast-dLLM literature's 2–3× GPU range from a from-scratch implementation: on memory-bound hardware (GPU) batching the whole canvas is nearly free, so fewer denoising passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is steps-per-token (quality ↔ speed).
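
The numbers in the table can be unpacked with a little arithmetic. This is a back-of-envelope reading, assuming decode time is dominated by forward passes and that the AR baseline runs one pass per generated token (neither is stated explicitly above).

```python
# Figures from the benchmark table.
diffusion_tok_s = 117.7  # H100, 128 tokens over 32 denoising steps
ar_tok_s = 44.0          # H100, autoregressive baseline (approximate)
gen_len, steps = 128, 32

speedup = diffusion_tok_s / ar_tok_s
tokens_per_pass = gen_len / steps  # an AR decoder commits 1 token per pass
# How much slower one full-canvas diffusion pass is than one AR pass.
pass_cost_ratio = tokens_per_pass / speedup

print(f"speedup {speedup:.3f}, tokens/pass {tokens_per_pass:.0f}, "
      f"pass cost ratio {pass_cost_ratio:.2f}")
```

Under those assumptions, each denoising pass commits 4 tokens while costing about 1.5× an AR pass, which is where the roughly 2.7× net gain comes from.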

A masked-diffusion LM at 500M still trails an equal-scale autoregressive model on raw coherence. The method itself is validated end-to-end here; the limit is model capacity and training tokens, not the recipe.

Usage

Python (PyTorch reference implementation)

HobbyLM is a custom sparse-MoE architecture — there's no transformers AutoModel for it, so load it with the small reference implementation from the GitHub repo:

# HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising
# — NOT autoregressive — so it uses the reference diffusion sampler (not transformers.generate).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.diffusion import generate

repo = "rootxhacker/HobbyLM-Diffusion"
with open(hf_hub_download(repo, "config.json")) as f:
    raw_cfg = json.load(f)
# drop the "preset" key before building the config
cfg = ModelConfig(**{k: v for k, v in raw_cfg.items() if k != "preset"})
cfg.expert_backend = "bmm"                          # "grouped" on CUDA
model = MoETransformer(cfg).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

enc = tiktoken.get_encoding("gpt2")
ids = torch.tensor([enc.encode_ordinary("The meaning of life is")])
# iterative denoising: gen_len tokens over `steps` bidirectional passes (more steps + lower temp = better)
with torch.no_grad():  # inference only; harmless if generate() already disables gradients
    out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2)
print(enc.decode(out[0].tolist()))

GGUF + hobby-rs (CPU)

GGUF builds (architecture hobbylm) live in rootxhacker/HobbyLM-gguf. They load directly in the from-scratch hobby-rs CPU engine — stock llama.cpp won't load them without registering the hobbylm architecture first.

hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64

Training

Two stages. Base: converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask (a DiffuGPT/DiffuLLaMA-style conversion, val loss 3.52). Instruction tuning: chat-SFT on SmolTalk trajectories — each assistant response is masked and denoised conditioned on the clean prompt.
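
The objective described above can be sketched for a single chat example. This is a toy, pure-Python estimate under stated assumptions, not the training code: `sft_masked_loss` and `uniform_log_prob` are hypothetical names, the stand-in model assigns uniform probability to every token, and the lower bound of 0.05 on the masking ratio is an assumption added to keep the 1/p_mask weight bounded.

```python
import math
import random

MASK = -1  # placeholder mask id


def sft_masked_loss(prompt, response, log_prob_fn, rng):
    """One-sample estimate of a LLaDA-style SFT loss for a single example."""
    p_mask = rng.uniform(0.05, 1.0)  # masking ratio for this example
    noisy = [MASK if rng.random() < p_mask else tok for tok in response]
    canvas = list(prompt) + noisy    # the prompt is never masked
    total = 0.0
    for j, tok in enumerate(response):
        if noisy[j] == MASK:         # only masked positions contribute
            total -= log_prob_fn(canvas, len(prompt) + j, tok)
    return total / (p_mask * len(response))  # the 1/p_mask reweighting


def uniform_log_prob(canvas, position, target, vocab=50304):
    """Stand-in model that gives every token probability 1/vocab."""
    return -math.log(vocab)


rng = random.Random(0)
loss = sft_masked_loss([10, 11, 12], [20, 21, 22, 23, 24, 25], uniform_log_prob, rng)
print(loss)
```

Dividing by p_mask compensates for how many positions were masked, so lightly and heavily masked samples contribute on a comparable scale.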

Limitations

Hallucinates and follows instructions loosely — the SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M pure-diffusion model; the limit is capacity, not the recipe.
Decoding quality is very sensitive to the sampler settings (see the defaults under Intended use).
The throughput win only materializes on memory-bound hardware (GPU); on a compute-bound, thermally limited laptop CPU the AR model is faster.

License

Apache-2.0. Weights aren't a substitute for judgement — this is a research / hobby model at the 500M scale, not a production system.
