HobbyLM-Diffusion (500M MoE, instruction-tuned text diffusion / LLaDA-style)

HobbyLM-Diffusion is the family's experiment in a different decoding paradigm: a masked-diffusion language model (LLaDA-style). Instead of generating left-to-right, it attends bidirectionally and fills in [MASK] tokens over a few iterative denoising passes, so it can decode in parallel. This checkpoint is instruction-tuned: the diffusion base was chat-SFT'd on SmolTalk with a LLaDA-style objective (mask only the assistant response, denoise it conditioned on the clean prompt).

It's part of the HobbyLM family: a 500M sparse-MoE model (and its variants) built from scratch on a hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine (hobby-rs) to run it on a laptop CPU.

Intended use

Experimental conversational generation via iterative denoising. It's a research artifact, not a reliable assistant. Prompt it with the trained USER: / ASSISTANT: turn format. It adopts the chat register and the question→answer shape, but at 500M with a pure-diffusion objective it hallucinates and follows instructions loosely. The decode knobs trade quality against speed; good defaults: temp 0–0.3, steps ≈ 2× the generation length, repetition penalty 1.4–1.5.
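As a rough illustration of those defaults, here is a small helper that wraps a message in the USER: / ASSISTANT: format and derives the decode settings from the generation length. Both function names are hypothetical, and the exact turn separator (a single newline here) is an assumption; check the SFT script in the repo for the template the model was actually trained on.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user message in the USER: / ASSISTANT: turn format.

    Assumption: turns are separated by a single newline. The model card only
    names the two role tags, not the exact whitespace.
    """
    return f"USER: {user_message.strip()}\nASSISTANT:"


def default_decode_kwargs(gen_len: int, temperature: float = 0.2,
                          rep_penalty: float = 1.5) -> dict:
    """Decode settings inside the ranges suggested above.

    steps is set to 2x the generation length; temperature and rep_penalty
    default to values inside the 0-0.3 and 1.4-1.5 ranges.
    """
    return {
        "gen_len": gen_len,
        "steps": 2 * gen_len,
        "temperature": temperature,
        "rep_penalty": rep_penalty,
    }


prompt = build_prompt("What is a mixture-of-experts model?")
kwargs = default_decode_kwargs(gen_len=64)
```

The keyword names match the arguments of the reference generate() call shown under Usage, so the dictionary can be passed through as **kwargs.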

Architecture

Every HobbyLM variant shares one core: a sparse Mixture-of-Experts (MoE) decoder in the modern small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather than by guesswork.

Component                 Value
Total parameters          ~500M (only a fraction is active per token)
Hidden size / layers      768 / 16 (first FFN dense, the rest MoE)
Routed experts / active   36 / top-6 (+ 1 always-on shared expert)
Attention                 GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm
Router                    sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm
Positional                RoPE (θ up to 1e6 for the 8k-context checkpoints)
Tokenizer                 GPT-2 byte-level BPE (50,304 vocab, sentinel-padded)
Optimizer                 Muon on the 2-D + per-expert matrices, AdamW on everything else

The full ablation log (QK-norm is the single biggest lever; aux-loss-free beats classic aux-loss; ≥32 experts and top-6 help; embedding scaling hurt) lives in the project's architecture notes.
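To make the router row concrete, here is a minimal pure-Python sketch of sigmoid gating with top-k selection, no top-k renormalisation, and a DeepSeek-V3-style balancing bias. It is not the repo's code: route, update_bias and step_size are hypothetical names, and the sign-based bias update is the scheme described for DeepSeek-V3, assumed here to carry over.

```python
import math


def route(logits, bias, top_k=6):
    """Pick experts for one token and return [(expert_index, gate_weight)].

    logits: raw router outputs, one per routed expert (36 in HobbyLM).
    bias:   per-expert balancing bias; it influences selection only.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid gating
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    # No top-k renormalisation: the gate weight is the raw sigmoid score,
    # so the weights of the chosen experts need not sum to 1.
    return [(i, scores[i]) for i in chosen]


def update_bias(bias, tokens_per_expert, step_size=1e-3):
    """Aux-loss-free balancing: raise the bias of under-loaded experts and
    lower it for over-loaded ones, with no extra loss term."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return [b + step_size * ((load < mean_load) - (load > mean_load))
            for b, load in zip(bias, tokens_per_expert)]
```

The always-on shared expert sits outside this function: its output is added for every token regardless of what the router selects.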

Decoding

Generation is iterative bidirectional denoising of [MASK] tokens, not left-to-right AR. The GGUF carries diffusion.* metadata (mask-token id, block size) for a diffusion-aware runtime; hobby-rs implements the cached semi-autoregressive denoiser.
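The loop below is a toy sketch of block-wise (semi-autoregressive) denoising in the LLaDA style: each pass scores the whole canvas, then commits the most confident predictions inside the current block. It assumes confidence-based unmasking and omits the KV caching and re-masking that a real runtime would add; MASK, toy_predict and denoise are hypothetical stand-ins, not the hobby-rs implementation.

```python
import random

MASK = -1  # hypothetical mask id; the real one lives in the GGUF diffusion.* metadata


def toy_predict(canvas):
    """Stand-in for the model: one (token, confidence) guess per position."""
    rng = random.Random(sum(tok == MASK for tok in canvas))
    return [(pos % 7, rng.random()) for pos in range(len(canvas))]


def denoise(prompt, gen_len, steps, block_size, predict=toy_predict):
    """Fill gen_len masked slots after the prompt, one block at a time."""
    canvas = list(prompt) + [MASK] * gen_len
    n_blocks = -(-gen_len // block_size)  # ceiling division
    steps_per_block = max(1, steps // max(1, n_blocks))
    for b in range(n_blocks):
        lo = len(prompt) + b * block_size
        hi = min(lo + block_size, len(canvas))
        for step in range(steps_per_block):
            masked = [i for i in range(lo, hi) if canvas[i] == MASK]
            if not masked:
                break
            guesses = predict(canvas)  # one bidirectional pass over the canvas
            remaining = steps_per_block - step
            n_commit = -(-len(masked) // remaining)  # finish the block in time
            # Commit the most confident masked positions, in any order.
            for i in sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]:
                canvas[i] = guesses[i][0]
    return canvas


out = denoise(prompt=[10, 11], gen_len=8, steps=4, block_size=4)
```

With gen_len=8 and steps=4 the canvas is completed in four passes, two tokens per pass, which is the sense in which decoding is parallel.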

Benchmarks

A masked-diffusion model can't be scored by the standard log-likelihood lm-eval harness, so the meaningful numbers are validation loss and decoding throughput, which is where the diffusion paradigm actually shows up:

Metric                                  Value
Validation loss (≈21B tokens)           3.52
Throughput, H100, 128 tok, 32 steps     117.7 tok/s (~2.7× the AR model)
Throughput, H100, AR baseline           ~44 tok/s
Throughput, laptop CPU (q8, cached)     ~6.5 tok/s

The throughput result reproduces the Fast-dLLM literature's 2–3× GPU range from a from-scratch implementation: on memory-bound hardware (GPU) batching the whole canvas is nearly free, so fewer denoising passes than tokens wins; on a compute-bound laptop the same code trails the AR engine. The knob is steps-per-token (quality ↔ speed).
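The arithmetic behind the H100 figures can be checked in a few lines. The inputs come from the benchmark table; the per-pass timings are derived on the assumption that wall-clock time is dominated by forward passes, and the AR figure is itself approximate.

```python
gen_len, steps = 128, 32             # H100 diffusion benchmark setting
diffusion_tps, ar_tps = 117.7, 44.0  # tokens per second from the table (AR is approximate)

tokens_per_pass = gen_len / steps    # tokens committed by each denoising pass
steps_per_token = steps / gen_len    # the quality/speed knob
speedup = diffusion_tps / ar_tps     # diffusion throughput relative to AR

# Implied cost of one forward pass, assuming passes dominate wall-clock time.
diffusion_pass_ms = 1000.0 * gen_len / diffusion_tps / steps
ar_step_ms = 1000.0 / ar_tps

print(f"{tokens_per_pass:.0f} tokens/pass, {steps_per_token:.2f} steps/token")
print(f"speedup {speedup:.2f}x")
print(f"canvas pass {diffusion_pass_ms:.1f} ms vs AR step {ar_step_ms:.1f} ms")
```

Under these assumptions a full-canvas pass costs roughly 1.5 times one AR step but commits four tokens, which is where the net gain comes from.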

At 500M, a masked-diffusion LM trails an equal-scale autoregressive model on raw coherence. The method itself is validated end-to-end here; the limit is capacity and tokens, not the recipe.

Usage

Python (PyTorch reference implementation)

HobbyLM is a custom sparse-MoE architecture. There's no transformers AutoModel for it, so load it with the small reference implementation from the GitHub repo:

# HobbyLM-Diffusion is a MASKED-DIFFUSION model: generation is iterative, bidirectional denoising
# (NOT autoregressive), so it uses the reference diffusion sampler (not transformers.generate).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.diffusion import generate

repo = "rootxhacker/HobbyLM-Diffusion"
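with open(hf_hub_download(repo, "config.json")) as f:
    raw_cfg = json.load(f)
# drop the "preset" key before building the config, as in the original one-liner
cfg = ModelConfig(**{k: v for k, v in raw_cfg.items() if k != "preset"})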
cfg.expert_backend = "bmm"                          # "grouped" on CUDA
model = MoETransformer(cfg).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

enc = tiktoken.get_encoding("gpt2")
ids = torch.tensor([enc.encode_ordinary("The meaning of life is")])
# iterative denoising: gen_len tokens over `steps` bidirectional passes (more steps + lower temp = better)
out = generate(model, ids, gen_len=96, steps=128, temperature=0.2, rep_penalty=1.5, remask_steps=2)
print(enc.decode(out[0].tolist()))

GGUF + hobby-rs (CPU)

GGUF builds (architecture hobbylm) live in rootxhacker/HobbyLM-gguf. They load directly in the from-scratch hobby-rs CPU engine; stock llama.cpp won't load them without registering the hobbylm architecture first.

hobby-rs --model HobbyLM-Diffusion.gguf --prompt "..." --n 64

Training

Two stages. Base: converted from the autoregressive 500M base (weights transfer; same architecture, attention switched to bidirectional) and adapted on ~21B tokens with a masked-token objective reweighted by 1/p_mask (a DiffuGPT/DiffuLLaMA-style conversion, val loss 3.52). Instruction tuning: chat-SFT on SmolTalk trajectories, where each assistant response is masked and denoised conditioned on the clean prompt.
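A minimal sketch of the SFT objective described above: only response tokens are masked, and the masked-token cross-entropy is reweighted by 1/p_mask. Drawing p_mask uniformly per example and averaging over the response length follow the LLaDA convention and are assumptions here; MASK, mask_response, sft_loss and the token_logprob callable are hypothetical names.

```python
import math
import random

MASK = -1  # hypothetical mask id


def mask_response(prompt, response, rng):
    """Corrupt only the assistant response; the prompt stays clean."""
    p_mask = rng.uniform(1e-3, 1.0)  # one masking ratio per example
    noisy = [MASK if rng.random() < p_mask else tok for tok in response]
    return list(prompt) + noisy, p_mask


def sft_loss(noisy, prompt_len, response, token_logprob, p_mask):
    """Cross-entropy over masked response positions, reweighted by 1/p_mask.

    token_logprob(noisy, pos, target) stands in for the model: it returns the
    log-probability of `target` at `pos` given the partially masked sequence.
    """
    total = 0.0
    for j, target in enumerate(response):
        pos = prompt_len + j
        if noisy[pos] == MASK:  # unmasked positions contribute nothing
            total += -token_logprob(noisy, pos, target) / p_mask
    return total / len(response)


rng = random.Random(0)
prompt, response = [1, 2, 3], [4, 5, 6, 7]
noisy, p_mask = mask_response(prompt, response, rng)
uniform = lambda seq, pos, target: math.log(0.5)  # toy model: every token has p = 0.5
loss = sft_loss(noisy, len(prompt), response, uniform, p_mask)
```

The base-stage objective has the same shape, except that every position in the sequence is eligible for masking rather than only the response.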

Limitations

  • Hallucinates and follows instructions loosely. The SFT shifts it into a conversational register and the Q→A shape, but it does not reliably produce correct or on-task answers. This is the expected ceiling for a 500M pure-diffusion model; the limit is capacity, not the recipe.
  • Decoding quality is very sensitive to the sampler settings (see above).
  • The throughput win only materializes on memory-bound hardware (GPU); on a compute-bound, thermally-limited laptop CPU the AR model is faster.

License

Apache-2.0. Weights aren't a substitute for judgement: this is a research / hobby model at the 500M scale, not a production system.
