nanochat-d34 (karpathy/nanochat-d34) converted to GGUF, so you can run it under llama.cpp. The original is a $2,500 model, ~2.2B params, base-trained ~2X longer than Chinchilla (param:token ratio of 40 instead of 20)

Heads up first: stock llama.cpp refuses to load these. It has no idea what nanochat is and you'll get unknown architecture: nanochat. You need a build from this fork:

https://github.com/ulanch/llama.cpp (branch nanochat)

It's a tiny addition. one new file in src/models/ plus six small edits. For now you build from source.

d34 isn't quite a vanilla transformer, which is why the fork exists. The relevant bits:

  • parameterless RMS norm everywhere (no learnable γ)
  • QK-norm after RoPE, also parameterless
  • ReLU² in the FFN, no gate (up → relu(x)² → down, 4X expansion)
  • NEOX-style RoPE at base 10000, but with a sign-flipped sin convention vs ggml's NEOX (the fork passes freq_scale = -1.0 to compensate, that's the only ugly part of the implementation)
  • untied lm_head, no biases anywhere, logit softcap at 15 (same shape as Gemma2)

If you go reading gpt.py on current master of nanochat, watch out: it's diverged a lot. Smear gates, value embeddings, residual lambdas, none of that is in d34. d34 was trained at commit 2c4473d (Jan 11 2026), back when the architecture was much simpler. I wasted an hour matching the wrong file before realizing.

The files:

nanochat-d34-f32.gguf      8.3 GB   reference. matches a pure-PyTorch forward bit for bit.
nanochat-d34-bf16.gguf     4.2 GB   near-lossless half precision. use this, not fp16.
nanochat-d34-Q8_0.gguf     2.2 GB
nanochat-d34-Q6_K.gguf     2.1 GB
nanochat-d34-Q5_K_M.gguf   1.7 GB   matched the f32 greedy exactly on my test prompt
nanochat-d34-Q4_K_M.gguf   1.6 GB   typical default
nanochat-d34-IQ4_XS.gguf   1.3 GB
nanochat-d34-Q3_K_M.gguf   1.3 GB   starts to wander at this size, but still coherent

On the missing f16 file: nanochat's ReLU² FFN can produce activations above 65,504 at deep layers relu(x)² reaches ~88,000 by block 33 and llama.cpp's CPU fp16 matmul downcasts the fp32 activation to fp16 before multiplying, so those values overflow and you get NaN logits silently. bf16 has the full fp32 exponent range with the same byte size, no overflow, so it's the right "half precision" for this arch. That's why bf16 is here and f16 isn't.

To build the runtime:

git clone -b nanochat https://github.com/ulanch/llama.cpp.git
cd llama.cpp
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DLLAMA_CURL=OFF -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF
cmake --build build -j 8 --target llama-cli llama-completion llama-server llama-quantize

LLAMA_BUILD_SERVER=ON is required even if you only care about llama-cli (it's gated on the server build upstream, no idea why). Metal on Apple Silicon and AVX2/512 on x86 are picked up automatically via -DGGML_NATIVE=ON, which is the default.

One-shot completion:

./build/bin/llama-completion -m nanochat-d34-Q5_K_M.gguf \
    -p "The capital of France is" -n 40 --temp 0 -no-cnv

-no-cnv is required, otherwise the CLI renders the chat template and drops you into an interactive session.

OpenAI-compatible server:

./build/bin/llama-server -m nanochat-d34-Q5_K_M.gguf -c 2048 --jinja --port 8080

--jinja is required too, the built-in chat-template parser doesn't recognize this template and crashes on startup. The Jinja parser handles it cleanly.

Want a quant I didn't ship (Q2_K, IQ3_M, anything else)? Grab the bf16 and do it locally:

./build/bin/llama-quantize nanochat-d34-bf16.gguf nanochat-d34-IQ3_M.gguf IQ3_M

About the model: this is a $2,500 base model with light chat finetuning. It's decent on general trivia and writes reasonable English, but you shouldn't expect it to reason hard. It also has odd phrasings, when I asked "Where is Paris located?" greedy decoding produced "Paris is the estate of France, capital of the French Republic, and the centre of...". "Estate" is a real (if archaic) word that means roughly that, but it's not what most people would write. The PyTorch reference does the same thing, that's just d34, probably a side effect of being long-trained on a slightly Wikipedia-leaning corpus.

License is MIT, matching upstream karpathy/nanochat. All the model work is Andrej's; I just wrote the converter, the llama.cpp arch file, and this page.

Downloads last month
590
GGUF
Model size
2B params
Architecture
nanochat
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

32-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ulanch/nanochat-d34-GGUF

Quantized
(2)
this model