Instructions to use ulanch/nanochat-d34-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ulanch/nanochat-d34-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="ulanch/nanochat-d34-GGUF",
	filename="nanochat-d34-IQ4_XS.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use ulanch/nanochat-d34-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M

Use Docker

docker model run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use ulanch/nanochat-d34-GGUF with Ollama:
```
ollama run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M
```

Unsloth Studio new

How to use ulanch/nanochat-d34-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ulanch/nanochat-d34-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ulanch/nanochat-d34-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ulanch/nanochat-d34-GGUF to start chatting

Docker Model Runner
How to use ulanch/nanochat-d34-GGUF with Docker Model Runner:
```
docker model run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M
```

Lemonade

How to use ulanch/nanochat-d34-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ulanch/nanochat-d34-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.nanochat-d34-GGUF-Q4_K_M

List all available models

lemonade list

nanochat-d34 (karpathy/nanochat-d34) converted to GGUF, so you can run it under llama.cpp. The original is a $2,500 model, ~2.2B params, base-trained ~2X longer than Chinchilla (param:token ratio of 40 instead of 20)

Heads up first: stock llama.cpp refuses to load these. It has no idea what nanochat is and you'll get unknown architecture: nanochat. You need a build from this fork:

https://github.com/ulanch/llama.cpp (branch nanochat)

It's a tiny addition. one new file in src/models/ plus six small edits. For now you build from source.

d34 isn't quite a vanilla transformer, which is why the fork exists. The relevant bits:

parameterless RMS norm everywhere (no learnable γ)
QK-norm after RoPE, also parameterless
ReLU² in the FFN, no gate (up → relu(x)² → down, 4X expansion)
NEOX-style RoPE at base 10000, but with a sign-flipped sin convention vs ggml's NEOX (the fork passes freq_scale = -1.0 to compensate, that's the only ugly part of the implementation)
untied lm_head, no biases anywhere, logit softcap at 15 (same shape as Gemma2)

If you go reading gpt.py on current master of nanochat, watch out: it's diverged a lot. Smear gates, value embeddings, residual lambdas, none of that is in d34. d34 was trained at commit 2c4473d (Jan 11 2026), back when the architecture was much simpler. I wasted an hour matching the wrong file before realizing.

The files:

nanochat-d34-f32.gguf      8.3 GB   reference. matches a pure-PyTorch forward bit for bit.
nanochat-d34-bf16.gguf     4.2 GB   near-lossless half precision. use this, not fp16.
nanochat-d34-Q8_0.gguf     2.2 GB
nanochat-d34-Q6_K.gguf     2.1 GB
nanochat-d34-Q5_K_M.gguf   1.7 GB   matched the f32 greedy exactly on my test prompt
nanochat-d34-Q4_K_M.gguf   1.6 GB   typical default
nanochat-d34-IQ4_XS.gguf   1.3 GB
nanochat-d34-Q3_K_M.gguf   1.3 GB   starts to wander at this size, but still coherent

On the missing f16 file: nanochat's ReLU² FFN can produce activations above 65,504 at deep layers relu(x)² reaches ~88,000 by block 33 and llama.cpp's CPU fp16 matmul downcasts the fp32 activation to fp16 before multiplying, so those values overflow and you get NaN logits silently. bf16 has the full fp32 exponent range with the same byte size, no overflow, so it's the right "half precision" for this arch. That's why bf16 is here and f16 isn't.

To build the runtime:

git clone -b nanochat https://github.com/ulanch/llama.cpp.git
cd llama.cpp
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
    -DLLAMA_CURL=OFF -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF
cmake --build build -j 8 --target llama-cli llama-completion llama-server llama-quantize

LLAMA_BUILD_SERVER=ON is required even if you only care about llama-cli (it's gated on the server build upstream, no idea why). Metal on Apple Silicon and AVX2/512 on x86 are picked up automatically via -DGGML_NATIVE=ON, which is the default.

One-shot completion:

./build/bin/llama-completion -m nanochat-d34-Q5_K_M.gguf \
    -p "The capital of France is" -n 40 --temp 0 -no-cnv

-no-cnv is required, otherwise the CLI renders the chat template and drops you into an interactive session.

OpenAI-compatible server:

./build/bin/llama-server -m nanochat-d34-Q5_K_M.gguf -c 2048 --jinja --port 8080

--jinja is required too, the built-in chat-template parser doesn't recognize this template and crashes on startup. The Jinja parser handles it cleanly.

Want a quant I didn't ship (Q2_K, IQ3_M, anything else)? Grab the bf16 and do it locally:

./build/bin/llama-quantize nanochat-d34-bf16.gguf nanochat-d34-IQ3_M.gguf IQ3_M

About the model: this is a $2,500 base model with light chat finetuning. It's decent on general trivia and writes reasonable English, but you shouldn't expect it to reason hard. It also has odd phrasings, when I asked "Where is Paris located?" greedy decoding produced "Paris is the estate of France, capital of the French Republic, and the centre of...". "Estate" is a real (if archaic) word that means roughly that, but it's not what most people would write. The PyTorch reference does the same thing, that's just d34, probably a side effect of being long-trained on a slightly Wikipedia-leaning corpus.

License is MIT, matching upstream karpathy/nanochat. All the model work is Andrej's; I just wrote the converter, the llama.cpp arch file, and this page.

Downloads last month: 590

GGUF

Model size

2B params

Architecture

nanochat

Hardware compatibility

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

32-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ulanch/nanochat-d34-GGUF

Base model

karpathy/nanochat-d34

Quantized

(2)

this model