Instructions to use ulanch/nanochat-d34-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use ulanch/nanochat-d34-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="ulanch/nanochat-d34-GGUF", filename="nanochat-d34-IQ4_XS.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use ulanch/nanochat-d34-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf ulanch/nanochat-d34-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf ulanch/nanochat-d34-GGUF:Q4_K_M
Use Docker
docker model run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use ulanch/nanochat-d34-GGUF with Ollama:
ollama run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M
- Unsloth Studio new
How to use ulanch/nanochat-d34-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ulanch/nanochat-d34-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for ulanch/nanochat-d34-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for ulanch/nanochat-d34-GGUF to start chatting
- Docker Model Runner
How to use ulanch/nanochat-d34-GGUF with Docker Model Runner:
docker model run hf.co/ulanch/nanochat-d34-GGUF:Q4_K_M
- Lemonade
How to use ulanch/nanochat-d34-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull ulanch/nanochat-d34-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.nanochat-d34-GGUF-Q4_K_M
List all available models
lemonade list
nanochat-d34 (karpathy/nanochat-d34) converted to GGUF, so you can run it under llama.cpp. The original is a $2,500 model, ~2.2B params, base-trained ~2X longer than Chinchilla (param:token ratio of 40 instead of 20)
Heads up first: stock llama.cpp refuses to load these. It has no idea what nanochat is and you'll get unknown architecture: nanochat. You need a build from this fork:
https://github.com/ulanch/llama.cpp (branch nanochat)
It's a tiny addition. one new file in src/models/ plus six small edits. For now you build from source.
d34 isn't quite a vanilla transformer, which is why the fork exists. The relevant bits:
- parameterless RMS norm everywhere (no learnable γ)
- QK-norm after RoPE, also parameterless
- ReLU² in the FFN, no gate (up → relu(x)² → down, 4X expansion)
- NEOX-style RoPE at base 10000, but with a sign-flipped sin convention vs ggml's NEOX (the fork passes
freq_scale = -1.0to compensate, that's the only ugly part of the implementation) - untied lm_head, no biases anywhere, logit softcap at 15 (same shape as Gemma2)
If you go reading gpt.py on current master of nanochat, watch out: it's diverged a lot. Smear gates, value embeddings, residual lambdas, none of that is in d34. d34 was trained at commit 2c4473d (Jan 11 2026), back when the architecture was much simpler. I wasted an hour matching the wrong file before realizing.
The files:
nanochat-d34-f32.gguf 8.3 GB reference. matches a pure-PyTorch forward bit for bit.
nanochat-d34-bf16.gguf 4.2 GB near-lossless half precision. use this, not fp16.
nanochat-d34-Q8_0.gguf 2.2 GB
nanochat-d34-Q6_K.gguf 2.1 GB
nanochat-d34-Q5_K_M.gguf 1.7 GB matched the f32 greedy exactly on my test prompt
nanochat-d34-Q4_K_M.gguf 1.6 GB typical default
nanochat-d34-IQ4_XS.gguf 1.3 GB
nanochat-d34-Q3_K_M.gguf 1.3 GB starts to wander at this size, but still coherent
On the missing f16 file: nanochat's ReLU² FFN can produce activations above 65,504 at deep layers relu(x)² reaches ~88,000 by block 33 and llama.cpp's CPU fp16 matmul downcasts the fp32 activation to fp16 before multiplying, so those values overflow and you get NaN logits silently. bf16 has the full fp32 exponent range with the same byte size, no overflow, so it's the right "half precision" for this arch. That's why bf16 is here and f16 isn't.
To build the runtime:
git clone -b nanochat https://github.com/ulanch/llama.cpp.git
cd llama.cpp
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=OFF -DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF
cmake --build build -j 8 --target llama-cli llama-completion llama-server llama-quantize
LLAMA_BUILD_SERVER=ON is required even if you only care about llama-cli (it's gated on the server build upstream, no idea why). Metal on Apple Silicon and AVX2/512 on x86 are picked up automatically via -DGGML_NATIVE=ON, which is the default.
One-shot completion:
./build/bin/llama-completion -m nanochat-d34-Q5_K_M.gguf \
-p "The capital of France is" -n 40 --temp 0 -no-cnv
-no-cnv is required, otherwise the CLI renders the chat template and drops you into an interactive session.
OpenAI-compatible server:
./build/bin/llama-server -m nanochat-d34-Q5_K_M.gguf -c 2048 --jinja --port 8080
--jinja is required too, the built-in chat-template parser doesn't recognize this template and crashes on startup. The Jinja parser handles it cleanly.
Want a quant I didn't ship (Q2_K, IQ3_M, anything else)? Grab the bf16 and do it locally:
./build/bin/llama-quantize nanochat-d34-bf16.gguf nanochat-d34-IQ3_M.gguf IQ3_M
About the model: this is a $2,500 base model with light chat finetuning. It's decent on general trivia and writes reasonable English, but you shouldn't expect it to reason hard. It also has odd phrasings, when I asked "Where is Paris located?" greedy decoding produced "Paris is the estate of France, capital of the French Republic, and the centre of...". "Estate" is a real (if archaic) word that means roughly that, but it's not what most people would write. The PyTorch reference does the same thing, that's just d34, probably a side effect of being long-trained on a slightly Wikipedia-leaning corpus.
License is MIT, matching upstream karpathy/nanochat. All the model work is Andrej's; I just wrote the converter, the llama.cpp arch file, and this page.
- Downloads last month
- 590
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
32-bit
Model tree for ulanch/nanochat-d34-GGUF
Base model
karpathy/nanochat-d34