auto-g-nano-1b
A from-scratch 1.05B parameter decoder-only Transformer, pre-trained on a
mixed web + synthetic + code corpus. This is a research / educational model
trained on a deliberately small token budget (13B tokens, roughly 60% of
Chinchilla-optimal for 1B params), so it should be evaluated as a "hello
world" 1B base model, not a production assistant.
Source code: https://github.com/geoffsee/auto-g-nano (branch
claude/build-billion-param-model-cOPdo).
Architecture
Llama-style deeper-narrower decoder. RMSNorm + RoPE + Grouped-Query Attention
+ SwiGLU FFN. Untied embeddings.
| field | value |
|---|---|
| total params | 1,050,002,688 (1.050B) |
| layers | 24 |
| embedding dim | 1792 |
| query heads | 14 |
| KV heads (GQA) | 2 (7× key/value sharing) |
| head dim | 128 |
| FFN hidden | 5376 (3×d, SwiGLU) |
| context length | 1024 |
| vocab | 50,257 (tiktoken gpt2 BPE) |
| RoPE θ | 500,000 |
| precision | bf16 (training) |
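As a sanity check, the total in the table can be reproduced from the other rows. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per block plus one final norm, and no learned positional parameters (RoPE adds none). These are the usual Llama-style conventions rather than details the card states, but under them the arithmetic lands exactly on the listed total.

```python
# Parameter-count sketch for the configuration in the table above.
# Assumptions (not stated in the card): no bias terms, two RMSNorms per
# layer plus one final RMSNorm, RoPE contributes no learned parameters.

vocab, d, n_layer = 50_257, 1_792, 24
n_head, n_kv_head, head_dim, ffn = 14, 2, 128, 5_376

assert n_head * head_dim == d  # 14 x 128 = 1792

# Untied embeddings: separate input embedding and output projection.
embed = 2 * vocab * d

# GQA attention: full-width Q and O projections, narrow K and V (2 KV heads).
attn = 2 * d * d + 2 * d * (n_kv_head * head_dim)

# SwiGLU FFN: gate, up, and down projections.
mlp = 3 * d * ffn

# Two RMSNorm weight vectors per layer.
norms = 2 * d
per_layer = attn + mlp + norms

total = embed + n_layer * per_layer + d  # trailing d is the final RMSNorm
print(f"{total:,}")  # 1,050,002,688
```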
Training data
A 3-way interleaved stream (HF interleave_datasets with weights):
| weight | source |
|---|---|
| 0.40 | HuggingFaceTB/smollm-corpus / fineweb-edu-dedup |
| 0.25 | HuggingFaceTB/smollm-corpus / cosmopedia-v2 |
| 0.35 | bigcode/starcoderdata / python (gated) |
The code share is intentionally aggressive (35%) compared to SmolLM's
natural ratio (2%), so a 1B model trained on a small token budget actually
picks up code patterns instead of treating them as noise.
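To make the mixing concrete, here is a minimal pure-Python stand-in for weighted interleaving. It is not the repo's data loader: the real pipeline uses Hugging Face `interleave_datasets`, which accepts a `probabilities` list, while the `interleave` and `fake_stream` helpers and the short source labels below are hypothetical and exist only to illustrate the sampling behaviour.

```python
import collections
import itertools
import random
from typing import Dict, Iterator, Tuple


def interleave(streams: Dict[str, Iterator[str]],
               weights: Dict[str, float],
               seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (source, sample) pairs, choosing the source by weight each step.

    Stops as soon as the chosen source is exhausted.
    """
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(streams[name])
        except StopIteration:
            return


def fake_stream(name: str) -> Iterator[str]:
    """Hypothetical endless source standing in for a streamed dataset."""
    for i in itertools.count():
        yield f"{name}-{i}"


# Mixture weights from the table above (labels shortened for illustration).
weights = {"fineweb-edu-dedup": 0.40, "cosmopedia-v2": 0.25, "starcoderdata-python": 0.35}
streams = {name: fake_stream(name) for name in weights}

draws = itertools.islice(interleave(streams, weights), 10_000)
counts = collections.Counter(source for source, _ in draws)
for name, n in counts.items():
    print(f"{name}: {n / 10_000:.3f}")
```

Over enough draws the observed shares approach the configured weights, which is the property the 35% code share relies on.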
Variants in this repo
| file | size | notes |
|---|---|---|
| model_latest.pt | 4.2 GB | base, fp32 (original training output) |
| model_bf16.pt | 2.0 GB | base, bf16 PyTorch |
| model.safetensors | 2.0 GB | base, bf16 safetensors (recommended for inference) |
| model.onnx + .onnx.data | 2.0 GB | base, fp16 ONNX (CPU/MPS-friendly) |
| model_sft.safetensors | 2.0 GB | SFT'd on databricks/databricks-dolly-15k — instruction-following |
SFT variant
Brief supervised fine-tuning on the Dolly-15k instruction dataset (15,011 instruction/response pairs, Alpaca-style prompt template, loss masked to response tokens only). 3 epochs on 1× RTX PRO 6000 Blackwell Workstation Edition, AdamW with peak LR 5e-5 + cosine decay, ~17 min wall-clock.
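Response-only loss masking, as mentioned above, is commonly implemented by setting the label at every prompt position to an ignore value so that cross-entropy skips it. The sketch below assumes the value -100, which is the default `ignore_index` of `torch.nn.CrossEntropyLoss`; the `build_example` helper and the token ids are hypothetical, and the repo's own SFT script may do this differently.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids: List[int],
                  response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with prompt positions excluded from the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Hypothetical token ids: three prompt tokens followed by two response tokens.
input_ids, labels = build_example([10, 11, 12], [20, 21])
print(input_ids)  # [10, 11, 12, 20, 21]
print(labels)     # [-100, -100, -100, 20, 21]
```

The usual one-position shift between inputs and targets is assumed to happen inside the training loop, so the two lists here stay aligned index by index.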
Use Alpaca-style prompts with this checkpoint:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What is photosynthesis?
### Response:
The base model produces incoherent output on instructions; the SFT variant attempts
to answer them. It still has the underlying limitations of a 13B-token-pretrained 1B
model — broken grammar, repetition, factual errors — but it stays on topic and follows
the format. See the source repo's scripts/chat.py --format alpaca for a REPL.
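The Alpaca-style template above can be assembled with a small helper. This is a sketch: the blank lines between sections follow the standard Alpaca layout and are an assumption, so check the source repo's chat script for the exact whitespace used during fine-tuning. `build_prompt` is a hypothetical name.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca-style template (assumed spacing)."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


prompt = build_prompt("What is photosynthesis?")
print(prompt)
```

Generation should continue from the end of the returned string, so the model's output begins directly after the `### Response:` line.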
Training procedure
| field | value |
|---|---|
| hardware | 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (PCIe, no NVLink) on RunPod |
| framework | PyTorch 2.9 + HF accelerate launch, bf16 mixed precision |
| optimizer | AdamW, β=(0.9, 0.95), wd=0.1 |
| LR schedule | linear warmup → cosine decay, peak ≈ 3e-4 |
| per-proc batch | 8 |
| grad accumulation | 16 |
| global tokens / step | 262,144 (8 × 16 × 1024 × 2 GPUs) |
| total iters | 50,000 |
| total tokens | ~13.1B |
| wall-clock | ~73 hours |
| NCCL | NCCL_P2P_DISABLE=1 (PCIe-only Blackwell) |
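The token-budget rows in the table follow from one another, and the schedule row can be written as a short function. In the sketch below, the 20-tokens-per-parameter Chinchilla rule of thumb is a common approximation, and the warmup length and minimum learning rate are placeholder assumptions because the card does not state them.

```python
import math

# --- Token budget, from the table above ---
per_proc_batch, grad_accum, block_size, n_gpus = 8, 16, 1024, 2
tokens_per_step = per_proc_batch * grad_accum * block_size * n_gpus  # 262,144
total_iters = 50_000
total_tokens = tokens_per_step * total_iters  # 13,107,200,000

# Chinchilla rule of thumb: roughly 20 training tokens per parameter.
n_params = 1_050_002_688
chinchilla_tokens = 20 * n_params
budget_fraction = total_tokens / chinchilla_tokens  # about 0.62

# Average throughput implied by the ~73 h wall-clock figure.
tokens_per_sec = total_tokens / (73 * 3600)

# --- LR schedule: linear warmup, then cosine decay ---
PEAK_LR = 3e-4
WARMUP_ITERS = 1_000  # assumption: not stated in the card
MIN_LR = 3e-5         # assumption: not stated in the card


def lr_at(it: int) -> float:
    """Learning rate at iteration `it` under warmup + cosine decay."""
    if it < WARMUP_ITERS:
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    progress = min(1.0, (it - WARMUP_ITERS) / (total_iters - WARMUP_ITERS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(f"tokens/step: {tokens_per_step:,}")
print(f"total tokens: {total_tokens:,}")
print(f"fraction of Chinchilla budget: {budget_fraction:.2f}")
print(f"average tokens/s: {tokens_per_sec:,.0f}")
```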
How to load
The published checkpoint is a state_dict saved with torch.save — it
needs the GPT model class from the source repo to reconstruct.
import torch
from huggingface_hub import hf_hub_download
from model import GPT  # from the source repo

ckpt_path = hf_hub_download("geoffsee/auto-g-nano-1b", "model_latest.pt")
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)

model = GPT(
    vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2,
    ffn_hidden=5376, block_size=1024, dropout=0.0,
)
model.load_state_dict(sd)
model.eval()
Or just use the source repo's generate.py / scripts/test_inference.py.
Sample generations
Decoding: greedy or top-k sampling (T=0.8, top_k=40), 120 new tokens per sample:
Once upon a time, in a small village near the mountains, lived two best friends named Timmy the Turtle and Sally the Squirrel. They loved exploring the forest together and learning new things! One sunny day, while walking through the forest, they stumbled upon a magical garden filled with colorful flowers. Timmy was excited to try something new and was curious about how his plants adapted to different types of rocks and soil…
def fibonacci(n):
    ''' Return the number of times n can be Fibonacci numbers with a given number of factors. '''
    return int(n/2)
Question: Why is the sky blue? Answer: Blue is an invisible colour which makes the sky look blue. The human eye makes very little light when it is in the middle of the spectrum. Some parts of the spectrum are red while others are yellow…
The 12-prompt smoke test (6 prompts × 2 temperatures) scores 9 OK / 3 WEAK / 0 FAIL under the source repo's heuristic verdict checks.
Limitations
- Undertrained. ~13B tokens is well below the Chinchilla-optimal ~21B for a 1B model, and orders of magnitude below modern best practice (Llama-3 / Mistral / SmolLM-1.7B all use 1T+ tokens).
- Greedy / low-temperature sampling produces repetition loops — a classic undertrained-model failure. Use T ≥ 0.7 with top_k.
- Hallucinates confidently. Will invent technical-sounding terms ("phototenoids") in factual contexts.
- No instruction tuning in the base checkpoints. The base model doesn't follow instructions, refuse harmful requests, or hold a chat; only model_sft.safetensors has had the brief Dolly-15k fine-tuning described above.
- Code generation is shallow. It produces syntactically valid Python but the semantics are often wrong.
- English + Python only. Other natural languages and other programming languages are out-of-distribution.
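The sampling advice in the list above (T ≥ 0.7 with top_k) can be illustrated with a dependency-free sketch. This is not the repo's generate.py; `sample_top_k` is a hypothetical helper that works on a plain list of logits for a single decoding step.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float], temperature: float = 0.8,
                 top_k: int = 40,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    if rng is None:
        rng = random.Random()
    k = min(top_k, len(logits))
    # Keep the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top_ids]
    # Unnormalised softmax over the kept logits (max subtracted for stability).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy logits for a 5-token vocabulary; with top_k=2 only ids 0 and 1 can appear.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
draws = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Lower temperatures concentrate probability on the top logit and approach greedy decoding, which is where the repetition loops noted above tend to appear.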
Intended use
Research, educational reference for "what does a 1B Transformer trained from scratch on a modest budget actually look like", and as a starting point for fine-tuning experiments. Not suitable for any production or user-facing application.
License
Apache-2.0 for the weights and code. Data subsets each retain their own
licenses — see HuggingFaceTB/smollm-corpus and bigcode/starcoderdata
on the Hugging Face Hub.