auto-g-nano-1b

A from-scratch 1.05B parameter decoder-only Transformer, pre-trained on a mixed web + synthetic + code corpus. This is a research / educational model trained on a deliberately small token budget (13B tokens, roughly 60% of Chinchilla-optimal for 1B params), so it should be evaluated as a "hello world" 1B base model, not a production assistant.

Source code: https://github.com/geoffsee/auto-g-nano (branch claude/build-billion-param-model-cOPdo).

Architecture

Llama-style deeper-narrower decoder: RMSNorm + RoPE + Grouped-Query Attention + SwiGLU FFN. Untied embeddings.

field	value
total params	1,050,002,688 (1.050B)
layers	24
embedding dim	1792
query heads	14
KV heads (GQA)	2 (7× key/value sharing)
head dim	128
FFN hidden	5376 (3×d, SwiGLU)
context length	1024
vocab	50,257 (tiktoken `gpt2` BPE)
RoPE θ	500,000
precision	bf16 (training)
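
The total in the table can be reproduced from the other rows. The sketch below is only a sanity check, not the repo's own counting code; it assumes bias-free linear layers, two RMSNorm weight vectors per block, and one final RMSNorm, none of which the card states explicitly.

```python
# Parameter count derived from the architecture table.
# Assumptions (not stated in the card): no bias terms in any linear layer,
# two RMSNorm weight vectors per block, and one final RMSNorm.
vocab, d_model, n_layer = 50_257, 1_792, 24
n_head, n_kv_head, head_dim, ffn_hidden = 14, 2, 128, 5_376

assert n_head * head_dim == d_model  # 14 x 128 = 1792

kv_dim = n_kv_head * head_dim        # 256: each KV head serves 7 query heads
attn = d_model * d_model             # query projection
attn += 2 * d_model * kv_dim         # key and value projections (GQA)
attn += d_model * d_model            # output projection
ffn = 3 * d_model * ffn_hidden       # SwiGLU: gate, up, and down matrices
norms = 2 * d_model                  # pre-attention and pre-FFN RMSNorm
per_layer = attn + ffn + norms

embeddings = 2 * vocab * d_model     # untied: input table plus output head
total = embeddings + n_layer * per_layer + d_model  # + final RMSNorm

print(f"{total:,}")  # 1,050,002,688
```

Under these assumptions the result matches the listed 1,050,002,688 exactly, which suggests the layout above is at least close to the actual one.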

Training data

A 3-way interleaved stream (HF interleave_datasets with weights):

weight	source
0.40	`HuggingFaceTB/smollm-corpus` / `fineweb-edu-dedup`
0.25	`HuggingFaceTB/smollm-corpus` / `cosmopedia-v2`
0.35	`bigcode/starcoderdata` / `python` (gated)

The code share is intentionally aggressive (~35%) compared to SmolLM's natural ratio (~2%), so a 1B model trained on a small token budget actually picks up code patterns instead of treating them as noise.

Variants in this repo

file	size	notes
`model_latest.pt`	4.2 GB	base, fp32 (original training output)
`model_bf16.pt`	2.0 GB	base, bf16 PyTorch
`model.safetensors`	2.0 GB	base, bf16 safetensors (recommended for inference)
`model.onnx` + `.onnx.data`	2.0 GB	base, fp16 ONNX (CPU/MPS-friendly)
`model_sft.safetensors`	2.0 GB	SFT'd on `databricks/databricks-dolly-15k` — instruction-following

SFT variant

Brief supervised fine-tuning on the Dolly-15k instruction dataset (15,011 instruction/response pairs, Alpaca-style prompt template, loss masked to response tokens only). 3 epochs on 1× RTX PRO 6000 Blackwell Workstation Edition, AdamW with peak LR 5e-5 + cosine decay, ~17 min wall-clock.
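
Masking the loss to response tokens is commonly done by setting the prompt positions of the label sequence to an ignore value. The sketch below shows the idea with hypothetical token ids; `build_labels` is an illustrative helper, not the repo's function, and the use of -100 (the default `ignore_index` of PyTorch's cross-entropy loss) is an assumption about how the masking was implemented.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt part of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids, for illustration only.
prompt_ids = [101, 102, 103, 104]
response_ids = [201, 202, 203]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

The usual one-position shift for next-token prediction is applied later, when the loss is computed, so only predictions of response tokens contribute to the gradient.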

Use Alpaca-style prompts with this checkpoint:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is photosynthesis?

### Response:

The base model produces incoherent output on instructions; the SFT variant attempts to answer them. It still has the underlying limitations of a 13B-token-pretrained 1B model — broken grammar, repetition, factual errors — but it stays on topic and follows the format. See the source repo's scripts/chat.py --format alpaca for a REPL.
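
A small helper for building the prompt shown above might look like this. The exact whitespace (blank lines between sections, a newline after `### Response:`) is an assumption based on the common Alpaca template; the source repo's scripts/chat.py is the reference for what the SFT run actually used.

```python
# Assumed layout of the Alpaca-style template; verify against the source repo.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction in the Alpaca-style prompt."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())

prompt = build_alpaca_prompt("What is photosynthesis?")
print(prompt)
```

Generation then continues from the end of the prompt, so the model's output is the text that follows `### Response:`.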

Training procedure

field	value
hardware	2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (PCIe, no NVLink) on RunPod
framework	PyTorch 2.9 + HF `accelerate launch`, bf16 mixed precision
optimizer	AdamW, β=(0.9, 0.95), wd=0.1
LR schedule	linear warmup → cosine decay, peak ≈ 3e-4
per-proc batch	8
grad accumulation	16
global tokens / step	262,144 (8 × 16 × 1024 × 2 GPUs)
total iters	50,000
total tokens	~13.1B
wall-clock	~73 hours
NCCL	`NCCL_P2P_DISABLE=1` (PCIe-only Blackwell)
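
The batch arithmetic and the shape of the learning-rate schedule can be sketched as follows. The token counts come straight from the table; the warmup length (2,000 iterations) and the floor learning rate (10% of peak) are hypothetical values chosen for illustration, since the card does not give them.

```python
import math

# Token budget, from the table above.
per_proc_batch, grad_accum, context_len, n_gpus = 8, 16, 1024, 2
tokens_per_step = per_proc_batch * grad_accum * context_len * n_gpus
total_iters = 50_000
total_tokens = tokens_per_step * total_iters
print(f"{tokens_per_step:,} tokens/step, {total_tokens / 1e9:.1f}B tokens total")

def lr_at(step, peak_lr=3e-4, warmup_iters=2_000, total_iters=50_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_iters and min_lr are assumed values, not taken from the card.
    """
    if step < warmup_iters:
        return peak_lr * (step + 1) / warmup_iters
    progress = min((step - warmup_iters) / (total_iters - warmup_iters), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (peak_lr - min_lr)

for step in (0, 1_999, 26_000, 50_000):
    print(step, f"{lr_at(step):.2e}")
```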

How to load

The published checkpoint is a state_dict saved with torch.save — it needs the GPT model class from the source repo to reconstruct.

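import torch
from huggingface_hub import hf_hub_download
from model import GPT  # model class from the source repo (must be importable)

# Download the fp32 base checkpoint from the Hub (cached after the first call).
ckpt_path = hf_hub_download("geoffsee/auto-g-nano-1b", "model_latest.pt")
# weights_only=True restricts unpickling to tensors and plain containers.
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Hyperparameters must match the architecture table above.
model = GPT(
    vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2,
    ffn_hidden=5376, block_size=1024, dropout=0.0,
)
model.load_state_dict(sd)
model.eval()  # switch to inference mode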

Or just use the source repo's generate.py / scripts/test_inference.py.

Sample generations

Top-k sampling, T=0.8, top_k=40, 120 new tokens:

Once upon a time, in a small village near the mountains, lived two best friends named Timmy the Turtle and Sally the Squirrel. They loved exploring the forest together and learning new things! One sunny day, while walking through the forest, they stumbled upon a magical garden filled with colorful flowers. Timmy was excited to try something new and was curious about how his plants adapted to different types of rocks and soil…

def fibonacci(n):

    '''
    Return the number of times n can be Fibonacci numbers with a
    given number of factors.
    '''
    return int(n/2)

Question: Why is the sky blue? Answer: Blue is an invisible colour which makes the sky look blue. The human eye makes very little light when it is in the middle of the spectrum. Some parts of the spectrum are red while others are yellow…

The 12-prompt smoke test (6 prompts × 2 temperatures) scores 9 OK / 3 WEAK / 0 FAIL under the source repo's heuristic verdict checks.
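
For reference, temperature plus top-k sampling with the settings used for the samples (T=0.8, top_k=40) can be sketched in plain Python. This is a generic illustration over a list of logits, not the source repo's generate.py; `sample_top_k` is a hypothetical helper.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token index from the top_k highest logits at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

# Fake logits over a 1,000-token vocabulary, for illustration only.
rng = random.Random(0)
fake_logits = [rng.gauss(0.0, 2.0) for _ in range(1_000)]
print(sample_top_k(fake_logits, temperature=0.8, top_k=40, rng=rng))
```

Lower temperatures concentrate probability on the highest logit, and `top_k=1` reduces to greedy decoding.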

Limitations

Undertrained. ~13B tokens is well below the Chinchilla-optimal ~21B for a 1B model, and orders of magnitude below modern best practice (Llama-3 / Mistral / SmolLM-1.7B all use 1T+).
Greedy / low-temperature sampling produces repetition loops — a classic undertrained-model failure. Use T ≥ 0.7 with top_k.
Hallucinates confidently. Will invent technical-sounding terms ("phototenoids") in factual contexts.
No instruction tuning in the base checkpoints. The base model doesn't follow instructions, refuse harmful requests, or hold a chat; only the SFT variant described above attempts to follow instructions.
Code generation is shallow. It produces syntactically valid Python but the semantics are often wrong.
English + Python only. Other natural languages and other programming languages are out-of-distribution.
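
The "undertrained" point above can be put in numbers. The sketch assumes the common rule of thumb of about 20 training tokens per parameter as the Chinchilla-optimal budget, which is an approximation rather than an exact law.

```python
params = 1_050_002_688            # total parameters, from the architecture table
tokens_seen = 50_000 * 262_144    # iterations x global tokens per step

TOKENS_PER_PARAM = 20             # rule-of-thumb Chinchilla ratio (assumption)
optimal_tokens = TOKENS_PER_PARAM * params
ratio = tokens_seen / optimal_tokens

print(f"{tokens_seen / 1e9:.1f}B / {optimal_tokens / 1e9:.1f}B = {ratio:.0%}")
# 13.1B / 21.0B = 62%
```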

Intended use

Research, educational reference for "what does a 1B Transformer trained from scratch on a modest budget actually look like", and as a starting point for fine-tuning experiments. Not suitable for any production or user-facing application.

License

Apache-2.0 for the weights and code. Data subsets each retain their own licenses — see HuggingFaceTB/smollm-corpus and bigcode/starcoderdata on the Hugging Face Hub.
