auto-g-nano-1b

A from-scratch 1.05B-parameter decoder-only Transformer, pre-trained on a mixed web + synthetic + code corpus. This is a research / educational model trained on a deliberately small token budget (13B tokens, roughly 60% of Chinchilla-optimal for 1B params), so it should be evaluated as a "hello world" 1B base model, not a production assistant.

Source code: https://github.com/geoffsee/auto-g-nano (branch claude/build-billion-param-model-cOPdo).

Architecture

Llama-style deeper-narrower decoder: RMSNorm + RoPE + Grouped-Query Attention + SwiGLU FFN. Untied embeddings.

field             value
total params      1,050,002,688 (1.050B)
layers            24
embedding dim     1792
query heads       14
KV heads (GQA)    2 (7× key/value sharing)
head dim          128
FFN hidden        5376 (3×d, SwiGLU)
context length    1024
vocab             50,257 (tiktoken gpt2 BPE)
RoPE θ            500,000
precision         bf16 (training)
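
The total in the table can be reproduced from the other rows. The sketch below is an independent recount, not code from the source repo, and it assumes no bias terms, two RMSNorm weight vectors per block plus one final RMSNorm, and a separate (untied) output projection. Under those assumptions the count matches the table exactly.

```python
# Hyperparameters copied from the architecture table.
VOCAB = 50_257
D_MODEL = 1792
N_LAYER = 24
N_HEAD = 14
N_KV_HEAD = 2
HEAD_DIM = 128
FFN_HIDDEN = 5376


def count_params() -> int:
    """Count weights assuming bias-free linears and RMSNorm scale vectors only."""
    q_dim = N_HEAD * HEAD_DIM        # 14 * 128 = 1792
    kv_dim = N_KV_HEAD * HEAD_DIM    # 2 * 128 = 256 (shared by 7 query heads each)

    attn = D_MODEL * q_dim           # query projection
    attn += 2 * D_MODEL * kv_dim     # key and value projections
    attn += q_dim * D_MODEL          # output projection

    ffn = 3 * D_MODEL * FFN_HIDDEN   # SwiGLU: gate, up, and down projections
    norms = 2 * D_MODEL              # pre-attention and pre-FFN RMSNorm

    per_layer = attn + ffn + norms
    embeddings = 2 * VOCAB * D_MODEL  # input embedding + untied output head
    final_norm = D_MODEL
    return embeddings + N_LAYER * per_layer + final_norm


print(f"{count_params():,}")  # 1,050,002,688
```

Roughly 180M of the total sits in the two embedding matrices, which is the cost of leaving them untied at this vocabulary size.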

Training data

A 3-way interleaved stream (HF interleave_datasets with weights):

weight    source
0.40      HuggingFaceTB/smollm-corpus / fineweb-edu-dedup
0.25      HuggingFaceTB/smollm-corpus / cosmopedia-v2
0.35      bigcode/starcoderdata / python (gated)

The code share is intentionally aggressive (35%) compared to SmolLM's natural ratio (2%), so a 1B model trained on a small token budget actually picks up code patterns instead of treating them as noise.
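
The weighted stream described above could be assembled roughly as follows. This is a sketch rather than the repo's actual data loader: the config names, the `data_dir` argument, and the text column names (`text`, `content`) are assumptions about how these datasets are laid out on the Hub, and the seed is arbitrary. The StarCoder subset is gated, so it needs an authenticated Hub login.

```python
# (weight, repo, load_dataset kwargs, text column). Weights come from the table;
# everything else in each tuple is an assumption about the dataset layout.
MIX = [
    (0.40, "HuggingFaceTB/smollm-corpus", {"name": "fineweb-edu-dedup"}, "text"),
    (0.25, "HuggingFaceTB/smollm-corpus", {"name": "cosmopedia-v2"}, "text"),
    (0.35, "bigcode/starcoderdata", {"data_dir": "python"}, "content"),
]


def build_stream(seed: int = 0):
    """Return one streaming dataset that samples the three sources by weight."""
    # Imported here so MIX can be inspected without the `datasets` package.
    from datasets import interleave_datasets, load_dataset

    streams = []
    for _, repo, kwargs, text_col in MIX:
        ds = load_dataset(repo, split="train", streaming=True, **kwargs)
        if text_col != "text":
            ds = ds.rename_column(text_col, "text")
        # Keep a single shared column so the three schemas line up.
        streams.append(ds.select_columns(["text"]))

    return interleave_datasets(
        streams,
        probabilities=[weight for weight, *_ in MIX],
        seed=seed,
    )
```

Calling `next(iter(build_stream()))["text"]` would then yield one document drawn from one of the three sources.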

Variants in this repo

file                       size      notes
model_latest.pt            4.2 GB    base, fp32 (original training output)
model_bf16.pt              2.0 GB    base, bf16 PyTorch
model.safetensors          2.0 GB    base, bf16 safetensors (recommended for inference)
model.onnx + .onnx.data    2.0 GB    base, fp16 ONNX (CPU/MPS-friendly)
model_sft.safetensors      2.0 GB    SFT'd on databricks/databricks-dolly-15k (instruction-following)

SFT variant

Brief supervised fine-tuning on the Dolly-15k instruction dataset (15,011 instruction/response pairs, Alpaca-style prompt template, loss masked to response tokens only). 3 epochs on 1× RTX PRO 6000 Blackwell Workstation Edition, AdamW with peak LR 5e-5 + cosine decay, ~17 min wall-clock.
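
Masking the loss to response tokens is usually done by setting the prompt positions in the label sequence to an ignore value. The sketch below shows that idea on plain token-id lists; it is not the repo's SFT code. The helper name, the appended end-of-text token, and the truncation to 1024 tokens are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_sft_example(prompt_ids, response_ids, eos_id, block_size=1024):
    """Concatenate prompt and response; supervise only the response and EOS.

    Labels are aligned with input_ids here. The usual one-position shift for
    next-token prediction is assumed to happen inside the training loop.
    """
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return input_ids[:block_size], labels[:block_size]


# Toy ids; 50256 is the GPT-2 <|endoftext|> token.
ids, labels = build_sft_example([10, 11, 12], [20, 21], eos_id=50256)
print(ids)     # [10, 11, 12, 20, 21, 50256]
print(labels)  # [-100, -100, -100, 20, 21, 50256]
```

With labels built this way, prompt positions contribute nothing to the loss, so the gradient comes only from the response.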

Use Alpaca-style prompts with this checkpoint:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is photosynthesis?

### Response:

The base model produces incoherent output on instructions; the SFT variant attempts to answer them. It still has the underlying limitations of a 13B-token-pretrained 1B model — broken grammar, repetition, factual errors — but it stays on topic and follows the format. See the source repo's scripts/chat.py --format alpaca for a REPL.
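
For scripted use, the Alpaca-style prompt shown above can be built with a small helper. This is a minimal sketch: the function name is hypothetical, and the exact whitespace (blank lines between sections, a newline after "### Response:") follows the common Alpaca convention and is an assumption about what the SFT run used.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_alpaca(instruction: str) -> str:
    """Wrap a bare instruction in the Alpaca-style template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


prompt = format_alpaca("What is photosynthesis?")
print(prompt)
```

The model's answer is whatever it generates after the final "### Response:" line.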

Training procedure

field                   value
hardware                2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition (PCIe, no NVLink) on RunPod
framework               PyTorch 2.9 + HF accelerate launch, bf16 mixed precision
optimizer               AdamW, β=(0.9, 0.95), wd=0.1
LR schedule             linear warmup → cosine decay, peak ≈ 3e-4
per-proc batch          8
grad accumulation       16
global tokens / step    262,144 (8 × 16 × 1024 × 2 GPUs)
total iters             50,000
total tokens            ~13.1B
wall-clock              ~73 hours
NCCL                    NCCL_P2P_DISABLE=1 (PCIe-only Blackwell)
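
The token budget in the table follows directly from the batch settings, and the schedule row can be sketched as a function of the step. The warmup length and the final learning rate are not stated in this card, so `WARMUP_ITERS` and `MIN_LR` below are hypothetical placeholders; the 20-tokens-per-parameter figure is the usual Chinchilla rule of thumb.

```python
import math

# Values from the training table.
PER_PROC_BATCH = 8
GRAD_ACCUM = 16
BLOCK_SIZE = 1024
N_GPUS = 2
TOTAL_ITERS = 50_000
N_PARAMS = 1_050_002_688
PEAK_LR = 3e-4

# Hypothetical values, not stated in the card.
WARMUP_ITERS = 1_000
MIN_LR = 3e-5

tokens_per_step = PER_PROC_BATCH * GRAD_ACCUM * BLOCK_SIZE * N_GPUS
total_tokens = tokens_per_step * TOTAL_ITERS
chinchilla_tokens = 20 * N_PARAMS  # rule of thumb: ~20 tokens per parameter


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at TOTAL_ITERS."""
    if step < WARMUP_ITERS:
        return PEAK_LR * (step + 1) / WARMUP_ITERS
    progress = min(1.0, (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step)                            # 262144
print(f"{total_tokens / 1e9:.1f}B")               # 13.1B
print(f"{total_tokens / chinchilla_tokens:.0%}")  # 62%
```

The 62% figure is consistent with the "roughly 60% of Chinchilla-optimal" description at the top of the card.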

How to load

The published checkpoint is a state_dict saved with torch.save — it needs the GPT model class from the source repo to reconstruct.

import torch
from huggingface_hub import hf_hub_download
from model import GPT  # from the source repo; model.py must be importable

# Download the fp32 checkpoint and load it as a plain state_dict.
ckpt_path = hf_hub_download("geoffsee/auto-g-nano-1b", "model_latest.pt")
sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Hyperparameters must match the architecture table above.
model = GPT(
    vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2,
    ffn_hidden=5376, block_size=1024, dropout=0.0,
)
model.load_state_dict(sd)
model.eval()  # inference mode

Or just use the source repo's generate.py / scripts/test_inference.py.
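
The bf16 safetensors file is listed as the recommended variant for inference, and loading it might look like the sketch below. It assumes the safetensors file stores the same parameter names as the torch.save checkpoint, which this card does not confirm; the function name is hypothetical. If the key names differ, `load_state_dict` will report the mismatch.

```python
# Same constructor arguments as the torch.load example.
GPT_KWARGS = dict(
    vocab_size=50257, n_embd=1792, n_layer=24, n_head=14, n_kv_head=2,
    ffn_hidden=5376, block_size=1024, dropout=0.0,
)


def load_safetensors_model(filename: str = "model.safetensors"):
    """Download a safetensors checkpoint and load it into the repo's GPT class."""
    # Heavy imports are kept local so GPT_KWARGS can be inspected without them.
    import torch
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file
    from model import GPT  # from the source repo

    path = hf_hub_download("geoffsee/auto-g-nano-1b", filename)
    state_dict = load_file(path, device="cpu")

    # Build the model in bf16 so it matches the stored tensor dtype.
    model = GPT(**GPT_KWARGS).to(torch.bfloat16)
    model.load_state_dict(state_dict)  # assumes matching parameter names
    return model.eval()
```

Passing `filename="model_sft.safetensors"` would load the SFT variant under the same assumption.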

Sample generations

Greedy / top-k sampling, T=0.8, top_k=40, 120 new tokens:

Once upon a time, in a small village near the mountains, lived two best friends named Timmy the Turtle and Sally the Squirrel. They loved exploring the forest together and learning new things! One sunny day, while walking through the forest, they stumbled upon a magical garden filled with colorful flowers. Timmy was excited to try something new and was curious about how his plants adapted to different types of rocks and soil…

def fibonacci(n):

    '''
    Return the number of times n can be Fibonacci numbers with a
    given number of factors.
    '''
    return int(n/2)

Question: Why is the sky blue? Answer: Blue is an invisible colour which makes the sky look blue. The human eye makes very little light when it is in the middle of the spectrum. Some parts of the spectrum are red while others are yellow…

The 12-prompt smoke test (6 prompts × 2 temperatures) scores 9 OK / 3 WEAK / 0 FAIL under the source repo's heuristic verdict checks.

Limitations

  • Undertrained. ~13B tokens is well below the Chinchilla-optimal ~21B for a 1B model, and orders of magnitude below modern best practice (Llama-3 / Mistral / SmolLM-1.7B all use 1T+).
  • Greedy / low-temperature sampling produces repetition loops — a classic undertrained-model failure. Use T ≥ 0.7 with top_k.
  • Hallucinates confidently. Will invent technical-sounding terms ("phototenoids") in factual contexts.
  • No instruction tuning in the base checkpoints. They don't follow instructions, refuse harmful requests, or hold a chat; only the SFT variant (model_sft.safetensors) attempts to follow instructions.
  • Code generation is shallow. It produces syntactically valid Python but the semantics are often wrong.
  • English + Python only. Other natural languages and other programming languages are out-of-distribution.
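
The sampling recommendation in the list above (T ≥ 0.7 with top_k) can be illustrated with a dependency-free sketch. This is not the repo's generate.py: it operates on a plain list of logits for a single step, and the function name and defaults are chosen for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token id after temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")

    # Keep the ids of the k highest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the survivors, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
token = sample_top_k(logits, temperature=0.8, top_k=2, rng=random.Random(0))
# token is 0 or 1; ids 2 and 3 are removed by top_k=2
```

Lower temperatures concentrate probability on the top token, which is the regime where the repetition loops described above tend to appear.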

Intended use

Research, educational reference for "what does a 1B Transformer trained from scratch on a modest budget actually look like", and as a starting point for fine-tuning experiments. Not suitable for any production or user-facing application.

License

Apache-2.0 for the weights and code. Data subsets each retain their own licenses — see HuggingFaceTB/smollm-corpus and bigcode/starcoderdata on the Hugging Face Hub.
