🧠 Nano-Nano v5.1

~1218.3 M · Qwen3 · 300M · GQA + QK-Norm · Sequence-Packed · 26 Datasets


Fully redesigned successor to Nano-Nano v4.5.
A ~298M-parameter Qwen3 model trained with sequence packing on a quality-tiered mix of 26 datasets. A loss-boost system automatically extends training when the loss is still above 4.95, adding up to 3 rounds of 75 steps.
Goal: loss < 2.5 through compute efficiency, not raw scale.
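
The loss-boost rule described above can be sketched as a small control loop. This is a minimal illustration, not the actual training script: `run_steps` is a hypothetical callback that trains for a given number of steps and returns the current loss, and `base_steps=1000` in the demo is an arbitrary value chosen for the example.

```python
from typing import Callable


def train_with_loss_boost(
    run_steps: Callable[[int], float],
    base_steps: int,
    threshold: float = 4.95,   # boost trigger from the card
    boost_steps: int = 75,     # steps per boost round
    max_boosts: int = 3,       # at most 3 rounds
) -> tuple[float, int]:
    """Run the base schedule, then extend while the loss stays above threshold.

    run_steps(n) is a hypothetical callback: train n steps, return the loss.
    Returns (final_loss, total_steps_run).
    """
    loss = run_steps(base_steps)
    total = base_steps
    for _ in range(max_boosts):
        if loss <= threshold:
            break  # loss is low enough, no further boost needed
        loss = run_steps(boost_steps)
        total += boost_steps
    return loss, total


# Demo with a fake trainer that reports a scripted loss after each call.
fake_losses = iter([5.4, 5.1, 4.8])
final_loss, steps = train_with_loss_boost(lambda n: next(fake_losses), base_steps=1000)
print(final_loss, steps)  # 4.8 1150 (two boost rounds were needed)
```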


📋 Summary

| Field | Value |
|---|---|
| Architecture | Qwen3 decoder-only |
| Parameters | ~1218.3 M |
| Context | 2 048 tokens |
| Vocabulary | 50 304 tokens |
| Training loss | 2.0444 |
| Eval score | 16.7% |
| Tokens trained | 0.01 B (sequence-packed) |
| Hardware | GTX 1080 8 GB (Pascal) |

🏗️ Architecture (v4 → v4.5 → v5.1)

| Hyperparameter | v4 | v4.5 | v5.1 |
|---|---|---|---|
| Architecture | LLaMA | LLaMA | Qwen3 |
| Parameters | ~236 M | ~256 M | ~1218.3 M (~1.218 B) |
| hidden_size | 896 | 896 | 1 024 |
| intermediate_size | 2 688 | 2 912 | 2 730 (≈ 8/3 × hidden) |
| num_hidden_layers | 14 | 15 | 16 |
| num_attention_heads | 14 | 14 | 16 |
| num_key_value_heads | 14 (full) | 14 (full) | 8 (GQA, 16Q/8KV) |
| head_dim | 64 | 64 | 64 |
| QK-Norm | | | ✅ |
| vocab_size | 50 264 | 50 264 | 50 304 |
| max_position_embeddings | 1 024 | 2 048 | 2 048 |
| rms_norm_eps | 1e-6 | 1e-6 | 1e-5 |
| rope_theta | 10 000 | 10 000 | 1M |
| rope_scaling | linear 2× | linear 2× | |
| tie_word_embeddings | False | False | False |
| Sequence packing | | ❌ | ✅ 1× packed |
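
As a cross-check on the hyperparameters, the sketch below counts weights for the v5.1 column, taking the GQA row (16 query heads, 8 KV heads). It assumes a standard Qwen3-style block: bias-free linear layers, a three-projection SwiGLU MLP, two RMSNorms per layer, per-head QK-Norm weights, and untied embeddings. `estimate_params` is a hypothetical helper, and the result is a back-of-the-envelope figure under those assumptions, not the reported parameter count.

```python
def estimate_params(hidden, intermediate, layers, q_heads, kv_heads,
                    head_dim, vocab, tied=False):
    """Rough weight count for a Qwen3-style decoder (assumed layout, no biases)."""
    # q_proj and o_proj use all query heads; k_proj and v_proj use the KV heads.
    attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim
    qk_norm = 2 * head_dim              # q_norm and k_norm weights
    mlp = 3 * hidden * intermediate     # gate, up, and down projections
    norms = 2 * hidden                  # input and post-attention RMSNorm
    per_layer = attn + qk_norm + mlp + norms
    embed = vocab * hidden
    lm_head = 0 if tied else vocab * hidden
    final_norm = hidden
    return embed + layers * per_layer + final_norm + lm_head


total = estimate_params(hidden=1024, intermediate=2730, layers=16,
                        q_heads=16, kv_heads=8, head_dim=64,
                        vocab=50304, tied=False)
print(f"{total / 1e6:.1f} M")  # 287.6 M
```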

📊 Evaluation

| Category | Hits | Score |
|---|---|---|
| Knowledge | 0/5 | 🔴 0% |
| Reasoning | 0/4 | 🔴 0% |
| Hallucination | 0/4 | 🔴 0% |
| Instruction | 2/4 | 🟡 50% |
| Coherence | 1/3 | 🔴 33% |
| Overall | | 🔴 17% |

Hallucination resistance tests whether the model correctly declines or hedges on unanswerable questions (future events, fictional entities, impossible premises).
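
The overall score (17% here, 16.7% in the summary) matches an unweighted mean of the five category scores rather than total hits over total questions. That reading is inferred from the numbers and is not stated by the card; the short sketch below shows both calculations using the hit counts from the table.

```python
# Hits and question counts per category, copied from the evaluation table.
hits = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}

# Macro average: mean of per-category accuracies (each category counts equally).
per_category = {name: h / n for name, (h, n) in hits.items()}
macro = sum(per_category.values()) / len(per_category)

# Micro average: total hits over total questions.
micro = sum(h for h, _ in hits.values()) / sum(n for _, n in hits.values())

print(f"macro: {macro:.1%}")  # macro: 16.7%
print(f"micro: {micro:.1%}")  # micro: 15.0%
```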

[Figures: Category Scores · Hallucination Resistance · Training Curves]


🍳 Training

What's new in v5.1

| Change | v4.5 | v5.1 | Why |
|---|---|---|---|
| Sequence packing | ❌ padding waste | ✅ 100% tokens | ~3× more signal per step |
| Dataset quality | mixed web+instruction | GPT-4 quality-tiered | Faster loss reduction |
| Parameters | ~256 M | ~1218.3 M (~1.218 B) | Better capacity |
| Datasets | 15 | 26 | More diversity |
| LR | 1e-4 | 2e-4 | 1e-4 was too conservative |
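
To illustrate what sequence packing means in practice, here is a minimal sketch of one common approach: concatenate tokenized examples with an end-of-sequence separator and cut the stream into fixed-length chunks, so no position is spent on padding. The function name, the `eos_id` value, and the choice to drop the final partial chunk are assumptions for illustration; the card does not describe the actual packer, and this simple version does not mask attention across document boundaries.

```python
from typing import Iterable, Iterator


def pack_sequences(token_streams: Iterable[list[int]],
                   chunk_len: int, eos_id: int) -> Iterator[list[int]]:
    """Yield fixed-length chunks from a stream of tokenized examples."""
    buffer: list[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the boundary between examples
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
    # Any leftover shorter than chunk_len is dropped (an assumption).


# Toy example: three "documents", chunk length 4, EOS id 0.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack_sequences(docs, chunk_len=4, eos_id=0))
print(chunks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every chunk is full, so each training step sees `chunk_len` real tokens per row instead of a mix of tokens and padding.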

Settings

| Setting | Value |
|---|---|
| Hardware | GTX 1080 8 GB · Pascal · compute capability 6.1 |
| Precision | fp32 weights / fp16 AMP (GradScaler) |
| Optimizer | StovetopCooker (HyperNix, pre-Volta) + cosine |
| LR | 0.0002 (2e-4), cosine schedule |
| Warmup | 8% |
| Embedding freeze | First 20% of steps |
| Effective batch | 8 × 512 = 4,096 tokens/step |
| Loss boost | ≤3 rounds of 75 steps if loss > 4.95 |
| Sequence packing | ✅ streaming, 1 epoch, capped at 150,000 chunks |
| Grad clipping | 5.0 |
| Grad checkpointing | |
| Peak VRAM | 5.44 GB |
| Final loss | 2.0444 |
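
The learning-rate settings above (peak 2e-4, 8% warmup, cosine decay) can be sketched as a pure function of the step index. The linear warmup shape, the decay floor of 0, and the `total_steps=1000` used in the demo are assumptions; the card gives only the peak rate, the warmup fraction, and the word "cosine".

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_frac: float = 0.08, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Print a few points of the schedule for a hypothetical 1000-step run.
for s in (0, 79, 80, 540, 1000):
    print(s, f"{lr_at(s, total_steps=1000):.2e}")
```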

Dataset Mix (26 datasets, quality-tiered)

| Tier | Dataset | Samples | Weight | Category |
|---|---|---|---|---|
| 1 | Open-Orca/OpenOrca | 40 k | 3.0× | GPT-4 reasoning |
| 1 | meta-math/MetaMathQA | 30 k | 2.8× | Math augmentation |
| 1 | Roman1111111/claude-opus-4.6-10000x | 10 k | 2.5× | Claude conversations |
| 1 | WizardLM/WizardLM_evol_instruct_V2_196k | 25 k | 2.5× | Evolved instruction |
| 1 | WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K | 25 k | 2.5× | Reasoning traces |
| 1 | open-thoughts/OpenThoughts-TB-dev | 20 k | 2.3× | Verified thinking traces |
| 2 | microsoft/orca-math-word-problems-200k | 20 k | 2.2× | Math word problems |
| 2 | lighteval/MATH-Hard | 10 k | 2.2× | Hard math |
| 2 | HuggingFaceH4/MATH-500 | 500 | 2.2× | Competition math |
| 2 | garage-bAInd/Open-Platypus | 25 k | 2.0× | Reasoning instruction |
| 2 | teknium/OpenHermes-2.5 | 30 k | 2.0× | GPT-4 instruction |
| 3 | ise-uiuc/Magicoder-OSS-Instruct-75K | 20 k | 1.8× | Code instruction |
| 3 | m-a-p/CodeFeedback-Filtered-Instruction | 15 k | 1.8× | Code + feedback |
| 3 | iamtarun/python_code_instructions_18k_alpaca | 8 k | 1.6× | Python code |
| 3 | nvidia/OpenCodeInstruct | 20 k | 1.5× | Code instruction |
| 3 | b-mc2/sql-create-context | 6 k | 1.4× | SQL generation |
| 3 | flytech/python-codes-25k | 20 k | 1.7× | Python code solutions |
| 3 | ByteDance-Seed/Code-Contests-Plus | 15 k | 1.6× | Competitive coding |
| 4 | HuggingFaceH4/ultrachat_200k | 30 k | 1.5× | Multi-turn chat |
| 4 | databricks/databricks-dolly-15k | 15 k | 1.2× | Instruction following |
| 4 | Amod/mental_health_counseling_conversations | 5 k | 1.0× | Counseling chat |
| 4 | mlabonne/guanaco-llama2-1k | 1 k | 1.0× | General QA |
| 5 | ray0rf1re/FineWeb-Nano | 20 k | 0.8× | Web text |
| 5 | ray0rf1re/hyper-pip | 85 | 3.0× | HyperNix pip data |
| 6 | Nix-ai/cat-math-v1 | 5 k | 0.3× | Cat math (niche) |
| 6 | Nix-ai/Cat-v2.8XXXL-plus | 5 k | 0.3× | Cat general (niche) |
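
One plausible way to read the Weight column is as a multiplier on each dataset's sample count when drawing training examples. The card does not say how the weights are applied, so the sketch below is an assumption: it turns (samples, weight) pairs into sampling probabilities for three rows of the table. `sampling_probs` is a hypothetical helper.

```python
# (samples, weight) for three rows of the dataset table.
mix = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "meta-math/MetaMathQA": (30_000, 2.8),
    "ray0rf1re/FineWeb-Nano": (20_000, 0.8),
}


def sampling_probs(mix: dict[str, tuple[int, float]]) -> dict[str, float]:
    """Probability of drawing from each dataset, proportional to samples x weight."""
    mass = {name: n * w for name, (n, w) in mix.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


probs = sampling_probs(mix)
for name, p in probs.items():
    print(f"{name}: {p:.1%}")
```

Under this reading, a high weight on a small dataset (such as the 85-sample hyper-pip set at 3.0×) still contributes little overall, because the sample count dominates the product.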

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-Nano_v5.1", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # <think> opens the reasoning block; model outputs reasoning then </think> then answer
    text = ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n")
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.7, top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()

print(chat("Write a Python function to merge two sorted lists."))
print(chat("Solve: if 3x + 7 = 22, what is x?"))
print(chat("Explain transformer attention in simple terms."))
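
Because the prompt ends with an open `<think>` block, the decoded text contains the reasoning, then `</think>`, then the answer. A small helper can separate the two. This is a sketch that assumes `</think>` survives decoding as literal text, which depends on how the tokenizer registers that token; `split_reasoning` is a hypothetical name, not part of the model's API.

```python
def split_reasoning(generated: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer) at the first closing tag.

    If the tag is missing (for example, generation hit max_new_tokens while
    still reasoning), the whole text is returned as reasoning with an empty answer.
    """
    reasoning, sep, answer = generated.partition(close_tag)
    if not sep:
        return generated.strip(), ""
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("3x = 15, so x = 5.\n</think>\nx = 5")
print(answer)  # x = 5
```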

⚠️ Limitations

  • Context limited to 2 048 tokens
  • Trained on 0.01 B tokens, far below production scale
  • Pascal GPU (GTX 1080): fp16 AMP only, no bf16
  • Not RLHF/DPO aligned

📜 Citation

@misc{nano-nano-v5,
  author       = {ray0rf1re},
  title        = {Nano-Nano v5.1: 300M LLaMA with Sequence Packing},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {https://huggingface.co/ray0rf1re/Nano-Nano_v5.1},
}