🧠 Nano-nano v4.7

~296M · Qwen3-style · Custom BPE Tokenizer · ChatML + Thinking · 43 Datasets

Successor to Nano-nano v4.6, redesigned with a custom corpus-trained BPE tokenizer, native thinking / chain-of-thought support, and a quality-tiered 43-dataset mix. Sequence packing fills every training sequence completely, so no tokens are spent on padding (100% token utilisation).


📋 Summary

| Property | Value |
|---|---|
| Architecture | Qwen3-style LLaMA decoder |
| Parameters | ~296 M |
| Context | 1 024 tokens (trained) / 2 048 (config max) |
| Tokenizer | Custom BPE, vocab = 49 664 |
| Chat format | ChatML with <think> reasoning |
| Hardware | NVIDIA GTX 1080 8 GB (Pascal) |
| Sequence packing | ✅ 100% token utilisation |

🏗️ Architecture

Qwen3-style decoder with GQA and QK-Norm, scaled to ~296 M parameters with a custom tokenizer of about 50k entries (vocab_size = 49 664).

| Hyperparameter | v4.6 | v4.7 |
|---|---|---|
| Parameters | ~256 M | ~296 M |
| hidden_size | 896 | 1 024 |
| num_hidden_layers | 15 | 20 |
| num_attention_heads | 14 | 16 |
| num_key_value_heads | 14 | 8 (GQA) |
| head_dim | 64 | 64 |
| intermediate_size | 2 912 | 2 730 |
| vocab_size | 50 264 | 49 664 (custom) |
| max_position_embeddings | 2 048 | 2 048 |
| qk_norm | | ✅ |
| rope_theta | 10 000 | 1 000 000 |
| Tokenizer | Nano-nano v4 | Custom BPE |
| Chat format | ### Instruction | ChatML + <think> |
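
The move from 14 KV heads to 8 shared KV heads (GQA) mostly shows up in the size of the key/value cache at inference time. The arithmetic below is a back-of-the-envelope sketch that uses only the v4.7 values from the table. The helper name `kv_cache_bytes` and the fp16 cache (2 bytes per element) are assumptions for illustration, not taken from the model's code.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# v4.7 values from the table, at the 2 048-token config maximum.
gqa = kv_cache_bytes(num_layers=20, num_kv_heads=8, head_dim=64, seq_len=2048)
# Hypothetical comparison: the same model with one KV head per query head (no GQA).
mha = kv_cache_bytes(num_layers=20, num_kv_heads=16, head_dim=64, seq_len=2048)

queries_per_kv_head = 16 // 8  # each KV head is shared by 2 query heads

print(f"GQA cache: {gqa / 2**20:.0f} MiB")
print(f"MHA cache: {mha / 2**20:.0f} MiB")
print(f"query heads per KV head: {queries_per_kv_head}")
```

Under these assumptions, halving the number of KV heads halves the cache, which matters on an 8 GB card.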

🧩 Custom Tokenizer

Nano-nano v4.7 ships with a byte-level BPE tokenizer trained on the actual training corpus.

  • Vocab size: 49 664 (minimum 49 529, padded up to a multiple of 128; 49 664 = 388 × 128)
  • Byte-level: any Unicode text can be encoded without falling back to <unk>
  • ChatML specials included in the vocabulary from the start (not added afterwards): <unk> <s> </s> <pad> <|im_start|> <|im_end|> <|system|> <|user|> <|assistant|> <think> </think>
  • Jinja2 chat template set for apply_chat_template() compatibility

Load with:

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")
```

💭 Thinking / Chain-of-Thought

v4.7 is the first Nano-nano model with native thinking support.

The <think> and </think> tokens are part of the tokenizer vocabulary from the start (indices 9 & 10), so BPE never splits them.

Generation format:

```
<|im_start|>user
What is 17 × 13?<|im_end|>
<|im_start|>assistant
<think>
17 × 13 = 17 × 10 + 17 × 3 = 170 + 51 = 221
</think>
221<|im_end|>
```

Inference frameworks that begin the assistant turn with <|im_start|>assistant\n<think>\n prompt the model to reason before answering.
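
For frameworks that do not use the bundled Jinja2 template, the generation format can be assembled by hand. This is a minimal sketch that assumes the layout matches the example in this section exactly (one newline after each role name and after each <|im_end|>). The function name `build_prompt` is hypothetical, and the tokenizer's own chat template remains the authoritative source.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def build_prompt(messages: list[dict], think: bool = True) -> str:
    """Render messages as ChatML and open the assistant turn."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    # Open the assistant turn; optionally open a reasoning block as well.
    parts.append(f"{IM_START}assistant\n")
    if think:
        parts.append("<think>\n")
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "What is 17 × 13?"}])
print(prompt)
```

Passing `think=False` leaves the assistant turn open without a reasoning block, so the model answers directly.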


🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/nano-nano_4.7",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")

def chat(prompt: str, think: bool = True, max_new_tokens: int = 512) -> str:
    messages = [{"role": "user", "content": prompt}]
    # Render the ChatML prompt and open the assistant turn.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    if think:
        text += "<think>\n"   # open reasoning block
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens, not the prompt.
    new_ids = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=False).strip()

# With thinking
print(chat("Solve: if 3x + 7 = 22 what is x?"))

# Without thinking
print(chat("Write a haiku about coding.", think=False))
```
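
Because `chat()` decodes with `skip_special_tokens=False`, the returned string still contains the reasoning, the `</think>` tag and the `<|im_end|>` marker. The helper below is a hypothetical post-processing sketch, not part of the model repository, that separates the reasoning from the final answer. It assumes the output follows the generation format shown earlier, and it treats output without a closing tag as a plain answer.

```python
def split_thinking(generated: str) -> tuple[str, str]:
    """Return (reasoning, answer) from text generated for the assistant turn."""
    # Drop the end-of-turn marker and anything after it.
    text = generated.split("<|im_end|>", 1)[0].strip()
    # The opening tag is absent when the prompt already ended with "<think>\n".
    if text.startswith("<think>"):
        text = text[len("<think>"):]
    if "</think>" not in text:
        # No closing tag: think=False, or generation was cut off mid-reasoning.
        return "", text.strip()
    reasoning, answer = text.split("</think>", 1)
    return reasoning.strip(), answer.strip()

sample = "17 × 13 = 170 + 51 = 221\n</think>\n221<|im_end|>"
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```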

🍳 Training

Dataset Mix (43 datasets, quality-tiered)

| Tier | Dataset | Samples | Weight |
|---|---|---|---|
| 1 | Open-Orca/OpenOrca | 40 k | 3.0× |
| 1 | meta-math/MetaMathQA | 30 k | 2.8× |
| 1 | ray0rf1re/claude1255x9 | 10 k | 2.8× |
| 1 | Roman1111111/claude-opus-4.6-10000x | 10 k | 2.5× |
| 1 | WizardLM/WizardLM_evol_instruct_V2_196k | 25 k | 2.5× |
| 1 | WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K | 25 k | 2.5× |
| 1 | KingNish/reasoning-base-20k | 20 k | 2.4× |
| 1 | bespokelabs/Bespoke-Stratos-17k | 17 k | 2.3× |
| 1 | NovaSky-UC-Berkeley/Sky-T1_data_17k | 17 k | 2.3× |
| 1 | open-thoughts/OpenThoughts-TB-dev | 20 k | 2.3× |
| 1 | truthful_qa | 817 | 2.5× |
| 2 | microsoft/orca-math-word-problems-200k | 20 k | 2.2× |
| 2 | lighteval/MATH-Hard | 10 k | 2.2× |
| 2 | HuggingFaceH4/MATH-500 | 500 | 2.2× |
| 2 | ServiceNow-AI/R1-Distill-SFT | 15 k | 2.2× |
| 2 | open-r1/OpenR1-Math-220k | 12 k | 2.1× |
| 2 | garage-bAInd/Open-Platypus | 25 k | 2.0× |
| 2 | cognitivecomputations/dolphin-r1 | 6 k | 2.0× |
| 2 | teknium/OpenHermes-2.5 | 30 k | 2.0× |
| 3 | ise-uiuc/Magicoder-OSS-Instruct-75K | 20 k | 1.8× |
| 3 | m-a-p/CodeFeedback-Filtered-Instruction | 15 k | 1.8× |
| 3 | flytech/python-codes-25k | 20 k | 1.7× |
| 3 | iamtarun/python_code_instructions_18k_alpaca | 8 k | 1.6× |
| 3 | ByteDance-Seed/Code-Contests-Plus | 15 k | 1.6× |
| 3 | nvidia/OpenCodeInstruct | 20 k | 1.5× |
| 3 | ajibawa-2023/Code-74k-ShareGPT | 25 k | 1.6× |
| 3 | deepmind/code_contests | 8 k | 1.4× |
| 3 | b-mc2/sql-create-context | 6 k | 1.4× |
| 3 | jondurbin/airoboros-3.2 | 2 k | 1.5× |
| 4 | HuggingFaceH4/ultrachat_200k | 30 k | 1.5× |
| 4 | ray0rf1re/archlinux-v1 | 10 k | 2.0× |
| 4 | databricks/databricks-dolly-15k | 15 k | 1.2× |
| 4 | HuggingFaceH4/hhh_alignment | 10 k | 1.2× |
| 4 | Amod/mental_health_counseling_conversations | 5 k | 1.0× |
| 4 | mlabonne/guanaco-llama2-1k | 1 k | 1.0× |
| 5 | ray0rf1re/FineWeb-Nano | 20 k | 0.8× |
| 5 | fka/awesome-chatgpt-prompts | 5 k | 0.8× |
| 5 | ray0rf1re/AO3-2020 | 3 k | 0.6× |
| 5 | Abirate/english_quotes | 200 | 0.4× |
| 6 | Nix-ai/cat-math-v1 | 5 k | 0.3× |
| 6 | Nix-ai/Cat-v2.8XXXL-plus | 5 k | 0.3× |
| 6 | HuggingFaceFW/fineweb-edu | 5 | 1.0× |
| 6 | ray0rf1re/hyper-pip | 85 | 3.0× |
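
The card does not say how the Weight column is applied during training. One plausible reading, shown here purely as an assumption, is that each dataset's share of the mix is proportional to samples × weight. The sketch below computes those shares for four rows of the table; the function name `mix_shares` is hypothetical.

```python
# (dataset, samples, weight) for a few rows of the table above.
rows = [
    ("Open-Orca/OpenOrca", 40_000, 3.0),
    ("meta-math/MetaMathQA", 30_000, 2.8),
    ("HuggingFaceH4/ultrachat_200k", 30_000, 1.5),
    ("ray0rf1re/FineWeb-Nano", 20_000, 0.8),
]

def mix_shares(rows):
    """Share of each dataset, ASSUMING share is proportional to samples * weight."""
    mass = {name: n * w for name, n, w in rows}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}

shares = mix_shares(rows)
for name, share in shares.items():
    print(f"{name:32s} {share:6.1%}")
```

Under this reading, a weight above 1.0× oversamples a dataset relative to its raw size and a weight below 1.0× undersamples it.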

Settings

| Setting | Value |
|---|---|
| Hardware | GTX 1080 8 GB · Pascal · compute capability 6.1 |
| Precision | fp32 weights / fp16 AMP |
| Context (training) | 1 024 tokens |
| Context (inference) | Up to 2 048 tokens |
| Sequence packing | ✅ streaming BPE, 50k chunks |
| Optimizer | StovetopCooker (HyperNix, pre-Volta) |
| LR | 2e-4 cosine |
| Grad checkpointing | |
| Boost system | 2 main (75 steps) + 4 super (135 steps) + SOTFT |
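
Sequence packing means that tokenized samples are concatenated and cut into fixed-length blocks, so every position in a training sequence holds a real token instead of padding. The generator below is a minimal sketch of that idea, not the actual training pipeline. The function name `pack_sequences`, the separator id and the choice to drop the final partial block are all assumptions.

```python
from typing import Iterable, Iterator

def pack_sequences(token_streams: Iterable[list[int]], block_size: int,
                   eos_id: int) -> Iterator[list[int]]:
    """Concatenate tokenized samples and yield full blocks of block_size tokens."""
    buffer: list[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # separator so sample boundaries stay visible
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped (assumption).

samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # toy token ids
blocks = list(pack_sequences(samples, block_size=4, eos_id=2))  # eos_id=2 is a toy value
print(blocks)
```

With the training context from the table, `block_size` would be 1 024. Because the function consumes an iterable, it also works on a streamed dataset without loading everything into memory.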

⚠️ Limitations

  • Context limited to 1 024 tokens during training (2 048 at inference)
  • Pascal GPU (GTX 1080): fp16 AMP only, no bf16
  • Not RLHF/DPO aligned — outputs may vary in safety and tone
  • Reasoning quality depends on the quality of the thinking-style training data

📜 Citation

@misc{nano-nano-47,
  author       = {ray0rf1re},
  title        = {Nano-nano v4.7: Qwen3-style LM with Custom Tokenizer and Thinking},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {https://huggingface.co/ray0rf1re/nano-nano_4.7},
}