🧠 Nano-nano v4.7
~296M · Qwen3-style · Custom BPE Tokenizer · ChatML + Thinking · 43 Datasets
Successor to Nano-nano v4.6 — redesigned with a custom corpus-trained BPE tokenizer, native thinking / chain-of-thought support, and a quality-tiered 43-dataset mix with sequence packing for 100% token efficiency.
📋 Summary
| Property | Value |
|---|---|
| Architecture | Qwen3-style LLaMA decoder |
| Parameters | ~296 M |
| Context | 1 024 tokens (trained) / 2 048 (config max) |
| Tokenizer | Custom BPE, vocab = 49 664 |
| Chat format | ChatML with <think> reasoning |
| Hardware | NVIDIA GTX 1080 8 GB (Pascal) |
| Sequence packing | ✅ 100% token utilisation |
🏗️ Architecture
Qwen3-style decoder with GQA and QK-Norm, scaled to ~296 M parameters and paired with a custom 49 664-token tokenizer.
| Hyperparameter | v4.6 | v4.7 |
|---|---|---|
| Parameters | ~256 M | ~296 M |
| `hidden_size` | 896 | 1 024 |
| `num_hidden_layers` | 15 | 20 |
| `num_attention_heads` | 14 | 16 |
| `num_key_value_heads` | 14 | 8 (GQA) |
| `head_dim` | 64 | 64 |
| `intermediate_size` | 2 912 | 2 730 |
| `vocab_size` | 50 264 | 49 664 (custom) |
| `max_position_embeddings` | 2 048 | 2 048 |
| `qk_norm` | ❌ | ✅ |
| `rope_theta` | 10 000 | 1 000 000 |
| Tokenizer | Nano-nano v4 | Custom BPE |
| Chat format | `### Instruction` | ChatML + `<think>` |
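The drop from one key/value head per attention head to 8 shared key/value heads mostly pays off in KV-cache memory at inference time. The arithmetic can be sketched as follows, assuming an fp16 cache (2 bytes per element) and one K and one V tensor per layer. The helper `kv_cache_bytes` is a hypothetical name used for illustration, not part of any library, and the 16-head figure is a what-if comparison rather than a configuration the model ships with.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for one sequence (K and V for every layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# v4.7 values from the table; 2 048 is the configured maximum context.
gqa = kv_cache_bytes(num_layers=20, num_kv_heads=8, head_dim=64, seq_len=2048)
# Hypothetical comparison: the same model with one KV head per attention head.
mha = kv_cache_bytes(num_layers=20, num_kv_heads=16, head_dim=64, seq_len=2048)

print(f"GQA (8 KV heads):  {gqa / 2**20:.0f} MiB")
print(f"MHA (16 KV heads): {mha / 2**20:.0f} MiB")
```

Under these assumptions the grouped layout halves the cache, which matters on an 8 GB card.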
🧩 Custom Tokenizer
Nano-nano v4.7 ships with a byte-level BPE tokenizer trained on the actual training corpus.
- Vocab size: 49 664 (minimum 49 529, padded to ×128)
- Byte-level: zero `<unk>` tokens, since every Unicode character is representable
- ChatML specials baked in (not added after): `<unk>` `<s>` `</s>` `<pad>` `<|im_start|>` `<|im_end|>` `<|system|>` `<|user|>` `<|assistant|>` `<think>` `</think>`
- Jinja2 chat template set for `apply_chat_template()` compatibility
Load with:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")
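The byte-level claim in the list above can be illustrated without the tokenizer itself. The sketch below is not the Nano-nano tokenizer: it only shows the base alphabet a byte-level BPE starts from and omits the learned merges and the byte-to-printable-character mapping that real implementations use. The function name `to_byte_symbols` is made up for this example.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Map text to its UTF-8 bytes, the base alphabet of a byte-level BPE."""
    return list(text.encode("utf-8"))


sample = "naïve 日本語 🧠"
symbols = to_byte_symbols(sample)

# Every symbol falls in 0..255, so a 256-entry base vocabulary covers any input
# and no unknown-token fallback is needed.
in_range = all(0 <= b < 256 for b in symbols)

# The mapping is lossless: decoding the bytes restores the original string.
restored = bytes(symbols).decode("utf-8")

print("all symbols in 0..255:", in_range)
print("round-trip ok:", restored == sample)
```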
💭 Thinking / Chain-of-Thought
v4.7 is the first Nano-nano model with native thinking support.
The <think> and </think> tokens are part of the tokenizer vocabulary from the start
(indices 9 & 10), so BPE never splits them.
Generation format:
<|im_start|>user
What is 17 × 13?<|im_end|>
<|im_start|>assistant
<think>
17 × 13 = 17 × 10 + 17 × 3 = 170 + 51 = 221
</think>
221<|im_end|>
Inference frameworks that begin the assistant turn with <|im_start|>assistant\n<think>\n prompt the
model to reason before answering.
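For frameworks that take a raw prompt string, the layout above can be assembled by hand. This is a minimal sketch that assumes the bundled Jinja2 template renders exactly the format shown in the generation example; `build_chatml_prompt` is a hypothetical helper, and `apply_chat_template()` remains the authoritative source when it is available.

```python
def build_chatml_prompt(messages: list[dict], think: bool = True) -> str:
    """Render messages in the ChatML layout shown above (assumed template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    if think:
        parts.append("<think>\n")  # pre-open the reasoning block
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "What is 17 × 13?"}])
print(prompt)
```

Passing `think=False` leaves the assistant turn open without a reasoning block, so the model decides on its own whether to emit one.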
🚀 Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/nano-nano_4.7",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")

def chat(prompt: str, think: bool = True, max_new_tokens: int = 512) -> str:
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    if think:
        text += "<think>\n"  # open reasoning block
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_ids = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=False).strip()

# With thinking
print(chat("Solve: if 3x + 7 = 22 what is x?"))

# Without thinking
print(chat("Write a haiku about coding.", think=False))
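Because `chat()` decodes with `skip_special_tokens=False`, its return value still carries `</think>` and `<|im_end|>`. A small post-processing step can separate the reasoning from the final answer. The helper below is a sketch with a hypothetical name, `split_thinking`, and it assumes the output follows the generation format shown earlier; the `raw` string is a hand-written stand-in, not real model output.

```python
def split_thinking(generated: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    # Drop the end-of-turn marker and anything generated after it.
    text = generated.split("<|im_end|>", 1)[0]
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        # The opening tag is absent when it was part of the prompt.
        reasoning = reasoning.replace("<think>", "", 1)
    else:
        reasoning, answer = "", text
    return reasoning.strip(), answer.strip()


raw = "17 × 13 = 170 + 51 = 221\n</think>\n221<|im_end|>"
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

If generation stops at `max_new_tokens` before `</think>` is emitted, this sketch returns the unfinished reasoning as the answer, so callers may want to check for that case.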
🍳 Training
Dataset Mix (43 datasets, quality-tiered)
| Tier | Dataset | Samples | Weight |
|---|---|---|---|
| 1 | `Open-Orca/OpenOrca` | 40 k | 3.0× |
| 1 | `meta-math/MetaMathQA` | 30 k | 2.8× |
| 1 | `ray0rf1re/claude1255x9` | 10 k | 2.8× |
| 1 | `Roman1111111/claude-opus-4.6-10000x` | 10 k | 2.5× |
| 1 | `WizardLM/WizardLM_evol_instruct_V2_196k` | 25 k | 2.5× |
| 1 | `WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K` | 25 k | 2.5× |
| 1 | `KingNish/reasoning-base-20k` | 20 k | 2.4× |
| 1 | `bespokelabs/Bespoke-Stratos-17k` | 17 k | 2.3× |
| 1 | `NovaSky-UC-Berkeley/Sky-T1_data_17k` | 17 k | 2.3× |
| 1 | `open-thoughts/OpenThoughts-TB-dev` | 20 k | 2.3× |
| 1 | `truthful_qa` | 817 | 2.5× |
| 2 | `microsoft/orca-math-word-problems-200k` | 20 k | 2.2× |
| 2 | `lighteval/MATH-Hard` | 10 k | 2.2× |
| 2 | `HuggingFaceH4/MATH-500` | 500 | 2.2× |
| 2 | `ServiceNow-AI/R1-Distill-SFT` | 15 k | 2.2× |
| 2 | `open-r1/OpenR1-Math-220k` | 12 k | 2.1× |
| 2 | `garage-bAInd/Open-Platypus` | 25 k | 2.0× |
| 2 | `cognitivecomputations/dolphin-r1` | 6 k | 2.0× |
| 2 | `teknium/OpenHermes-2.5` | 30 k | 2.0× |
| 3 | `ise-uiuc/Magicoder-OSS-Instruct-75K` | 20 k | 1.8× |
| 3 | `m-a-p/CodeFeedback-Filtered-Instruction` | 15 k | 1.8× |
| 3 | `flytech/python-codes-25k` | 20 k | 1.7× |
| 3 | `iamtarun/python_code_instructions_18k_alpaca` | 8 k | 1.6× |
| 3 | `ByteDance-Seed/Code-Contests-Plus` | 15 k | 1.6× |
| 3 | `nvidia/OpenCodeInstruct` | 20 k | 1.5× |
| 3 | `ajibawa-2023/Code-74k-ShareGPT` | 25 k | 1.6× |
| 3 | `deepmind/code_contests` | 8 k | 1.4× |
| 3 | `b-mc2/sql-create-context` | 6 k | 1.4× |
| 3 | `jondurbin/airoboros-3.2` | 2 k | 1.5× |
| 4 | `HuggingFaceH4/ultrachat_200k` | 30 k | 1.5× |
| 4 | `ray0rf1re/archlinux-v1` | 10 k | 2.0× |
| 4 | `databricks/databricks-dolly-15k` | 15 k | 1.2× |
| 4 | `HuggingFaceH4/hhh_alignment` | 10 k | 1.2× |
| 4 | `Amod/mental_health_counseling_conversations` | 5 k | 1.0× |
| 4 | `mlabonne/guanaco-llama2-1k` | 1 k | 1.0× |
| 5 | `ray0rf1re/FineWeb-Nano` | 20 k | 0.8× |
| 5 | `fka/awesome-chatgpt-prompts` | 5 k | 0.8× |
| 5 | `ray0rf1re/AO3-2020` | 3 k | 0.6× |
| 5 | `Abirate/english_quotes` | 200 | 0.4× |
| 6 | `Nix-ai/cat-math-v1` | 5 k | 0.3× |
| 6 | `Nix-ai/Cat-v2.8XXXL-plus` | 5 k | 0.3× |
| 6 | `HuggingFaceFW/fineweb-edu` | 5 | 1.0× |
| 6 | `ray0rf1re/hyper-pip` | 85 | 3.0× |
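The card does not say how the Weight column enters the training loop. One plausible reading, shown here purely as an assumption, is that each dataset's share of the mix is proportional to samples × weight. The sketch uses four rows from the table, and `sampling_shares` is a hypothetical helper.

```python
# (samples, weight) for a few rows of the table above.
mix: dict[str, tuple[int, float]] = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "meta-math/MetaMathQA": (30_000, 2.8),
    "HuggingFaceH4/ultrachat_200k": (30_000, 1.5),
    "ray0rf1re/FineWeb-Nano": (20_000, 0.8),
}


def sampling_shares(mix: dict[str, tuple[int, float]]) -> dict[str, float]:
    """Share of each dataset, ASSUMING share is proportional to samples * weight."""
    mass = {name: n * w for name, (n, w) in mix.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


shares = sampling_shares(mix)
for name, share in shares.items():
    print(f"{name:32s} {share:6.1%}")
```

Under this reading a high-tier dataset can outweigh a lower-tier one of similar size several times over. The actual scheme may differ, for example by repeating samples instead of reweighting them.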
Settings
| Setting | Value |
|---|---|
| Hardware | GTX 1080 8 GB · Pascal · CUDA compute capability 6.1 |
| Precision | fp32 weights / fp16 AMP |
| Context (training) | 1 024 tokens |
| Context (inference) | Up to 2 048 tokens |
| Sequence packing | ✅ streaming BPE, 50k chunks |
| Optimizer | StovetopCooker (HyperNix, pre-Volta) |
| LR | 2e-4 cosine |
| Grad checkpointing | ✅ |
| Boost system | 2 main (75 steps) + 4 super (135 steps) + SOTFT |
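Sequence packing, listed in the settings above, is what lets every position in a training block hold a real token instead of padding. The training code itself is not part of this card, so the following is only a generic sketch of the technique: tokenized documents are concatenated with a separator and cut into fixed-size blocks. The name `pack_sequences`, the toy token ids, and `eos_id=2` are placeholders, and training would use a block size of 1 024 rather than 4.

```python
from typing import Iterable, Iterator


def pack_sequences(docs: Iterable[list[int]], block_size: int,
                   eos_id: int) -> Iterator[list[int]]:
    """Concatenate tokenized documents (separated by eos_id) and yield full blocks."""
    buffer: list[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped in this sketch.


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35]]
blocks = list(pack_sequences(docs, block_size=4, eos_id=2))
print(blocks)
```

Because the function is a generator, it works on a streamed corpus without holding more than one block's worth of overflow in memory.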
⚠️ Limitations
- Context limited to 1 024 tokens during training (2 048 at inference)
- Pascal GPU (GTX 1080): fp16 AMP only, no bf16
- Not RLHF/DPO aligned — outputs may vary in safety and tone
- Thinking quality depends on the quality of the reasoning traces in the training data
📜 Citation
@misc{nano-nano-47,
author = {ray0rf1re},
title = {Nano-nano v4.7: Qwen3-style LM with Custom Tokenizer and Thinking},
year = {2026},
publisher = {HuggingFace},
howpublished = {https://huggingface.co/ray0rf1re/nano-nano_4.7},
}