Instructions to use ray0rf1re/nano-nano_4.7 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use ray0rf1re/nano-nano_4.7 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ray0rf1re/nano-nano_4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")
model = AutoModelForCausalLM.from_pretrained("ray0rf1re/nano-nano_4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ray0rf1re/nano-nano_4.7 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ray0rf1re/nano-nano_4.7"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/nano-nano_4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
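
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="ray0rf1re/nano-nano_4.7"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(ask("What is the capital of France?"))
```

For the SGLang server described further down, passing `base_url="http://localhost:30000"` should be the only change needed, since both servers expose the same OpenAI-compatible route.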

Use Docker

docker model run hf.co/ray0rf1re/nano-nano_4.7

SGLang

How to use ray0rf1re/nano-nano_4.7 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ray0rf1re/nano-nano_4.7" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/nano-nano_4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ray0rf1re/nano-nano_4.7" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/nano-nano_4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ray0rf1re/nano-nano_4.7 with Docker Model Runner:
```
docker model run hf.co/ray0rf1re/nano-nano_4.7
```

🧠 Nano-nano v4.7

~296M · Qwen3-style · Custom BPE Tokenizer · ChatML + Thinking · 43 Datasets

Successor to Nano-nano v4.6, redesigned with a custom corpus-trained BPE tokenizer, native thinking / chain-of-thought support, and a quality-tiered 43-dataset mix. Training uses sequence packing, so batches contain no padding tokens (100% token utilisation).

📋 Summary

Property	Value
Architecture	Qwen3-style LLaMA decoder
Parameters	~296 M
Context	1 024 tokens (trained) / 2 048 (config max)
Tokenizer	Custom BPE, vocab = 49 664
Chat format	ChatML with `<think>` reasoning
Hardware	NVIDIA GTX 1080 8 GB (Pascal)
Sequence packing	✅ 100% token utilisation
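
To illustrate what sequence packing means here, the sketch below concatenates tokenized documents into one stream and cuts it into fixed-length blocks, so no position is spent on padding. It is an illustration of the general technique, not the training script used for this model. The function name `pack_sequences` is hypothetical, and `eos_id=2` assumes that `</s>` is the end-of-sequence token.

```python
from typing import Iterable, Iterator, List


def pack_sequences(token_streams: Iterable[List[int]],
                   block_size: int = 1024,
                   eos_id: int = 2) -> Iterator[List[int]]:
    """Yield fixed-length blocks built by concatenating tokenized documents.

    Each document is followed by eos_id so boundaries stay visible.
    A trailing partial block is dropped, so every yielded block is full.
    """
    buffer: List[int] = []
    for ids in token_streams:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


# Toy example with a tiny block size so the result is easy to inspect.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = list(pack_sequences(docs, block_size=4, eos_id=2))
print(blocks)  # [[10, 11, 12, 2], [20, 21, 2, 30], [31, 32, 33, 2]]
```

Note that a document can straddle two blocks (the third document above does), which is the usual trade-off of this simple form of packing.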

🏗️ Architecture

Qwen3-style decoder with GQA and QK-Norm, scaled for ~296 M parameters with a custom 49 664-entry tokenizer.

Hyperparameter	v4.6	v4.7
Parameters	~256 M	~296 M
`hidden_size`	896	1 024
`num_hidden_layers`	15	20
`num_attention_heads`	14	16
`num_key_value_heads`	14	8 (GQA)
`head_dim`	64	64
`intermediate_size`	2 912	2 730
`vocab_size`	50 264	49 664 (custom)
`max_position_embeddings`	2 048	2 048
`qk_norm`	❌	✅
`rope_theta`	10 000	1 000 000
Tokenizer	Nano-nano v4	Custom BPE
Chat format	`### Instruction`	ChatML + `<think>`
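
One practical effect of the GQA row in the table is a smaller key/value cache at inference. The sketch below estimates the cache size from the v4.7 values above; the function name `kv_cache_bytes` is illustrative, the fp16 cache (2 bytes per value) is an assumption, and the 16-KV-head variant is a hypothetical comparison point rather than a released model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    Two tensors (K and V) per layer, each seq_len x num_kv_heads x head_dim.
    bytes_per_value=2 assumes an fp16 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# v4.7 values from the table: 20 layers, 8 KV heads, head_dim 64.
gqa = kv_cache_bytes(20, 8, 64, seq_len=2048)
# Hypothetical variant with one KV head per query head (16), for comparison.
mha = kv_cache_bytes(20, 16, 64, seq_len=2048)

print(f"GQA (8 KV heads):  {gqa / 2**20:.0f} MiB")   # 80 MiB
print(f"MHA (16 KV heads): {mha / 2**20:.0f} MiB")   # 160 MiB
```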

🧩 Custom Tokenizer

Nano-nano v4.7 ships with a byte-level BPE tokenizer trained on the actual training corpus.

Vocab size: 49 664 (minimum 49 529, padded up to a multiple of 128)
Byte-level: no input ever maps to <unk>; every Unicode character is representable
ChatML specials are part of the trained vocabulary (not added afterwards): <unk> <s> </s> <pad> <|im_start|> <|im_end|> <|system|> <|user|> <|assistant|> <think> </think>
Jinja2 chat template set for apply_chat_template() compatibility
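
The vocabulary padding can be checked with a few lines of arithmetic. This is only a sketch: the card does not say how the final size was chosen, and `pad_to_multiple` is a hypothetical helper. It shows that 49 664 is a multiple of 128, and that it is also the value obtained by rounding 49 529 up to a multiple of 256.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


MIN_VOCAB = 49_529      # minimum vocabulary size quoted above
FINAL_VOCAB = 49_664    # padded size quoted above

print(FINAL_VOCAB % 128)                 # 0
print(pad_to_multiple(MIN_VOCAB, 128))   # 49536
print(pad_to_multiple(MIN_VOCAB, 256))   # 49664
```

Padding the embedding matrix to such multiples is a common choice because it tends to map better onto GPU matrix-multiply kernels.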

Load with:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")

💭 Thinking / Chain-of-Thought

v4.7 is the first Nano-nano model with native thinking support.

The <think> and </think> tokens are part of the tokenizer vocabulary from the start (token IDs 9 and 10), so BPE never splits them.

Generation format:

<|im_start|>user
What is 17 × 13?<|im_end|>
<|im_start|>assistant
<think>
17 × 13 = 17 × 10 + 17 × 3 = 170 + 51 = 221
</think>
221<|im_end|>

Starting the assistant turn with <|im_start|>assistant\n<think>\n prompts the model to reason before answering.
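
For frameworks that take a raw prompt string, the format above can be rendered by hand. The following is a minimal sketch that mirrors the example layout. `build_chatml_prompt` is a hypothetical helper, and the model's own Jinja2 template may differ in details (for example system-prompt handling), so `apply_chat_template()` remains the safer route when it is available.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]], think: bool = True) -> str:
    """Render messages as ChatML and open the assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    if think:
        # Pre-open the reasoning block so generation starts inside it.
        parts.append("<think>\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "What is 17 × 13?"}])
print(prompt)
```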

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/nano-nano_4.7",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/nano-nano_4.7")

def chat(prompt: str, think: bool = True, max_new_tokens: int = 512) -> str:
    """Generate one assistant reply; think=True pre-opens a <think> block."""
    messages = [{"role": "user", "content": prompt}]
    # Render the ChatML prompt as text, ending with the assistant header.
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    if think:
        text += "<think>\n"   # open reasoning block
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens (drop the prompt).
    new_ids = out[0][inputs["input_ids"].shape[-1]:]
    # Special tokens are kept so </think> and <|im_end|> stay visible.
    return tokenizer.decode(new_ids, skip_special_tokens=False).strip()

# With thinking
print(chat("Solve: if 3x + 7 = 22 what is x?"))

# Without thinking
print(chat("Write a haiku about coding.", think=False))
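
Because `chat()` decodes with `skip_special_tokens=False`, its return value still contains the reasoning text, the closing `</think>` tag, and `<|im_end|>`. The sketch below separates reasoning from the final answer. `split_reasoning` is a hypothetical helper, and it assumes the output follows the generation format shown earlier.

```python
from typing import Tuple


def split_reasoning(generated: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from text generated after the prompt."""
    text = generated.replace("<|im_end|>", "").strip()
    # The opening tag is usually part of the prompt, but tolerate it here too.
    if text.startswith("<think>"):
        text = text[len("<think>"):]
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        # No closing tag found: treat everything as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()


sample = "17 × 13 = 170 + 51 = 221\n</think>\n221<|im_end|>"
reasoning, answer = split_reasoning(sample)
print(answer)  # 221
```

If generation stops at `max_new_tokens` before `</think>` is produced, this helper returns the unfinished reasoning as the answer; callers that need to tell the two cases apart would have to check for the closing tag themselves.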

🍳 Training

Dataset Mix (43 datasets, quality-tiered)

Tier	Dataset	Samples	Weight
1	`Open-Orca/OpenOrca`	40 k	3.0×
1	`meta-math/MetaMathQA`	30 k	2.8×
1	`ray0rf1re/claude1255x9`	10 k	2.8×
1	`Roman1111111/claude-opus-4.6-10000x`	10 k	2.5×
1	`WizardLM/WizardLM_evol_instruct_V2_196k`	25 k	2.5×
1	`WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K`	25 k	2.5×
1	`KingNish/reasoning-base-20k`	20 k	2.4×
1	`bespokelabs/Bespoke-Stratos-17k`	17 k	2.3×
1	`NovaSky-UC-Berkeley/Sky-T1_data_17k`	17 k	2.3×
1	`open-thoughts/OpenThoughts-TB-dev`	20 k	2.3×
1	`truthful_qa`	817	2.5×
2	`microsoft/orca-math-word-problems-200k`	20 k	2.2×
2	`lighteval/MATH-Hard`	10 k	2.2×
2	`HuggingFaceH4/MATH-500`	500	2.2×
2	`ServiceNow-AI/R1-Distill-SFT`	15 k	2.2×
2	`open-r1/OpenR1-Math-220k`	12 k	2.1×
2	`garage-bAInd/Open-Platypus`	25 k	2.0×
2	`cognitivecomputations/dolphin-r1`	6 k	2.0×
2	`teknium/OpenHermes-2.5`	30 k	2.0×
3	`ise-uiuc/Magicoder-OSS-Instruct-75K`	20 k	1.8×
3	`m-a-p/CodeFeedback-Filtered-Instruction`	15 k	1.8×
3	`flytech/python-codes-25k`	20 k	1.7×
3	`iamtarun/python_code_instructions_18k_alpaca`	8 k	1.6×
3	`ByteDance-Seed/Code-Contests-Plus`	15 k	1.6×
3	`nvidia/OpenCodeInstruct`	20 k	1.5×
3	`ajibawa-2023/Code-74k-ShareGPT`	25 k	1.6×
3	`deepmind/code_contests`	8 k	1.4×
3	`b-mc2/sql-create-context`	6 k	1.4×
3	`jondurbin/airoboros-3.2`	2 k	1.5×
4	`HuggingFaceH4/ultrachat_200k`	30 k	1.5×
4	`ray0rf1re/archlinux-v1`	10 k	2.0×
4	`databricks/databricks-dolly-15k`	15 k	1.2×
4	`HuggingFaceH4/hhh_alignment`	10 k	1.2×
4	`Amod/mental_health_counseling_conversations`	5 k	1.0×
4	`mlabonne/guanaco-llama2-1k`	1 k	1.0×
5	`ray0rf1re/FineWeb-Nano`	20 k	0.8×
5	`fka/awesome-chatgpt-prompts`	5 k	0.8×
5	`ray0rf1re/AO3-2020`	3 k	0.6×
5	`Abirate/english_quotes`	200	0.4×
6	`Nix-ai/cat-math-v1`	5 k	0.3×
6	`Nix-ai/Cat-v2.8XXXL-plus`	5 k	0.3×
6	`HuggingFaceFW/fineweb-edu`	5	1.0×
6	`ray0rf1re/hyper-pip`	85	3.0×
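
The card does not spell out how the Weight column is applied. One plausible reading, sketched below purely as an assumption, is that each dataset's share of the mix is proportional to samples × weight. The helper name `mix_shares` is illustrative, and the three rows are copied from the table above.

```python
from typing import Dict, Tuple


def mix_shares(rows: Dict[str, Tuple[int, float]]) -> Dict[str, float]:
    """Map dataset name -> share of the mix, assuming mass = samples * weight."""
    mass = {name: n * w for name, (n, w) in rows.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


# Three rows from the table: (samples, weight).
rows = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "databricks/databricks-dolly-15k": (15_000, 1.2),
    "Abirate/english_quotes": (200, 0.4),
}
shares = mix_shares(rows)
for name, share in shares.items():
    print(f"{name:35s} {share:7.2%}")
```

Under this reading, a high-weight Tier 1 dataset dominates a small low-weight one by several orders of magnitude, which matches the stated intent of quality tiering.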

Settings

Setting	Value
Hardware	GTX 1080 8 GB · Pascal · CUDA compute capability 6.1
Precision	fp32 weights / fp16 AMP
Context (training)	1 024 tokens
Context (inference)	Up to 2 048 tokens
Sequence packing	✅ streaming BPE, 50k chunks
Optimizer	StovetopCooker (HyperNix, pre-Volta)
LR	2e-4 cosine
Grad checkpointing	✅
Boost system	2 main (75 steps) + 4 super (135 steps) + SOTFT
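
The "2e-4 cosine" learning-rate entry can be pictured with a plain cosine decay. This is a sketch under assumptions: the card gives only the peak value and the schedule family, so the absence of warmup, the floor of 0, and the step count of 1000 are all hypothetical, as is the function name `cosine_lr`.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total):.2e}")
```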

⚠️ Limitations

Trained on sequences of at most 1 024 tokens; the config allows up to 2 048 positions at inference
Trained on a Pascal GPU (GTX 1080): fp16 AMP only, no bf16
Not RLHF/DPO aligned, so outputs may vary in safety and tone
Thinking quality depends on the quality of the reasoning data in the training mix

📜 Citation

@misc{nano-nano-47,
  author       = {ray0rf1re},
  title        = {Nano-nano v4.7: Qwen3-style LM with Custom Tokenizer and Thinking},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {https://huggingface.co/ray0rf1re/nano-nano_4.7},
}

