Instructions for using ray0rf1re/Nano-Nano_v5.1 with libraries and local apps.

Libraries

How to use ray0rf1re/Nano-Nano_v5.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ray0rf1re/Nano-Nano_v5.1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")
model = AutoModelForCausalLM.from_pretrained("ray0rf1re/Nano-Nano_v5.1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ray0rf1re/Nano-Nano_v5.1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ray0rf1re/Nano-Nano_v5.1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/Nano-Nano_v5.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
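
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes a server is already listening on `http://localhost:8000` as in the curl example (use port 30000 for the SGLang server described further down).

```python
import json
import urllib.request

MODEL_ID = "ray0rf1re/Nano-Nano_v5.1"


def build_chat_request(prompt: str,
                       model: str = MODEL_ID,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, once the server is running:
#   print(ask("What is the capital of France?"))
```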

SGLang

How to use ray0rf1re/Nano-Nano_v5.1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ray0rf1re/Nano-Nano_v5.1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/Nano-Nano_v5.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ray0rf1re/Nano-Nano_v5.1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ray0rf1re/Nano-Nano_v5.1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ray0rf1re/Nano-Nano_v5.1 with Docker Model Runner:
```
docker model run hf.co/ray0rf1re/Nano-Nano_v5.1
```

🧠 Nano-Nano v5.1

~1218.3 M · Qwen3 · 300M · GQA + QK-Norm · Sequence-Packed · 26 Datasets

Fully redesigned successor to Nano-Nano v4.5.
~298M Qwen3 parameters trained with sequence packing on a quality-tiered 26-dataset mix. Features a loss-boost system that automatically extends training if the loss is still above 4.95 (up to 3 rounds of 75 steps).
Goal: loss < 2.5 through compute efficiency, not raw scale.
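
The loss-boost rule described above can be sketched as a small control loop. This is an illustration under assumptions, not the actual training script: `run_steps` is a hypothetical callable that trains for a given number of steps and returns the current loss, and the check is assumed to happen once per round.

```python
from typing import Callable, Tuple


def train_with_loss_boost(
    run_steps: Callable[[int], float],  # hypothetical: train n steps, return current loss
    base_steps: int,
    threshold: float = 4.95,
    max_rounds: int = 3,
    steps_per_round: int = 75,
) -> Tuple[float, int]:
    """Run the base schedule, then add extra rounds while the loss stays above the threshold."""
    loss = run_steps(base_steps)
    total_steps = base_steps
    rounds = 0
    while loss > threshold and rounds < max_rounds:
        loss = run_steps(steps_per_round)
        total_steps += steps_per_round
        rounds += 1
    return loss, total_steps


def fake_trainer(losses):
    """Toy stand-in for a trainer: returns the next canned loss on each call."""
    it = iter(losses)
    return lambda n_steps: next(it)


# Base run ends at 5.4, so two extra rounds are needed to get under 4.95.
print(train_with_loss_boost(fake_trainer([5.4, 5.1, 4.8]), base_steps=1000))
```

With the default limits, the extension is capped at 3 × 75 = 225 extra steps regardless of the final loss.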

📋 Summary


Architecture	Qwen3 decoder-only
Parameters	~1218.3 M
Context	2 048 tokens
Vocabulary	50,304 tokens
Training loss	`2.0444`
Eval score	`16.7%`
Tokens trained	0.01 B (sequence-packed)
Hardware	GTX 1080 8 GB (Pascal)

🏗️ Architecture (v4 → v4.5 → v5.1)

Hyperparameter	v4	v4.5	v5.1
Parameters	~236 M	~256 M	~1218.3 M (~1.218 B)
`hidden_size`	896	896	1 024
`intermediate_size`	2 688	2 912	2 730 (8/3×hidden)
`num_hidden_layers`	14	15	16
`num_attention_heads`	14	14	16
`num_key_value_heads`	14	14	16
`head_dim`	64	64	64
`vocab_size`	50 264	50 264	50,304
`max_position_embeddings`	1 024	2 048	2 048
`rms_norm_eps`	1e-6	1e-6	1e-5
`rope_theta`	10 000	10 000	10 000
`rope_scaling`	—	linear 2×	linear 2×
`tie_word_embeddings`	False	False	False
Sequence packing	❌	❌	✅ 1× packed
Architecture	LLaMA	LLaMA	Qwen3
GQA (KV heads)	14 full	14 full	8 (16Q/8KV)
QK-Norm	❌	❌	✅
rope_theta	10k	10k	1M
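
As a rough cross-check, the v5.1 column can be turned into a parameter estimate. The sketch below assumes the usual Qwen3-style layout in Transformers (bias-free q/k/v/o projections, a three-matrix gated MLP, RMSNorm weights, and QK-Norm weights of size `head_dim`); the function name `decoder_param_count` is hypothetical. Because the table lists both 16 and 8 KV heads for v5.1, both are computed.

```python
def decoder_param_count(
    hidden: int = 1024,
    intermediate: int = 2730,
    layers: int = 16,
    q_heads: int = 16,
    kv_heads: int = 8,
    head_dim: int = 64,
    vocab: int = 50304,
    tied_embeddings: bool = False,
) -> int:
    """Count weights for a Qwen3-style decoder (assumed layout, no biases)."""
    q_out = q_heads * head_dim
    kv_out = kv_heads * head_dim
    attention = hidden * q_out           # q_proj
    attention += 2 * hidden * kv_out     # k_proj and v_proj
    attention += q_out * hidden          # o_proj
    attention += 2 * head_dim            # QK-Norm weights (q_norm, k_norm)
    mlp = 3 * hidden * intermediate      # gate_proj, up_proj, down_proj
    layer_norms = 2 * hidden             # two RMSNorm weights per layer
    per_layer = attention + mlp + layer_norms
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * per_layer + embeddings + hidden  # last term: final RMSNorm


gqa_total = decoder_param_count(kv_heads=8)
mha_total = decoder_param_count(kv_heads=16)
print(f"8 KV heads:  {gqa_total / 1e6:.1f} M")   # 287.6 M
print(f"16 KV heads: {mha_total / 1e6:.1f} M")   # 304.4 M
```

Under these assumptions the listed hyperparameters land near the ~298M figure quoted in the introduction; the sketch does not account for the ~1218.3 M figure reported elsewhere on this card.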

📊 Evaluation

Category	Hits	Score
Knowledge	0/5	🔴 0%
Reasoning	0/4	🔴 0%
Hallucination	0/4	🔴 0%
Instruction	2/4	🟡 50%
Coherence	1/3	🔴 33%
Overall	—	🔴 17%

Hallucination resistance tests whether the model correctly declines or hedges on unanswerable questions (future events, fictional entities, impossible premises).
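
The Overall row is consistent with an unweighted mean of the five category scores rather than a pooled hit rate. This is an inference from the numbers in the table, not something the card states; a short check:

```python
# (hits, total) per category, copied from the table above
hits = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}

category_scores = {name: h / n for name, (h, n) in hits.items()}
macro_average = sum(category_scores.values()) / len(category_scores)
pooled = sum(h for h, _ in hits.values()) / sum(n for _, n in hits.values())

print(f"macro average:   {macro_average:.1%}")  # 16.7%
print(f"pooled hit rate: {pooled:.1%}")         # 15.0%
```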

🍳 Training

What's new in v5.1

Change	v4.5	v5.1	Why
Sequence packing	❌ padding waste	✅ 100% tokens	~3× more signal per step
Dataset quality	mixed web+instruction	GPT-4 quality-tiered	Faster loss reduction
Parameters	~256 M	~1218.3 M (~1.218 B)	Better capacity
Datasets	15	26	More diversity
LR	1e-4	2e-4	1e-4 was too conservative
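
Sequence packing, in its simplest form, concatenates tokenized samples into one stream and cuts it into fixed-length chunks so that no position is spent on padding. The sketch below is a generic greedy packer, not the card's actual pipeline: the function name `pack_sequences`, the EOS separator, and dropping the trailing partial chunk are all assumptions. The chunk length of 512 and the 150,000-chunk cap are taken from the Settings table.

```python
from typing import Iterable, Iterator, List, Optional


def pack_sequences(
    docs: Iterable[List[int]],
    chunk_len: int = 512,
    eos_id: int = 0,
    max_chunks: Optional[int] = 150_000,
) -> Iterator[List[int]]:
    """Stream token lists into fixed-length chunks with an EOS token between documents."""
    buffer: List[int] = []
    emitted = 0
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
            emitted += 1
            if max_chunks is not None and emitted >= max_chunks:
                return
    # Any leftover partial chunk is dropped here (an assumption of this sketch).


# Toy example with chunk_len=4 so the result is easy to read
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(toy_docs, chunk_len=4, eos_id=0))
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that this simple version lets attention cross document boundaries inside a chunk; whether the original training masks those boundaries is not stated.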

Settings

Setting	Value
Hardware	GTX 1080 8 GB · Pascal · compute capability 6.1
Precision	fp32 weights / fp16 AMP (GradScaler)
Optimizer	StovetopCooker (HyperNix, pre-Volta) + cosine
LR	`0.0002` cosine
Warmup	8%
Embedding freeze	First 20% of steps
Effective batch	8 × 512 = 4,096 tokens/step
Loss boost	≤3 rounds of 75 steps if loss > 4.95
Sequence packing	✅ streaming, 1 epoch, 150,000-chunk cap
Grad clipping	5.0
Grad checkpointing	✅
Peak VRAM	5.44 GB
Final loss	`2.0444`
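
The learning-rate and embedding-freeze settings can be sketched as two step-dependent functions. This is an illustration only: the linear warmup shape, decay to zero, and the names `lr_at` and `embeddings_frozen` are assumptions, and the step count is a rough estimate because the card rounds the token total to 0.01 B.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_frac: float = 0.08, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def embeddings_frozen(step: int, total_steps: int, freeze_frac: float = 0.20) -> bool:
    """True while the embedding weights should stay frozen (first 20% of steps)."""
    return step < round(total_steps * freeze_frac)


tokens_per_step = 8 * 512                        # effective batch from the table
approx_steps = 10_000_000 // tokens_per_step     # ~0.01 B tokens, rough estimate
print(tokens_per_step, approx_steps)

for step in (0, 79, 80, 500, 1000):
    print(step, f"{lr_at(step, 1000):.2e}", embeddings_frozen(step, 1000))
```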

Dataset Mix (26 datasets, quality-tiered)

Tier	Dataset	Samples	Weight	Category
1	`Open-Orca/OpenOrca`	40 k	3.0×	GPT-4 reasoning
1	`meta-math/MetaMathQA`	30 k	2.8×	Math augmentation
1	`Roman1111111/claude-opus-4.6-10000x`	10 k	2.5×	Claude conversations
1	`WizardLM/WizardLM_evol_instruct_V2_196k`	25 k	2.5×	Evolved instruction
1	`WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K`	25 k	2.5×	Reasoning traces
1	`open-thoughts/OpenThoughts-TB-dev`	20 k	2.3×	Verified thinking traces
2	`microsoft/orca-math-word-problems-200k`	20 k	2.2×	Math word problems
2	`lighteval/MATH-Hard`	10 k	2.2×	Hard math
2	`HuggingFaceH4/MATH-500`	500	2.2×	Competition math
2	`garage-bAInd/Open-Platypus`	25 k	2.0×	Reasoning instruction
2	`teknium/OpenHermes-2.5`	30 k	2.0×	GPT-4 instruction
3	`ise-uiuc/Magicoder-OSS-Instruct-75K`	20 k	1.8×	Code instruction
3	`m-a-p/CodeFeedback-Filtered-Instruction`	15 k	1.8×	Code + feedback
3	`flytech/python-codes-25k`	20 k	1.7×	Python code solutions
3	`iamtarun/python_code_instructions_18k_alpaca`	8 k	1.6×	Python code
3	`ByteDance-Seed/Code-Contests-Plus`	15 k	1.6×	Competitive coding
3	`nvidia/OpenCodeInstruct`	20 k	1.5×	Code instruction
3	`b-mc2/sql-create-context`	6 k	1.4×	SQL generation
4	`HuggingFaceH4/ultrachat_200k`	30 k	1.5×	Multi-turn chat
4	`databricks/databricks-dolly-15k`	15 k	1.2×	Instruction following
4	`Amod/mental_health_counseling_conversations`	5 k	1.0×	Counseling chat
4	`mlabonne/guanaco-llama2-1k`	1 k	1.0×	General QA
5	`ray0rf1re/FineWeb-Nano`	20 k	0.8×	Web text
5	`ray0rf1re/hyper-pip`	85	3.0×	HyperNix pip data
6	`Nix-ai/cat-math-v1`	5 k	0.3×	Cat math (niche)
6	`Nix-ai/Cat-v2.8XXXL-plus`	5 k	0.3×	Cat general (niche)

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-Nano_v5.1", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # <think> opens the reasoning block; the model emits its reasoning, then </think>, then the answer
    text = ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n")
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.7, top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()

print(chat("Write a Python function to merge two sorted lists."))
print(chat("Solve: if 3x + 7 = 22, what is x?"))
print(chat("Explain transformer attention in simple terms."))
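
Since the prompt opens a `<think>` block, the returned text contains the reasoning followed by the answer. A minimal sketch for separating the two is shown below; the helper name `split_reasoning` is hypothetical, and it assumes `</think>` survives decoding (if the tokenizer registers it as a special token, `skip_special_tokens=True` would strip it and the split would find nothing).

```python
from typing import Tuple


def split_reasoning(generated: str, close_tag: str = "</think>") -> Tuple[str, str]:
    """Split generated text into (reasoning, answer) at the first closing tag.

    If the tag is missing, the model never closed its reasoning block
    (for example because max_new_tokens ran out), so the answer is empty.
    """
    reasoning, sep, answer = generated.partition(close_tag)
    if not sep:
        return generated.strip(), ""
    return reasoning.strip(), answer.strip()


example = "3x = 15, so x = 5.\n</think>\nx = 5"
print(split_reasoning(example))  # ('3x = 15, so x = 5.', 'x = 5')
```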

⚠️ Limitations

Context limited to 2 048 tokens
Trained on 0.01 B tokens — far below production scale
Pascal GPU (GTX 1080): fp16 AMP only, no bf16
Not RLHF/DPO aligned

📜 Citation

@misc{nano-nano-v5,
  author       = {ray0rf1re},
  title        = {Nano-Nano v5.1: 300M LLaMA with Sequence Packing},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {https://huggingface.co/ray0rf1re/Nano-Nano_v5.1},
}


Safetensors

Model size: 1B params
Tensor type: F16


Evaluation results (self-reported)

Metric	Value
Training Loss	2.044
Overall Eval Score	0.167
Knowledge	0.000
Reasoning	0.000
Hallucination Resistance	0.000
Instruction Following	0.500
Coherence	0.333