Instructions for using respinosamena/Helios-Nova-306M-Instruct-2606 with libraries and local apps.

Libraries

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="respinosamena/Helios-Nova-306M-Instruct-2606")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("respinosamena/Helios-Nova-306M-Instruct-2606", dtype="auto")

llama-cpp-python

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="respinosamena/Helios-Nova-306M-Instruct-2606",
	filename="Helios-Nova-306M-Instruct-2606-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

Use Docker

docker model run hf.co/respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

vLLM

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "respinosamena/Helios-Nova-306M-Instruct-2606"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "respinosamena/Helios-Nova-306M-Instruct-2606",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
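
The same request can also be built from Python. The sketch below is a minimal illustration using only the standard library. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the response layout (`choices[0].message.content`) is the usual OpenAI-style shape that an OpenAI-compatible server is assumed to return.

```python
import json
import urllib.request

MODEL_ID = "respinosamena/Helios-Nova-306M-Instruct-2606"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style JSON response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

With a server running, passing the result of `build_chat_request(...)` to `urllib.request.urlopen` sends the request, and the bytes returned by `.read()` can be handed to `extract_reply`. Changing `base_url` to `http://localhost:30000` targets a server on the SGLang port used further down.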

Use Docker

docker model run hf.co/respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

SGLang

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "respinosamena/Helios-Nova-306M-Instruct-2606" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "respinosamena/Helios-Nova-306M-Instruct-2606",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "respinosamena/Helios-Nova-306M-Instruct-2606" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "respinosamena/Helios-Nova-306M-Instruct-2606",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use respinosamena/Helios-Nova-306M-Instruct-2606 with Ollama:
```
ollama run hf.co/respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
```

Unsloth Studio

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for respinosamena/Helios-Nova-306M-Instruct-2606 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for respinosamena/Helios-Nova-306M-Instruct-2606 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for respinosamena/Helios-Nova-306M-Instruct-2606 to start chatting

Docker Model Runner
How to use respinosamena/Helios-Nova-306M-Instruct-2606 with Docker Model Runner:
```
docker model run hf.co/respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M
```

Lemonade

How to use respinosamena/Helios-Nova-306M-Instruct-2606 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull respinosamena/Helios-Nova-306M-Instruct-2606:Q4_K_M

Run and chat with the model

lemonade run user.Helios-Nova-306M-Instruct-2606-Q4_K_M

List all available models

lemonade list

Helios Nova 306M-Instruct-2606

Helios Nova 306M-Instruct-2606 is a 306M-parameter, dense, decoder-only language model for instruction following and conversation. It is the reinforcement-learning-aligned release in the Helios Nova family: a from-scratch base model, instruction-tuned with supervised fine-tuning, then improved with Group Relative Policy Optimization (GRPO) using verifiable, rule-based rewards.

The model was developed independently and end-to-end by a single author, covering architecture, tokenizer, pre-training, post-training, and evaluation. It was designed to study capability per unit of compute at small scale, and in particular how much sub-billion-parameter quality comes from architecture and data quality rather than from data volume alone.

With roughly 80× less pre-training data (50B versus ~4T tokens), Helios Nova reaches 96% of SmolLM2-360M's score on commonsense reasoning (Winogrande + PIQA), measured on an identical evaluation harness. The base model was pre-trained on 50B tokens on a single NVIDIA H100 for under USD 190 of compute.

The model is distributed both as GGUF quantizations (for llama.cpp: CUDA, Apple Metal, Vulkan, or CPU) and as full-precision safetensors (for PyTorch). Reference chat clients are provided in the companion GitHub repository.

Highlights

306M dense decoder, custom architecture and 16k tokenizer, built from scratch.
GRPO-aligned: constraint-following pass-rate improved by 18.3 percentage points over the SFT baseline (39.1% to 57.4%) with no measurable capability regression.
Data-efficient: 96% of SmolLM2-360M commonsense reasoning at ~80× fewer pre-training tokens.
Low cost: base pre-training under USD 190 on a single H100; post-training on a single consumer iGPU.
Runs anywhere: pure-PyTorch path (any OS/CPU) and GGUF/llama.cpp path (CUDA / Metal / Vulkan / CPU).

Usage

The reference clients live in the GitHub repository and download these weights automatically on first run.

git clone https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct-2606.git
cd Helios-Nova-306M-Instruct-2606

PyTorch (any operating system, CPU or GPU, no system dependencies):

pip install -r requirements.txt
python chat.py

GGUF via llama.cpp (fastest; CUDA, Apple Metal, AMD/Intel Vulkan, or CPU):

# install llama.cpp once — macOS: `brew install llama.cpp`;
# otherwise download a release for your backend from github.com/ggml-org/llama.cpp/releases
python instruct_chat.py             # F16 (default, full quality)
python instruct_chat.py --model q8  # Q8_0, near-lossless, ~2x smaller
python instruct_chat.py --model q4  # Q4_K_M, smallest and fastest (CPU / edge)

Both clients apply the exact training chat template and stop sequences, so generation terminates cleanly at the end of each turn.

Files

| File | Size | Description |
| --- | --- | --- |
| `Helios-Nova-306M-Instruct-2606-F16.gguf` | 584 MB | Full precision (default) |
| `Helios-Nova-306M-Instruct-2606-Q8_0.gguf` | 311 MB | Near-lossless |
| `Helios-Nova-306M-Instruct-2606-Q4_K_M.gguf` | 179 MB | Smallest and fastest (CPU, edge) |
| `model.safetensors` (+ `config.json`, `HeliosNova.py`, tokenizer) | 645 MB | bf16 weights for PyTorch |

Model architecture

| Component | Value |
| --- | --- |
| Parameters | 305.8M (dense) |
| Layers / hidden size | 24 / 1024 (depth-over-width, following the MobileLLM finding for sub-500M models) |
| Attention | Grouped-Query Attention: 16 query heads, 4 key-value heads, head dimension 64 |
| Feed-forward | SwiGLU, intermediate size 3072 |
| Positional encoding | RoPE (theta 10,000) |
| Normalization | QK-Norm, RMSNorm (pre-norm) |
| Embeddings | Tied input/output embeddings |
| Tokenizer | Custom 16k BPE |
| Context length | 2048 tokens |
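
As a sanity check on the table, the parameter count can be reproduced from the listed dimensions. The sketch below rests on assumptions the card does not state: a vocabulary of exactly 16,000 tokens (the card says only "16k"), no bias terms, and one QK-Norm weight vector of head-dimension size each for queries and keys in every layer. Under those assumptions the total is 305,844,224, which rounds to the stated 305.8M. The second function estimates the key-value cache at full context, assuming 16-bit cache entries.

```python
# Dimensions from the architecture table. VOCAB is an assumption:
# the card says "16k", and exactly 16,000 is assumed here.
VOCAB = 16_000
HIDDEN = 1024
LAYERS = 24
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 4, 64
FFN = 3072
CONTEXT = 2048


def attention_params():
    """Q, K, V and output projections with grouped-query attention, no biases."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    return q_proj + kv_proj + o_proj


def swiglu_params():
    """Gate, up and down projections of a SwiGLU feed-forward block."""
    return 3 * HIDDEN * FFN


def norm_params_per_layer():
    """Two RMSNorm weights plus assumed QK-Norm weights (one each for Q and K)."""
    return 2 * HIDDEN + 2 * HEAD_DIM


def total_params():
    per_layer = attention_params() + swiglu_params() + norm_params_per_layer()
    embeddings = VOCAB * HIDDEN  # counted once because input/output are tied
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


def kv_cache_bytes(tokens=CONTEXT, bytes_per_value=2):
    """Keys and values for every layer and KV head (16-bit entries assumed)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens * bytes_per_value


print(f"{total_params() / 1e6:.1f}M parameters")
print(f"{kv_cache_bytes() / 2**20:.0f} MiB KV cache at {CONTEXT} tokens")
```

With 4 key-value heads instead of 16, the cache under these assumptions is a quarter of what full multi-head attention would need at the same head dimension.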

[Figure: Architecture diagram]

Training

Pre-training (base model)

The base model, Helios-Nova-306M, was pre-trained on 50B tokens of FineWeb-Edu on a single NVIDIA H100 in under 120 hours, for under USD 190. It uses a Warmup-Stable-Decay (WSD) learning-rate schedule with fused AdamW, bf16, and torch.compile. The validation loss decreases throughout the stable phase and drops sharply during the final decay.
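
A Warmup-Stable-Decay schedule can be sketched as a function of the training step. This is an illustration of the general schedule shape, not the author's configuration: the function name `wsd_lr`, the linear decay shape, and every hyperparameter value in the example are assumptions, since the card does not give them.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    Linear warmup to `peak_lr`, a constant plateau, then a linear decay to
    `min_lr` over the final `decay_steps` steps (decay shape is an assumption).
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical values, chosen only to show the three phases.
schedule = [
    wsd_lr(s, total_steps=1000, peak_lr=1e-3, warmup_steps=100, decay_steps=200)
    for s in range(1001)
]
```

The long plateau keeps the learning rate at its peak for most of training, which is consistent with the loss falling steadily in the stable phase and then dropping sharply once the decay begins.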

[Figures: Pre-training validation loss; Warmup-Stable-Decay schedule]

Post-training (this model)

The post-training pipeline — supervised fine-tuning, Direct Preference Optimization (DPO), and GRPO — was implemented from scratch in pure PyTorch and run on a single AMD Strix Halo iGPU (ROCm, gfx1151), without TRL or bitsandbytes.

Supervised fine-tuning. The model is fine-tuned on smol-smoltalk with prompt masking, so the loss is computed only on response tokens. At 306M parameters, multi-epoch SFT induces catastrophic forgetting of base knowledge, so training is stopped at approximately 0.5 epochs, the point that balances instruction-following against retained general knowledge.
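
Prompt masking can be illustrated with plain token-id lists. This is a minimal sketch, not the author's data pipeline: the helper names are hypothetical, and the ignore value of -100 is assumed because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed: default ignore_index of PyTorch cross-entropy


def build_sft_example(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and response, masking prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def supervised_token_count(labels, ignore_index=IGNORE_INDEX):
    """Number of positions that actually contribute to the loss."""
    return sum(1 for token in labels if token != ignore_index)


# Toy ids: three prompt tokens followed by two response tokens.
input_ids, labels = build_sft_example([5, 6, 7], [8, 9])
```

Here only the two response tokens carry a training signal. In a next-token setup the labels are additionally shifted by one position relative to the logits before the loss is computed, and that step is left out of this sketch.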

[Figure: Catastrophic forgetting trade-off]

Preference optimization. On-policy DPO preserved benchmark accuracy but did not improve held-out generation quality, because at this scale self-sampled candidates carry a weak preference signal. The objective was therefore changed to GRPO with verifiable, rule-based rewards (programmatically checkable instructions), which targets a capability the model can reliably improve. Constraint-following pass-rate rises smoothly during training while the KL divergence from the reference policy stays bounded.
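
The idea of verifiable rewards with group-relative advantages can be sketched in a few lines. This is not the author's implementation. The two constraint checks are hypothetical examples of programmatically checkable instructions, and the normalization (reward minus group mean, divided by group standard deviation) is the commonly described GRPO formulation, assumed here because the card does not give details.

```python
import statistics


def reward_max_words(text, limit):
    """1.0 if the reply stays within `limit` words, else 0.0."""
    return 1.0 if len(text.split()) <= limit else 0.0


def reward_contains(text, keyword):
    """1.0 if the reply mentions `keyword` (case-insensitive), else 0.0."""
    return 1.0 if keyword.lower() in text.lower() else 0.0


def score(text, checks):
    """Fraction of rule-based constraints the reply satisfies."""
    return sum(check(text) for check in checks) / len(checks)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical prompt: "Name the capital of France in at most five words."
checks = [
    lambda t: reward_max_words(t, 5),
    lambda t: reward_contains(t, "Paris"),
]
samples = [
    "Paris is the capital.",
    "The capital of France is the city of Paris.",
    "I am not sure about that one, sorry.",
]
rewards = [score(s, checks) for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples that satisfy more constraints than their group average get a positive advantage and the rest a negative one. When every sample in a group scores the same, all advantages are zero and that prompt contributes no gradient signal.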

Evaluation

Base model: data efficiency

All models below were re-run through one identical lm-evaluation-harness configuration (0-shot), so the comparison is internally consistent; these figures therefore differ slightly from each model's published numbers.

[Figure: Capability versus pre-training token budget]

| Metric (0-shot) | Helios-306M (50B tok) | SmolLM2-360M (~4T) | Qwen2.5-0.5B (~18T) |
| --- | --- | --- | --- |
| Winogrande | 57.2 | 57.9 | 56.3 |
| PIQA | 68.1 | 72.6 | 70.6 |
| OpenBookQA | 34.4 | 37.6 | 35.4 |
| HellaSwag | 44.7 | 52.5 | 49.5 |
| ARC (avg) | 42.8 | 53.4 | 45.5 |
| MMLU | 24.3 | 25.3 | 47.6 |
| Commonsense reasoning (Winogrande + PIQA) | 62.65 | 65.25 | 63.45 |

Helios reaches 96.0% of SmolLM2-360M's commonsense-reasoning score (62.65 versus 65.25, the mean of Winogrande and PIQA) with roughly 80× less pre-training data, and nearly matches it on Winogrande (57.2 versus 57.9, or 99%). On MMLU, Helios scores 96% of SmolLM2-360M's result (24.3 versus 25.3); at this scale both sit near the 25% random-chance floor, so this indicates parity rather than mastery. The model trails on tasks bounded by data volume, namely broad factual recall (TriviaQA) and exam-style knowledge, where Qwen2.5-0.5B's much larger curated corpus is decisive. Helios Nova is data-efficient, not knowledge-rich.
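
The headline ratios follow directly from the table. The short calculation below reproduces them, using the approximate token budgets from the column headers (50B and ~4T) as inputs. The dictionary layout and function name are illustrative only.

```python
# Scores and approximate pre-training budgets copied from the table above.
scores = {
    "Helios-306M": {"winogrande": 57.2, "piqa": 68.1, "tokens": 50e9},
    "SmolLM2-360M": {"winogrande": 57.9, "piqa": 72.6, "tokens": 4e12},
}


def commonsense(model):
    """Mean of the Winogrande and PIQA scores for one model."""
    entry = scores[model]
    return (entry["winogrande"] + entry["piqa"]) / 2


helios = commonsense("Helios-306M")
smollm2 = commonsense("SmolLM2-360M")

quality_ratio = helios / smollm2
data_ratio = scores["SmolLM2-360M"]["tokens"] / scores["Helios-306M"]["tokens"]

print(f"commonsense: {helios:.2f} vs {smollm2:.2f} ({quality_ratio:.1%})")
print(f"pre-training data: {data_ratio:.0f}x less")
```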

[Figure: Full benchmark sweep]

Post-training: SFT to GRPO

Each checkpoint was evaluated on the same seeded harness across three axes: capability retention, constraint-following pass-rate, and pairwise generation win-rate.

| Stage | Capabilities (avg MC) | Constraint-following | Win-rate vs SFT |
| --- | --- | --- | --- |
| SFT (baseline) | 0.371 | 39.1% | n/a |
| GRPO (this model) | 0.371 | 57.4% (+18.3 pp) | 52.7% (no regression) |

[Figures: SFT versus GRPO; GRPO constraint-following during training]

Intended use and limitations

Helios Nova 306M-Instruct-2606 is suitable for general conversation, instruction following, commonsense reasoning, format- and constraint-following, and on-device or CPU inference. It is a strong base for further fine-tuning, quantization, and compression research.

It is not suitable as a source of factual knowledge. A 306M-parameter model trained on 50B tokens of educational text has limited world knowledge, and performs near chance on broad factual recall (TriviaQA) and exam-style benchmarks (MMLU). Outputs may be inaccurate or outdated and should be verified before use; the model is not appropriate for high-stakes decisions. The model is English-only.

The Helios Nova family

| Model | Description |
| --- | --- |
| Helios-Nova-306M | From-scratch base model (50B tokens) |
| Helios-Nova-306M-Instruct | Original SFT instruction model (PyTorch) |
| Helios-Nova-306M-Instruct-GGUF | GGUF build of the SFT instruction model |
| Helios-Nova-306M-Instruct-2606 (this model) | GRPO-aligned instruction model; GGUF and safetensors |

Citation

@misc{espinosamena2026heliosnova2606,
  title  = {Helios Nova 306M-Instruct-2606: data-efficient pre-training and verifiable-reward GRPO on a single iGPU},
  author = {Espinosa Mena, Rafael},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/respinosamena/Helios-Nova-306M-Instruct-2606}}
}

Contact

Rafael Espinosa Mena — rafaelespinosamena@gmail.com

License
